creteria for opting Hashed file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
mahi_mahi
Charter Member
Charter Member
Posts: 45
Joined: Mon Aug 01, 2005 10:02 am

creteria for opting Hashed file

Post by mahi_mahi »

Hi all,
My requirement is as follows.

There is a query Q1(say) with some set of tables which has 2 filter conditions.
another query Q2 on the same set of tables with one filter condition.(this filter creteria is same in both the queries)

I have to compare both the query result sets.
I feel the query which returns less rows i.e Q1 has to be used as a lookup beacuse of performance reasons.

Please suggest me which Query has to loaded onto the hashed file for lookup.

Let em know if u need any more details.
Last edited by mahi_mahi on Tue Jun 20, 2006 7:50 am, edited 1 time in total.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If you have two data sets that either can be the driver, but the goal is the INTERSECTION of the two (like an INNER join), then consider these scenarios:

#1 Set A has fewer columns (or average characters per row) than B but more rows than B

You want to use the A set as the reference and the B dataset as the driver because referencing a wider hashed file row is probably slower than reading one off a sequential input. The job will reference only a subset of the rows in A, therefore, you avoid unnecessary hashed retrieval of characters of data.

#2 Set A has more columns (or average characters per row) than B and more rows than B, but A does not have exorbitantly more rows than B.

Since the intersection is all that's required, use B as the driver. This means that you only reference the necessary rows from A and therefore only pull out of the hashed file the minimum number of characters necessary.

#3 Set A has more columns (or average characters per row) than B and more rows than B AND A has exorbitantly more rows than B.

Drive from A, because you're going to have a faster reference against B even though throwing away the A rows most of the time. This is fairly fast, but consider that making A the reference will probably incur more time building the hashed file than you gain only referencing a small subset.



The best method for hashed files is to look for opportunities to keep the hashed file small and only containing the minimum rows necessary.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mahi_mahi
Charter Member
Charter Member
Posts: 45
Joined: Mon Aug 01, 2005 10:02 am

Post by mahi_mahi »

Thanks a lot for the quick response.
I have DatasetA(C1, C2, C3) and very less number of rows than DatasetB(C2, C3)
So I hope A has to be used as a lookup as per my understanding.
Correct me if I am wrong.

I have C1, C2 C3 columns in Set A and C2, C3 in set B.
I have to compare C3 from both the datasets. C1 iss the key column in Set A. But if I have to compare C3, I have to make C3 as key in the hashed file. But if I take C3 as key column I may end up in removing duplicate values for C3 which I dont want to happen.

If I take both C1 and C3 as key columns in the hashed file A, then Can I retain the duplicate values for C3 because the composite key has to be unique.
e.g The data looks like the below:
121, 'xyz'
122, 'abc'
123, 'xyz'

I need both 'xyz's to come into the hashed file.

If a column is made primary key, is it mandatory that I have to have a comparision for the column in the transformer.
What shall I substitute for C1 in the transformer if it is taken as key.
Please suggest me how it works.
mahi_mahi
Charter Member
Charter Member
Posts: 45
Joined: Mon Aug 01, 2005 10:02 am

Post by mahi_mahi »

Can somebody please help me out in this regard.
My question is if there are 2 key columns in a hashed file, do we need to have join condition for both of them.

The situation is i have 2 keys. but I have only one column to compare.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

A hashed file key is unique. If you declared 2 key columns then, in order to do a lookup, both of those columns must contain values. So if you have 2 key columns you need 2 values in order to read a record.
mahi_mahi
Charter Member
Charter Member
Posts: 45
Joined: Mon Aug 01, 2005 10:02 am

Post by mahi_mahi »

Is it that both the key columns are not considered as a composit key?
Are they considered keys seperately?

For value in Column C1 there are duplicate values coming in for C3. I need to have all the values of C3 in the hashed file.
I mean the combination is unique.

In that case how should I proceed?

I have C3 coming in from source which I can do a compare with the Column in Hashed file. But there's no way to compare with C1 of the hashed file because there's no C1 in the source.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

In this case use date set that only has one key defined and use that as a hashed file and use the other dataset with two keys defined as the driver.
By reading through your other posts i believe you have a choice in which data set can to use for lookup. Hence, this is the only other way i see.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
mahi_mahi
Charter Member
Charter Member
Posts: 45
Joined: Mon Aug 01, 2005 10:02 am

Post by mahi_mahi »

DSguru2B wrote:In this case use date set that only has one key defined and use that as a hashed file and use the other dataset with two keys defined as the driver.
By reading through your other posts i believe you have a choice in which data set can to use for lookup. Hence, this is the only other way i see.
But if I go with Dataset having one key column as hashed file then The hashed file has to accomidate many rows when compared to the previous case which in turn may reduce the performance.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Why dont you actually run it and see how much time its taking. Maybe it wont take that much time. Also, build the hashed file with only the columns you require.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
cconley
Premium Member
Premium Member
Posts: 4
Joined: Fri Nov 12, 2004 12:21 am
Contact:

Post by cconley »

mahi_mahi wrote:Thanks a lot for the quick response.
I have DatasetA(C1, C2, C3) and very less number of rows than DatasetB(C2, C3)
So I hope A has to be used as a lookup as per my understanding.
Correct me if I am wrong.

I have C1, C2 C3 columns in Set A and C2, C3 in set B.
I have to compare C3 from both the datasets. C1 iss the key column in Set A. But if I have to compare C3, I have to make C3 as key in the hashed file. But if I take C3 as key column I may end up in removing duplicate values for C3 which I dont want to happen.

If I take both C1 and C3 as key columns in the hashed file A, then Can I retain the duplicate values for C3 because the composite key has to be unique.
e.g The data looks like the below:
121, 'xyz'
122, 'abc'
123, 'xyz'

I need both 'xyz's to come into the hashed file.

If a column is made primary key, is it mandatory that I have to have a comparision for the column in the transformer.
What shall I substitute for C1 in the transformer if it is taken as key.
Please suggest me how it works.
What is the goal of your comparison? Are you trying to find the intersection of the two datasets? Are you trying to compare field values between the two data sets? A comparison requires a baseline, and something to compare against it. There must be some common elements (i.e. columns) to link the two data sets.

When a hased file is used as a reference in a comparison, the hash file can be considered the baseline and the input stream can be the data you are comparing against it. So for example, you may be checking to see if your stream data exists in the hash file; and if it exists, you may be checkign whether some element has changed.

In this scenario, the hash file key must represent the common elements between the two datasets, so if C1 is in Set A but not Set B, then it cannot be in the hash file key.

I think some of the posts suggest using Set B as the reference (hash) and Set A as the stream. In this case you would need to make C2 and C3 keys in the hash file. However, you have now reversed the comparison so it will be essentially opposite what the results you would get if Set A was the reference.

If you would provide more insight as to what you want to do, perhaps i can be of more help
Post Reply