Lookup Performance, Table or Hash

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

Lookup Performance, Table or Hash

Post by endyl13 »

Hi All,

In one of our jobs, we need to detect the new rows that arrive and generate surrogate keys for them. The thing is, the 'lookup' table is quite big, more than 1M rows. The data can come from multiple sources, and we have a SOURCE_ID field, so each job that handles a different source uses 'WHERE SOURCE_ID = SourceName' as a condition to minimize the number of rows to look up.

I am wondering if it would be better to do the lookup using a hash file. I would export all the rows from the lookup table with the 'WHERE SOURCE_ID = SourceName' condition into a hash file, and then do the lookup against the hash file instead.
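For readers following along, the pattern being described (filter the reference data by SOURCE_ID, load it into a keyed structure, then look each incoming row up and mint a surrogate key on a miss) can be sketched in Python. This is illustrative only; in DataStage this would be a Hashed File stage feeding a Transformer, and all field names here (NATURAL_KEY, SKEY) are hypothetical:

```python
def build_lookup(rows, source_id):
    """Simulate loading 'WHERE SOURCE_ID = ...' rows into a hash file:
    an in-memory map of natural key -> surrogate key for O(1) lookups."""
    return {r["NATURAL_KEY"]: r["SKEY"]
            for r in rows if r["SOURCE_ID"] == source_id}

def assign_keys(incoming, lookup, next_key):
    """For each incoming row, reuse the existing surrogate key if found;
    otherwise it is a new row, so generate the next surrogate key."""
    out = []
    for row in incoming:
        skey = lookup.get(row["NATURAL_KEY"])
        if skey is None:                      # lookup miss => new row
            skey = next_key
            lookup[row["NATURAL_KEY"]] = skey
            next_key += 1
        out.append({**row, "SKEY": skey})
    return out, next_key
```

The point of the per-source filter is that each job builds a small lookup containing only its own source's rows, instead of every job probing the full 1M+ row table.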

Would the performance improve significantly? Right now we are accessing one table only, and if we run multiple sources at the same time, performance is very slow.

Kindly advise [:o)]

Regards

ND
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

ND,

Here are some considerations:
1. Does the lookup table reside on the DataStage tier? Usually the database is remote, so the lookup is done over the network; on the other hand, the database tier is usually more powerful than the ETL tier.

2. Can the lookup data change while the lookup job is running? If so, it will be difficult to take a static 'snapshot' and build a hash file.

3. Is the row size of the lookup file large? Right now you can only cache a hash file of up to 999MB.

So, if the created hash file fits in the cache, select from the database only the rows and columns you're interested in for the lookup (Iconv any timestamps). Usually performance will be better, even if you sum the times for extracting the data from the database, transferring it to the other tier, populating the hash file, and running the lookup job. Remember that if the table is static during the loading process (usual in a DWH), you can start creating the lookups the minute the load phase starts, and by the time you get to the lookup jobs they'll be ready.

ArieAR
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I think there is a setting in the DataStage Administrator for the hash file read/write cache size. You may need to do some sums on the size of your hash file and bump up the cache size to make it fit.
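The "do some sums" step can be sketched as a back-of-envelope check against the 999MB cache ceiling Arie mentioned. Note the 1.3 hash-file overhead factor below is an assumption for illustration, not a DataStage figure; measure your actual file size to be sure:

```python
CACHE_LIMIT_MB = 999  # cache ceiling cited earlier in the thread

def estimated_hash_mb(row_count, avg_row_bytes, overhead=1.3):
    """Rough hash file size estimate in MB: rows * average row width *
    an assumed hashing/structure overhead factor."""
    return row_count * avg_row_bytes * overhead / (1024 * 1024)

def fits_in_cache(row_count, avg_row_bytes):
    """True if the estimated hash file would fit in the read cache."""
    return estimated_hash_mb(row_count, avg_row_bytes) <= CACHE_LIMIT_MB
```

For example, 1M rows at ~200 bytes each comes out around 250MB under these assumptions, comfortably inside the limit, while 10M such rows would not fit.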

Would multiple jobs using the same hash file use the same cache?

Vincent McBurney
Data Integration Services
www.intramatix.com
endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

Post by endyl13 »

Hi Vincent & Arie,

The database is on the same machine as the DS server.
The lookup data for one source will not change during the lookup process for that source. In the lookup table, SOURCE_ID is part of the key.
If we go with hash files, one hash file will be used per source during the lookup.
I think I will try to create the hash files during the load process, just like Arie suggested.
But are there any options or 'limitations' I should be aware of with hash files, other than file size? Like cache size, etc.?

Thanks

Regards

ND
azens
Premium Member
Posts: 24
Joined: Tue Feb 25, 2003 11:59 pm

Post by azens »

Hi,

You can do an analysis of your hash file first to find the optimal type and size for it. I have 2 hash files for generating surrogate numbers: one has 8 million records and the other 5.5 million. Both of them work well. I suggest using a static file and rebuilding it after the batch jobs finish, to guarantee the stability of the hash file.

regards,
Azens