Lookup Performance, Table or Hash

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

Lookup Performance, Table or Hash

Post by endyl13 »

Hi All,

In one of our jobs, we need to detect the new rows that arrive and generate surrogate keys for them. The thing is, the 'lookup' table is quite big, more than 1M rows. The data can come from multiple sources, and we have a SOURCE_ID field, so each job that handles a different source uses 'WHERE SOURCE_ID = SourceName' as a condition to minimize the number of rows to look up.

I am wondering if it would be better to do the lookup using a hash file. I would export all the rows from the lookup table with the 'WHERE SOURCE_ID = SourceName' condition into a hash file, and then do the lookup against the hash file instead.
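For readers following along, the pattern being described (filter the reference data by SOURCE_ID, load it into a keyed structure, then look each incoming row up and mint a surrogate key on a miss) can be sketched in Python. This is illustrative only; in DataStage this would be a Hashed File stage feeding a Transformer, and all field names here (NATURAL_KEY, SKEY) are hypothetical:

```python
def build_lookup(rows, source_id):
    """Simulate loading 'WHERE SOURCE_ID = ...' rows into a hash file:
    an in-memory map of natural key -> surrogate key for O(1) lookups."""
    return {r["NATURAL_KEY"]: r["SKEY"]
            for r in rows if r["SOURCE_ID"] == source_id}

def assign_keys(incoming, lookup, next_key):
    """For each incoming row, reuse the existing surrogate key if found;
    otherwise it is a new row, so generate the next surrogate key."""
    out = []
    for row in incoming:
        skey = lookup.get(row["NATURAL_KEY"])
        if skey is None:                      # lookup miss => new row
            skey = next_key
            lookup[row["NATURAL_KEY"]] = skey
            next_key += 1
        out.append({**row, "SKEY": skey})
    return out, next_key
```

The point of the per-source filter is that each job builds a small lookup containing only its own source's rows, instead of every job probing the full 1M+ row table.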

Would the performance improve significantly? Right now we are accessing one table only, and if we run multiple sources at the same time, performance is very slow.

Kindly advise [:o)]

Regards

ND
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

ND,

Here are some considerations:
1. Does the lookup table reside on the DataStage tier? Usually the database is remote, so the lookup is done over the network; on the other hand, the database tier is usually more powerful than the ETL tier.

2. Can the lookup data change while the lookup job is running? If so, it will be difficult to take a static 'snapshot' and build a hash file.

3. Is the row size of the lookup file large? Right now you can only cache a hash file of up to 999MB.

So, if the created hash file fits in the cache, select from the database only the rows and columns you're interested in for the lookup (Iconv any timestamps). Usually performance will be better, even if you sum the times for extracting the data from the database, transferring it to the other tier, populating the hash file, and running the lookup job. Remember that if the table is static during the loading process (usual in a DWH), you can start creating the lookups the minute the load phase starts, and by the time you get to the lookup jobs they'll be ready.

ArieAR
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I think there is a setting in the DataStage Administrator for the hash file read/write cache size. You may need to do some sums on the size of your hash file and bump up the cache size to make it fit.
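The "do some sums" step can be sketched as a back-of-envelope check against the 999MB cache ceiling Arie mentioned. Note the 1.3 hash-file overhead factor below is an assumption for illustration, not a DataStage figure; measure your actual file size to be sure:

```python
CACHE_LIMIT_MB = 999  # cache ceiling cited earlier in the thread

def estimated_hash_mb(row_count, avg_row_bytes, overhead=1.3):
    """Rough hash file size estimate in MB: rows * average row width *
    an assumed hashing/structure overhead factor."""
    return row_count * avg_row_bytes * overhead / (1024 * 1024)

def fits_in_cache(row_count, avg_row_bytes):
    """True if the estimated hash file would fit in the read cache."""
    return estimated_hash_mb(row_count, avg_row_bytes) <= CACHE_LIMIT_MB
```

For example, 1M rows at ~200 bytes each comes out around 250MB under these assumptions, comfortably inside the limit, while 10M such rows would not fit.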

Would multiple jobs using the same hash file use the same cache?

Vincent McBurney
Data Integration Services
www.intramatix.com
endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

Post by endyl13 »

Hi Vincent & Arie,

The database is on the same machine as the DS server.
The lookup data for one source will not change during the lookup process for that source. In the lookup table, SOURCE_ID is part of the key.
If we go with hash files, one hash file will be used per source during the lookup.
I think I will try to create the hash files during the load process, just like Arie suggested.
But are there any options or 'limitations' I should be aware of with hash files, other than file size? Like cache size, etc.?

Thanks

Regards

ND
azens
Premium Member
Posts: 24
Joined: Tue Feb 25, 2003 11:59 pm

Post by azens »

Hi,

You can do an analysis of your hash file first to find the optimal type and size for it. I have 2 hash files for generating surrogate numbers: one has 8 million records and the other 5.5 million. Both of them work well. I suggest using a static file and rebuilding it after the batch jobs finish, to guarantee the stability of the hash file.

regards,
Azens