Performance issues with Hash files.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


DSmamta
Participant
Posts: 23
Joined: Mon Jul 12, 2004 12:53 pm

Performance issues with Hash files.

Post by DSmamta »

Hi All:

We run a couple of hashed-file lookups in our server jobs; however, there are millions and millions of rows, so the jobs take forever. We need to improve performance.

My question: is it better to join the files, or to do a Merge on the key columns, instead of hash lookups?


Whichever is better, can you give an example of how to do it?

Thanks,
MJ
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Please describe your job design and row counts; your question is too vague. Hash files can be extremely efficient if used correctly. In addition, a multi-CPU server can be under-utilized if you do not build the job to support multiple CPUs. Either through IPC or job instances you can divide and conquer the data.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

In theory, joining is better if your source and reference are both large and almost identical in size.

Having said that, you may be able to reduce the size of your reference using partitioning or some other mechanism.
DSmamta

Post by DSmamta »

hi kcbland:

At the moment we are using only one CPU; we haven't gone parallel with Server yet. We do two hash lookups, based on two keys from different tables, and both lookups share a common lookup key.

Job Design:

Two hashed files feed a Transformer; a sequential file (input) goes into the Transformer, and a sequential file (output) comes out of it.

The row count runs to 189 million and then some.

Thanks,
Mamta
kcbland

Post by kcbland »

Do the hash files contain more columns than you absolutely need? That just wastes CPU time reading every character in each row of data. Investigate eliminating columns from your hash lookups.

Did you look at the data and overflow files for the hash lookups? A lot of data in the overflow portion reduces the optimized-lookup effect of a hash file. Consider setting a minimum modulus when creating the hash files so that the file is not undersized once populated.
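A rough way to think about sizing: the minimum modulus is the number of groups the file starts with, and you want enough groups that the populated data fits without spilling into overflow. The sketch below estimates it from row count and average row size; the 2048-byte group size and 80% target load factor are illustrative assumptions, not figures from this thread, so check your own hashed-file settings.

```python
# Hedged sketch: estimate a minimum modulus for a hashed file so the
# populated file avoids heavy overflow. The 2048-byte group size and
# the 0.8 load factor are assumptions for illustration.

def minimum_modulus(row_count, avg_row_bytes, group_bytes=2048, load_factor=0.8):
    """Return an estimated minimum modulus (number of groups) so the
    populated file stays under the target load factor."""
    total_bytes = row_count * avg_row_bytes
    groups = total_bytes / (group_bytes * load_factor)
    return max(1, int(groups) + 1)

# Example: a reference table of 2 million rows averaging 60 bytes each.
print(minimum_modulus(2_000_000, 60))  # roughly 73,000 groups
```

The point of over-allocating up front is that the file never has to split groups or chain into overflow buckets while you load it, which is where lookup performance usually degrades.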

Your design is perfect for multiple instances of the job dividing the source data. Running four instances of the job, each using a constraint to look at one quarter of the data (MOD(@INROWNUM, 4) = 0, 1, 2, or 3), will let you use all four CPUs. This is better than using IPC in your situation.
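The partitioning idea above can be sketched outside DataStage: each of four job instances keeps only the rows whose row number, modulo 4, matches its own instance number, so every source row is processed by exactly one instance. This is only an illustration of the MOD(@INROWNUM, 4) constraint; the data and instance count are assumptions.

```python
# Hedged sketch of the divide-and-conquer constraint: instance n keeps
# rows where row_number MOD instances == n, mirroring a Transformer
# constraint like MOD(@INROWNUM, 4) = n.

def partition_for_instance(rows, instance, instances=4):
    # @INROWNUM is 1-based in DataStage, so enumerate from 1 to match.
    return [row for i, row in enumerate(rows, start=1) if i % instances == instance]

rows = list(range(100, 112))  # stand-in for the source sequential file
parts = [partition_for_instance(rows, n) for n in range(4)]

# Every source row lands in exactly one instance's partition.
assert sorted(r for part in parts for r in part) == rows
```

Because the instances share no state, each one can own a CPU, and the four output files can simply be concatenated afterwards.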
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you telling us that there are millions of rows of data to be processed, or millions of rows in the hashed files? If the latter, some gains may be had by tuning the hashed files.

This is a highly skilled task, and I would recommend hiring a competent consultant to undertake it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.