Performance issues with Hash files.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


DSmamta
Participant
Posts: 23
Joined: Mon Jul 12, 2004 12:53 pm

Performance issues with Hash files.

Post by DSmamta »

Hi All:

We run a couple of hashed-file lookups in our server jobs; however, there are millions and millions of rows, so the jobs take forever. We need to improve performance.

My question: is it better to join the files, or to do a Merge on the key columns, instead of hash lookups?


Whichever is better, can you give an example of how to do it?

Thanks,
MJ
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Please describe your job design and row counts; your question is too vague. Hash files can be extremely efficient if used correctly. In addition, a multi-CPU server can be under-utilized if you do not build the job to support multiple CPUs. Either through IPC or job instances you can divide and conquer the data.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

In theory, joining is better if your source and reference are both large and almost identical in size.

Having said that, you may be able to reduce the size of your reference using partitioning or some other mechanism.
DSmamta

Post by DSmamta »

hi kcbland:

At the moment we are using only one CPU; we haven't gone parallel with Server yet. We do two hash lookups, based on two keys from different tables, and both lookups share a common lookup key.

Job Design:

Two hashed files feed a Transformer; a sequential file (input) goes into the Transformer, and a sequential file (output) comes out of it.

The row count runs to 189 million and then some.

Thanks,
Mamta
kcbland

Post by kcbland »

Do the hash files contain more columns than you absolutely need? That just wastes CPU time reading every character in each row of data. Investigate eliminating columns from your hash lookups.

Did you look at the data and overflow files for the hash lookups? A lot of data in the overflow portion reduces the optimized-lookup effect of a hash file. Consider setting a minimum modulus when creating the hash files so that the file is not undersized once populated.
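A rough way to think about sizing: the minimum modulus is the number of groups the file starts with, and you want enough groups that the populated data fits without spilling into overflow. The sketch below estimates it from row count and average row size; the 2048-byte group size and 80% target load factor are illustrative assumptions, not figures from this thread, so check your own hashed-file settings.

```python
# Hedged sketch: estimate a minimum modulus for a hashed file so the
# populated file avoids heavy overflow. The 2048-byte group size and
# the 0.8 load factor are assumptions for illustration.

def minimum_modulus(row_count, avg_row_bytes, group_bytes=2048, load_factor=0.8):
    """Return an estimated minimum modulus (number of groups) so the
    populated file stays under the target load factor."""
    total_bytes = row_count * avg_row_bytes
    groups = total_bytes / (group_bytes * load_factor)
    return max(1, int(groups) + 1)

# Example: a reference table of 2 million rows averaging 60 bytes each.
print(minimum_modulus(2_000_000, 60))  # roughly 73,000 groups
```

The point of over-allocating up front is that the file never has to split groups or chain into overflow buckets while you load it, which is where lookup performance usually degrades.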

Your design is perfect for multiple instances of the job dividing the source data. Running four instances of the job, each using a constraint to look at one quarter of the data (MOD(@INROWNUM, 4) = 0, 1, 2, or 3), will let you use all four CPUs. This is better than using IPC in your situation.
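The partitioning idea above can be sketched outside DataStage: each of four job instances keeps only the rows whose row number, modulo 4, matches its own instance number, so every source row is processed by exactly one instance. This is only an illustration of the MOD(@INROWNUM, 4) constraint; the data and instance count are assumptions.

```python
# Hedged sketch of the divide-and-conquer constraint: instance n keeps
# rows where row_number MOD instances == n, mirroring a Transformer
# constraint like MOD(@INROWNUM, 4) = n.

def partition_for_instance(rows, instance, instances=4):
    # @INROWNUM is 1-based in DataStage, so enumerate from 1 to match.
    return [row for i, row in enumerate(rows, start=1) if i % instances == instance]

rows = list(range(100, 112))  # stand-in for the source sequential file
parts = [partition_for_instance(rows, n) for n in range(4)]

# Every source row lands in exactly one instance's partition.
assert sorted(r for part in parts for r in part) == rows
```

Because the instances share no state, each one can own a CPU, and the four output files can simply be concatenated afterwards.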
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you telling us that there are millions of rows of data to be processed, or millions of rows in the hashed files? If the latter, some gains may be had by tuning the hashed files.

This is a highly skilled task, and I would recommend hiring a competent consultant to undertake it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.