Hash lookup very slow...

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Hash lookup very slow...

Post by tkbharani »

Dear All

I have a server job whose primary link is a 7GB sequential file, with a hash file of 5 crore (50 million) records, about 10GB, used as the lookup. After the lookup, the job writes a sequential file as output.

Hash file properties
--------------------
Type 30 Dynamic, resized to 64BIT
Size: 4 crore records, 10 GB
Columns: 3

Transformation:
---------------
Simple ICONV/OCONV.

Problem: this job is taking too long; only 300 rows per second are processed. What could be the bottleneck?
Thanks, BK
xanupam
Participant
Posts: 6
Joined: Sun Nov 25, 2007 11:10 am
Location: India

Post by xanupam »

The bottleneck is the size of the hash file itself.

See if you can minimise the size of the hash file, or create two smaller hash files for the lookup instead of one. Using caching would also improve performance, provided the hash file is small.
Cheers !!!
An S
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

Try running it with a smaller hashed file - just build it with (say) 5000 rows. Obviously your job won't work properly, but we're interested to see if the large hashed file is causing the problem.

If it is a lot faster with the smaller hashed file, you can try setting the Minimum Modulus on the hashed file to make the storage a bit more efficient. I think that mainly helps the speed of writing the file more so than reading it though.
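If you want to try that from the engine shell rather than the GUI, something along these lines should do it. This assumes the hashed file was created in the project (account) rather than by pathname, and the file name, path and modulus below are only placeholders:

Code: Select all

. $DSHOME/dsenv                  # DSHOME = your DSEngine directory; sets up the engine environment
cd /path/to/your/project         # the project (account) that owns the hashed file

# Pre-size the dynamic (type 30) hashed file so it does not have to keep
# splitting groups on the fly while it is being written.
$DSHOME/bin/uvsh <<'EOF'
RESIZE MYHASH * * * MINIMUM.MODULUS 500000
EOF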

It may be that the file is thrashing, i.e. it cannot fit into memory in its entirety and parts of it remain on disk. Every time you look up a row that is not in memory, a bit that IS in memory has to be swapped out. This problem is self-perpetuating.

The solution is not to use hashed files. The alternative would be to pre-sort your two sources and use a MERGE stage. Alternatively you could get them both on a database and use the database engine to perform the join.
Ross Leishman
tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Hash file can't be reduced.

Post by tkbharani »

The size of the hash file can't be reduced, as we need all of it, and it can't be split into smaller files either.
Does a DataStage 7.1 server job support a hash file this large, or do I have to tune the hash file on the server side?
Thanks, BK
xanupam
Participant
Posts: 6
Joined: Sun Nov 25, 2007 11:10 am
Location: India

Post by xanupam »

Better to use other methods for the lookup; a hash file will not give you good performance here.

You could try creating a database table to use for the lookup, or use Unix-level joins on flat files.
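For example, something along these lines at the Unix level. The delimiter, key position and file names are only examples, so adjust them to your layouts:

Code: Select all

# Sort both files on the lookup key (column 1, pipe-delimited here) and join them.
# LC_ALL=C keeps sort and join on the same byte-order collation.
LC_ALL=C sort -t'|' -k1,1 primary_input.txt  > primary_sorted.txt
LC_ALL=C sort -t'|' -k1,1 reference_file.txt > reference_sorted.txt

LC_ALL=C join -t'|' -1 1 -2 1 primary_sorted.txt reference_sorted.txt > joined_output.txt

# Add "-a 1" to the join if you need the equivalent of a left-outer lookup,
# i.e. keep primary rows that have no match in the reference file.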
Cheers !!!
An S
tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Post by tkbharani »

I tried the hash file with 100,000 records and the performance is good; it extracts at about 6000 rows per second. This confirms that the hash file is too large and needs some tuning. Is there any way of doing this in DataStage, rather than going to the database? A database merge statement would take more time.
Thanks, BK
xanupam
Participant
Posts: 6
Joined: Sun Nov 25, 2007 11:10 am
Location: India

Post by xanupam »

DataStage - you could try a routine to join the two files, though I am not too sure about this option.

Unix level - you can definitely join the two files with a join statement and create the merged file, then trigger that script from DataStage.
Cheers !!!
An S
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Consider using a static hashed file for very large sizes. Try initially with type 18. Keep the separation small, perhaps 2 or 4.
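For example, from the engine shell. The modulus below is a placeholder only: size it yourself from your data volume and group size, allow roughly 80% load, and prefer a prime number.

Code: Select all

# With SEPARATION 4 each group is 4 x 512 = 2048 bytes, so 10 GB of data is
# roughly 10e9 / 2048 ~= 5 million groups at 100% load; ~6 million gives ~80% load.
# 6000000 below is a placeholder; substitute your own (preferably prime) modulus.
. $DSHOME/dsenv                  # DSHOME = your DSEngine directory
cd /path/to/your/project         # the project (account) that should own the file
$DSHOME/bin/uvsh <<'EOF'
CREATE.FILE MYHASH_STATIC 18 6000000 4
EOF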
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Post by tkbharani »

ray.wurlod wrote:Consider using a static hashed file for very large sizes. Try initially with type 18. Keep the separation small, perhaps 2 or 4. ...
Currently we have created the hash file as Type 30 Dynamic and then resized it from 32-bit to 64-bit. It is not giving good performance (300 rows per second).

1. If we create a static file, what happens if the hash file size increases day by day? Do I have to resize the hash file as the data grows?
2. In the case of a static file, can you please suggest how to create one for a 10GB hash file? Will it be supported?
Thanks, BK
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

rleishman wrote:The solution is not to use hashed files. The alternative would be to pre-sort your two sources and use a MERGE stage.
Did you read this?
Ross Leishman
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That could be *a* solution. However, from what I recall the Server MERGE stage uses hashed structures under the covers to perform that operation, so I doubt it will work much better. Me, more than likely I'd bulk load the source file into my database and let it do the work from there.

As to the static hashed file questions, first suggestion would be to simply try it and see if it seems to buy you anything.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

With a hashed file that you are not re-creating and which is growing, prefer dynamic. But do try re-creating, see how long it takes to populate the 10GB.

I am surprised at your assertion that it is the hashed file lookup that is slow. You have a row size of less than 30 bytes, which should mean quite an efficient lookup.

Therefore please undertake this test: create the same job without the lookup. Let us know how long that takes (NOT rows/sec, elapsed time). Then re-introduce the lookup, and report that time.

You might also like to gather statistics on the Transformer stage (Tracing tab, Job Run Options dialog) to discover where it is spending most of its time. Post those results, too.

Finally, run ANALYZE.FILE hashedfile STATS against your hashed file, and post those results, so we can see how well/badly tuned the hashed file is.
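(If you have not run it before: on the server, from the project (account) that owns the hashed file, something like the following should do; the file name and path are placeholders.)

Code: Select all

. $DSHOME/dsenv                  # DSHOME = your DSEngine directory
cd /path/to/your/project         # the project (account) that owns the hashed file
$DSHOME/bin/uvsh <<'EOF'
ANALYZE.FILE MYHASH STATS
EOF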
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

chulett wrote: That could be *a* solution. However, from what I recall the Server MERGE stage uses hashed structures under the covers to perform that operation, so I doubt it will work much better.
And THAT's why he's a guru! I agree, the documentation bears you out on that as well. Unlike the Parallel Merge stage, there is no requirement for pre-sorted data in the Server Merge. In fact, the doco obliquely references hashed file locations.

But this still doesn't mean that hashed files are scalable beyond the limitations of your server's memory. The bigger the hashed file, the less of it fits into memory, and the greater the likelihood that any given row will NOT be in memory when you need it. Inevitably, performance degrades further and further.

Idea 1: Now logically, if you were to build your hashed file and then SORT the input data on the hash key, then whenever you had two or more rows with the same lookup key, you could be certain that the lookup row would be cached.

How the benefit here would trade off against the cost of the sort, I have no idea.
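For what it's worth, the sort itself is cheap to script at the Unix level; the delimiter, key and paths below are examples only:

Code: Select all

# Sort the 7GB primary input on the lookup key before the job reads it, so that
# repeated keys arrive together and are more likely to hit the hashed-file cache.
# -T points the temporary workspace at a filesystem with room for the sort.
LC_ALL=C sort -t'|' -k1,1 -T /path/to/sort_tmp -o primary_input_sorted.txt primary_input.txt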

Idea 2: Similar to xanupam's suggestion, does the Command stage exist in v7.1? Perhaps you could use the Unix join command to join the datasets, but have it embedded in the DataStage job rather than run as a separate command.
Ross Leishman
tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Post by tkbharani »

Thanks, Mr. Ray and all

I found the bottleneck. The hash file I am using for the lookup was also being read in parallel by some other applications. When I ran only the lookup job, it took just 60 minutes to complete. I think that contention was the reason. Anyway, my job now completes within the window.
Thank You All
Thanks, BK