Hi,
I am doing some performance tuning on one job. For that I am using the Link Partitioner stage to partition the data. If I use the Round Robin algorithm it runs fine, but when I use the Hash algorithm, with Sort/Merge in the Link Collector stage, the job runs for a long time. What could be the problem? Is it necessary to sort the data before hash partitioning it?
My question may be somewhat lengthy, but please bear with me and suggest a solution.
My job design:
seqfile ---> Link Partitioner ---> 3 XFM stages ---> Link Collector ---> DB2
thanks in advance.
Hashing algorithm in Link Partitioner
Hi Ravi
A sort always adds overhead while the job runs, and the cost depends on the data volumes. Is there a specific need to sort the data before sending it to DB2?
Regards
Siva
Listening to the Learned
"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
A sort is not necessary for partitioning.
The issue may be with the data. If you apply hash partitioning based on the key you specified, it will likely divide the data into three partitions, but not equally; much or even all of the data may fall into a single partition. Round robin is always better at splitting records (more or less) equally across all partitions compared to hash, unless hash is otherwise required.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
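To make the skew point concrete, here is a toy Python sketch (not DataStage's actual implementation) comparing how hash and round-robin partitioning spread rows across three partitions. The key values and helper names are illustrative assumptions only.

```python
# Illustrative sketch only: hash vs. round-robin distribution over 3
# partitions. Python's built-in hash() stands in for DataStage's
# (undocumented here) hashing algorithm.

def hash_partition(keys, n_parts):
    """Count rows per partition when each key is hashed; skew follows the keys."""
    counts = [0] * n_parts
    for k in keys:
        counts[hash(k) % n_parts] += 1
    return counts

def round_robin_partition(n_rows, n_parts):
    """Deal rows out in turn; partition sizes differ by at most one row."""
    return [n_rows // n_parts + (1 if i < n_rows % n_parts else 0)
            for i in range(n_parts)]

# Low-cardinality keys: every row sharing a key lands in the same
# partition, so one hot key can overload a single partition.
skewed_keys = ["A"] * 7 + ["B"] * 2 + ["C"] * 1
print(hash_partition(skewed_keys, 3))              # one partition holds at least 7 rows
print(round_robin_partition(len(skewed_keys), 3))  # [4, 3, 3]
```

With hash partitioning, balance depends entirely on the key values; round robin guarantees a near-even split regardless of the data.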
Hi Rasi,
Thanks for the reply. There is no need to sort the data. I am just splitting the data across 3 Transformer stages and collecting it into one DB2 table using a Link Collector. I want to improve performance. What is the performance overhead of the Hashing algorithm?
Thanks, Kumar. I am using 3 Transformer stages between the Link Partitioner and Link Collector stages. When I run the job with 10 records, using the Hashing algorithm with the key column being the PK, it distributes the records as 7 records to the 1st XFM stage, 1 record to the 2nd XFM, and 2 records to the 3rd XFM stage. How is it distributing the records? How many groups will it create by default?
Please explain patiently.
Thanks in advance.
Ravi
DataStage has its own inbuilt hashing algorithm. It is applied to the field you supply, and each record is then routed based on the remainder of dividing the hash result by the number of partitions. The effect can be something like: all keys whose hash ends in certain values go to the 1st partition, certain other values go to the 2nd partition, and so on.
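The remainder idea can be sketched as follows. Note this is a hypothetical stand-in hash (summing character codes), not DataStage's real algorithm, which is not documented in this thread; it only illustrates how "hash mod partition count" picks a partition.

```python
# Hypothetical sketch of remainder-based partitioning: partition number is
# a numeric hash of the key modulo the partition count. The hash used here
# (sum of character codes) is an illustrative stand-in, NOT DataStage's.

def partition_for(key, n_parts):
    # Sum the character codes of the key's string form as a toy hash,
    # then take the remainder to choose a partition.
    h = sum(ord(c) for c in str(key))
    return h % n_parts

# A handful of sample PK values can easily cluster on one remainder,
# which is how a 10-row run can split unevenly (e.g. 7/1/2) across 3 links.
for pk in [102, 114, 218, 327, 555]:
    print(pk, "-> partition", partition_for(pk, 3))
```

The number of output groups is simply the number of links out of the Link Partitioner; balance within them depends on how the key hashes distribute over the remainders.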
You should get more insight if you go through the documentation provided for parallel jobs.
Is it really necessary to collect the rows together before inserting them into the DB2 table? Why not have three parallel streams loading DB2? If the keys are unique (which they will be if you've partitioned on the key column) there will be no contention for locks.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.