Hashing algorithm in Link Partitioner

Posted: Sun Mar 12, 2006 11:54 pm
by ravij
Hi,

I am doing some performance tuning on a job, and I am using the Link Partitioner stage to partition the data. With the Round Robin algorithm the job runs fine. But when I use the Hash algorithm, with Sort/Merge in the Link Collector stage, the job runs for a long time. What could be the problem? Is it necessary to sort the data before hash partitioning it?
My question may be somewhat lengthy, but please bear with me and suggest a solution.
My job design:

seqfile--->LinkPartitioner-->3 XFM stages --> Linkcollector-->DB2

Thanks in advance.

Posted: Mon Mar 13, 2006 12:17 am
by rasi
Hi Ravi

Sorting always adds overhead to a running job, and how much depends on the data volume. Is there a specific need to sort the data before sending it to DB2?

Posted: Mon Mar 13, 2006 1:21 am
by kumar_s
Sorting is not necessary for partitioning.
The issue may be with the data. If you apply hash partitioning on the key you specified, it will divide the data into three partitions, but not necessarily equally. Most or even all of the data may fall into a single partition. Round robin is always good at splitting the records (more or less) equally across all the partitions when compared to hash, unless hash is otherwise required.
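
To illustrate the point about uneven partitions, here is a minimal Python sketch (not DataStage code). It assumes a hypothetical skewed input in which 90 of 100 rows share one key value, and it uses CRC32 as a stand-in hash, since the hash function DataStage actually uses is not given in this thread. The function names are made up for the example.

```python
from collections import Counter
from itertools import cycle
import zlib


def round_robin_partition(keys, n_partitions):
    """Assign records to partitions 0..n-1 in turn, ignoring the key value."""
    counts = Counter()
    for _, part in zip(keys, cycle(range(n_partitions))):
        counts[part] += 1
    return [counts[p] for p in range(n_partitions)]


def hash_partition(keys, n_partitions):
    """Assign records by hash(key) mod n; equal keys always land together."""
    counts = Counter()
    for key in keys:
        # CRC32 is an assumption for illustration, not DataStage's algorithm.
        part = zlib.crc32(str(key).encode("utf-8")) % n_partitions
        counts[part] += 1
    return [counts[p] for p in range(n_partitions)]


# Hypothetical skewed input: 90 rows share one key, 10 rows are distinct.
keys = ["CUST_A"] * 90 + [f"CUST_{i}" for i in range(10)]

print("round robin:", round_robin_partition(keys, 3))
print("hash       :", hash_partition(keys, 3))
```

Round robin splits the 100 rows 34/33/33 regardless of key values. With hashing, every row with the repeated key goes to the same partition, so one partition receives at least 90 rows and the Transformer behind it does most of the work.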

Posted: Mon Mar 13, 2006 3:16 am
by ravij
Hi Rasi,

Thanks for the reply. There is no need to sort the data. I am just splitting the data across 3 Transformer stages and collecting it into one DB2 table using the Link Collector. I want to improve performance. What is the performance overhead of using the Hash algorithm?

Thanks, Kumar. I am using 2 Transformer stages between the Link Partitioner and Link Collector stages. When I run the job with 10 records, using the Hash algorithm with the PK as the key column, it distributes the records like this: 7 records to the 1st XFM stage, 1 record to the 2nd XFM stage and 2 records to the 3rd XFM stage. How is it distributing the records? How many groups will it create by default?

Please bear with me and suggest a solution.
Thanks in advance.

Posted: Mon Mar 13, 2006 4:14 am
by kumar_s
DataStage has its own built-in hashing algorithm, which it applies to the field you supply. Each record is then distributed based on the remainder of the resulting hash value, which is divided by the number of partitions applied. It could be something like this: all the numbers ending in 2, 4 or 8 go to the 1st partition, some of the odd numbers go to the 2nd partition, and so on.
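
The remainder-based idea described above can be sketched in Python as follows. This is only an illustration under stated assumptions: CRC32 stands in for DataStage's internal hash, which is not documented in this thread, and `partition_for` and `distribute` are hypothetical names.

```python
import zlib


def partition_for(key, n_partitions):
    """Return the 0-based partition index for a key: hash(key) mod n."""
    # Stand-in hash for illustration only, not DataStage's algorithm.
    hashed = zlib.crc32(str(key).encode("utf-8"))
    return hashed % n_partitions


def distribute(keys, n_partitions):
    """Group keys by the partition they hash to."""
    partitions = {p: [] for p in range(n_partitions)}
    for key in keys:
        partitions[partition_for(key, n_partitions)].append(key)
    return partitions


# Ten hypothetical primary-key values spread over three partitions.
for part, members in distribute(range(1, 11), 3).items():
    print(f"partition {part}: {len(members)} rows -> {members}")
```

In this sketch the number of groups is simply the number of partitions passed in, and the same key always maps to the same partition. With only 10 rows the counts per partition can easily come out uneven, much like the 7/1/2 split reported earlier in the thread.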
You should get more insight if you go through the documentation provided for parallel jobs.

Posted: Mon Mar 13, 2006 6:09 am
by ray.wurlod
Is it really necessary to collect the rows together before inserting them into the DB2 table? Why not have three parallel streams loading DB2? If the keys are unique (which they will be if you've partitioned on the key column) there will be no contention for locks.