
How to replace the Sort stage for huge-volume data?

Posted: Sat Jan 25, 2014 5:21 am
by swapna07
Hi All,

I have a job which processes around 89 million records. The job design is like this:

Seq ---> Transformer ---> Funnel ---> Sort stage ---> Transformer ---> 2 outputs (Seq, XML)

Up to the Sort stage it takes 30 minutes, but sorting this huge volume of data takes 2 hours 15 minutes. :cry:
Can anyone help me cut the execution time roughly in half?

Thanks in advance.

Posted: Sat Jan 25, 2014 8:46 am
by chulett
Pretty arbitrary requirement, cutting the time in half. Seems to me you'd need access to a faster sort, perhaps something third party like SyncSort, to accomplish that.

How many nodes does the job run on? You could try experimenting with that... and if your source file allows parallel reads, that may help.
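
For what it's worth, the node count comes from the parallel configuration file pointed to by APT_CONFIG_FILE. A minimal two-node sketch, with the hostname and paths as placeholders rather than anything from your system:

    {
      node "node1" {
        fastname "etlhost"
        pools ""
        resource disk "/data/ds1" {pools ""}
        resource scratchdisk "/scratch/ds1" {pools ""}
      }
      node "node2" {
        fastname "etlhost"
        pools ""
        resource disk "/data/ds2" {pools ""}
        resource scratchdisk "/scratch/ds2" {pools ""}
      }
    }

More node entries only help if the CPUs and disks behind them can keep up.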

Posted: Sun Jan 26, 2014 8:04 am
by swapna07
This job runs on 8 nodes. I tried to increase the buffer size in the Sort stage by setting auto-buffer, but it is not helping. This is actually just one job in the whole interface; the interface as a whole takes 3 hours 45 minutes to process 88 million records, and this job consumes most of that time. The business has come back asking us to reduce the run time. I really don't know what to do!! :cry: :cry:

Let me know in case you can help. Thanks in advance.

Posted: Sun Jan 26, 2014 11:05 am
by prasannakumarkk
Can you tell us what you want to achieve by sorting the data? How many keys are in the Sort stage? What partition type is used?

Posted: Mon Jan 27, 2014 7:25 am
by eph
Hi,

Have you tried the "Restrict Memory Usage" option in the Sort stage? You could set a higher value in order to reduce the amount of data landing on disk.
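
Under the covers the Sort stage compiles to the tsort operator. Assuming that GUI option maps to tsort's -memory option (value in MB; the key name and value below are made up for illustration), the relevant osh fragment would look roughly like:

    hash -key cust_id | tsort -key cust_id -memory 512

i.e. hash-partition on the sort key, then sort each partition with up to 512 MB of memory before spilling to scratch.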

Did you check the job score to verify that no additional sort is inserted automatically at runtime?
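
You can dump the score by setting the reporting variable below (as a job parameter or a project-level default); any automatically inserted tsort operators then show up in the job log:

    APT_DUMP_SCORE=True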

Also check the I/O performance of the sort (scratch) volumes configured in the APT configuration file (APT_CONFIG_FILE).
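
The sort spills to the scratchdisk resources defined in that file, so their throughput matters; inside each node entry you can spread scratch over several fast volumes, e.g. (paths are placeholders):

    resource scratchdisk "/fast1/scratch" {pools ""}
    resource scratchdisk "/fast2/scratch" {pools ""}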

Eric