datastage sort best performance

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

datastage sort best performance

Post by wuruima »

Dear all,
I would like to sort a file of more than 80,000,000 records.

I find that if I use one file (named file1) to hold all these records and build a job (4 nodes in the config file) to read/sort/sum/output the result, the job takes a long time to read. However, if I break file1 into 4 files (file1, file2, file3, file4), each containing 20,000,000 records, it takes less time to load all the records into the job.
May I know if anyone can provide the best practice for sorting a big file? e.g. how many Aggregator stages to use for the sort, or how many small files to break the big file into for reading.

End.

BTW, this is my first post and I find I cannot upload a local picture for more info...

Thanks.
wuruimao
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

First of all, since you say "file" do you mean "sequential file" or a database table?

Reading such a large number of records from a sequential file is fairly unusual, which is why I ask.

In the event that your data IS in a sequential file, then yes, you'll see significantly faster "load times" by breaking the file up into pieces to match your partitioning. This enables the job to read "in parallel" and will dramatically improve performance. If the file uses fixed length records there are other means to improve performance by adding additional "readers", but for the most part, splitting it up is the best way to go.
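For example, the split could be done outside DataStage with the standard Unix split command before the job runs (a rough sketch; the path and file name are just examples):

    # split an 80M-row file into 4 pieces of 20,000,000 lines each
    split -l 20000000 /data/input/file1.txt /data/input/file1_part_
    # yields file1_part_aa .. file1_part_ad, one per partition/node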

Using aggregator stages or any other means to break up the file after you've read it isn't the solution. Once you've gotten the file "read", the data is already split up by partition.

Also - the easiest way to post a picture on this website is via a URL that links to an image-hosting site. For the most part we don't embed images directly, so that people who aren't on broadband can still use the site and get good performance.

Welcome to the DSXChange! If you need more information please post more details on your issue and someone will be glad to help you along!
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could avoid splitting the file by using the "multiple readers per node" capability. Use Sort stage to sort the individual partitions (partition by the first sort key so that results are correct), and perhaps allocate more than the default amount of memory per node to reduce spillage to disk.

If you require the results to go back into a text file, use a Sort Merge collector specifying the same sort key(s); there's no need to re-sort.
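Roughly, the stage settings involved would look like this (property names as they appear in the Parallel job GUI; the memory figure is only an example):

    Sequential File stage (source):
        Options -> Number Of Readers Per Node = 4
    Sort stage:
        Input -> Partitioning = Hash on the first sort key
        Options -> Restrict Memory Usage (MB) = 512    (default is 20)
    Sequential File stage (target):
        Input -> Partitioning -> Collector type = Sort Merge, same sort key(s)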
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

Post by wuruima »

Thanks for your reply. Yes, the big file is a sequential file.
I did some testing to find the best practice, and found that if I split this big file into 4 small files, use 4 Aggregator stages to pre-sort/sum each file after reading, then use a Funnel to collect all 4 links and a fifth Aggregator stage to sort/sum the final result, it is much faster than using only 1 Aggregator to do the sort/sum. I guess that's because the pre-sort/sum avoids sorting 80,000,000 records directly.
Not sure if I've made it clear; details below:
e.g.

Solution 1:
F1 (20,000,000) --\
F2 (20,000,000) ----> Funnel (80,000,000) -> Aggregator -> result file
F3 (20,000,000) ---/
F4 (20,000,000) --/

compared to

Solution 2:
F1 (20,000,000) -> Aggregator (5,000,000) --\
F2 (20,000,000) -> Aggregator (5,000,000) ----> Funnel (20,000,000) -> Aggregator -> result file
F3 (20,000,000) -> Aggregator (5,000,000) ---/
F4 (20,000,000) -> Aggregator (5,000,000) --/

I find the second solution is much better than the first one (in sort and sum performance). Is this normal?
wuruimao
wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

Post by wuruima »

ray.wurlod wrote:You could avoid splitting the file by using the "multiple readers per node" capability. Use Sort stage to sort the individual partitions (partition by the first sort key so that results are correct), ...
I find that using the sort function in the Aggregator instead of a separate Sort stage performs better.
wuruimao
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Did you increase the memory using the Sort stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

In addressing the "is this normal" part of the question...

What you are doing is quadrupling the number of processes working on the problem by using four aggregators.

For example, assuming you are running an 8-node config file, you'll have 32 processes running the aggregations (8 for each of the aggregations), instead of only 8 processes (for the single aggregation). If you have the CPU, disk and memory to support that additional load, then yes it will run faster.

However, I suspect you'd see the same performance results using a single aggregator, multiple readers, and a config file with four times as many nodes, which would still give four times more processes performing work.
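For reference, each node in the parallel configuration file is a block like the following (hostname and paths are placeholders), repeated once per node with distinct node names:

    {
        node "node1" {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/ds/node1" {pools ""}
        }
        /* repeat for node2 .. node16 */
    }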

And I agree with Ray, (what a surprise... :-)) increase the memory on the Sort stage and it should always be equal to or better than an "in-line" sort in terms of performance.
Last edited by asorrell on Wed Nov 06, 2013 11:36 am, edited 1 time in total.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Never mind the fact that the Aggregator stages would then have much less work to do. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

Post by wuruima »

ray.wurlod wrote:Did you increase the memory using the Sort stage? ...
:o I didn't change any stage settings for my jobs; let me have a try...
wuruimao
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

How are you building your source sequential file? From your explanation, I am assuming that you are building the file yourself. Are you using DataStage to build it? If so, have you considered using a data set instead of a sequential file?
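A data set (.ds) is stored already partitioned, so a downstream job can read it in parallel without any file splitting or re-partitioning. You can inspect one from the command line with orchadmin (the path is just an example):

    orchadmin describe /data/ds/big_input.ds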