
DataStage sort best performance

Posted: Mon Nov 04, 2013 10:33 pm
by wuruima
Dear all,
I would like to sort a file of more than 80,000,000 records.

I find that if I use one file (named file1) to hold all these records and build a job (4 nodes in the config file) to read/sort/sum/output the result, the job takes a long time in the read. However, if I break file1 into 4 files (file1, file2, file3, file4), each containing 20,000,000 records, the job takes less time to load all the records.
May I know if anyone can share best practice for sorting a big file? E.g. how many Aggregator stages to use for the sort, or how many small files to break the big file into for reading.

End.

BTW, this is my first post and I find I cannot upload a picture from my local machine to give more info...

thanks.

Posted: Tue Nov 05, 2013 10:12 am
by asorrell
First of all, since you say "file" do you mean "sequential file" or a database table?

Reading such a large number of records from a sequential file is fairly unusual, which is why I ask.

In the event that your data IS in a sequential file, then yes, you'll see significantly faster "load times" by breaking the file up into pieces to match your partitioning. This enables the job to read "in parallel" and will dramatically improve performance. If the file uses fixed-length records there are other ways to improve performance, such as adding additional "readers per node", but for the most part, splitting it up is the best way to go.
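If the records are newline-delimited, a quick way to do the split outside DataStage is the standard Unix split utility (the file name and record count below are just the ones from your post):

   # break the 80,000,000-record file into four 20,000,000-line pieces
   # (file1.aa, file1.ab, file1.ac, file1.ad) to match 4-way partitioning
   split -l 20000000 file1 file1.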

Using aggregator stages or any other means to break up the file after you've read it isn't the solution. Once you've gotten the file "read" the data is already split up by partition.

Also - the easiest way to post a picture on this website is via a URL that links to an image-hosting site. For the most part we don't allow direct uploads, so that people who aren't on broadband can still use the site and get good performance.

Welcome to the DSXChange! If you need more information please post more details on your issue and someone will be glad to help you along!

Posted: Tue Nov 05, 2013 8:33 pm
by ray.wurlod
You could avoid splitting the file by using the "multiple readers per node" capability. Use Sort stage to sort the individual partitions (partition by the first sort key so that results are correct), and perhaps allocate more than the default amount of memory per node to reduce spillage to disk.

If you require the results to go back into a text file, use a Sort Merge collector specifying the same sort key(s); there's no need to re-sort.
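As a rough analogy outside DataStage (GNU sort assumed; the pipe delimiter and file names are invented for illustration), the "sort the pieces, then merge" idea looks like this:

   # sort each piece independently on the first field, in parallel
   for f in file1 file2 file3 file4; do
     sort -t'|' -k1,1 "$f" > "$f.sorted" &
   done
   wait
   # -m merges the already-sorted pieces without a full re-sort,
   # which is what a Sort Merge collector does on the same key(s)
   sort -m -t'|' -k1,1 file?.sorted > result.txt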

Posted: Tue Nov 05, 2013 9:47 pm
by wuruima
Thanks for your reply. Yes, the big file is a sequential file.
I did some testing to find the best practice. If I split this big file into 4 small files, use 4 Aggregator stages to do a pre-sort/sum on each file after reading, then use a Funnel to collect the 4 links and a fifth Aggregator stage to do the final sort/sum, it is much faster than using only 1 Aggregator for the whole sort/sum. I guess that is because the pre-sort/sum avoids sorting 80,000,000 records directly.
Not sure if I've made it clear; details below.
e.g.

Design 1 - funnel everything, then one Aggregator:

   F1 (20,000,000) --+
   F2 (20,000,000) --+--> Funnel (80,000,000) --> Aggregator --> result file
   F3 (20,000,000) --+
   F4 (20,000,000) --+

compared to

Design 2 - pre-sort/sum each file, then funnel and a final Aggregator:

   F1 (20,000,000) --> Aggregator (5,000,000) --+
   F2 (20,000,000) --> Aggregator (5,000,000) --+--> Funnel (20,000,000) --> Aggregator --> result file
   F3 (20,000,000) --> Aggregator (5,000,000) --+
   F4 (20,000,000) --> Aggregator (5,000,000) --+

I find the second design performs much better than the first (for both the sort and the sum). Is this normal?
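A rough shell/awk sketch of the same two-level idea, with the delimiter and key/sum field positions invented for illustration - each level only ever aggregates a fraction of the 80,000,000 rows at once:

   # level 1: pre-sum each piece independently (one small partial result per piece)
   for f in file1 file2 file3 file4; do
     awk -F'|' '{sum[$1] += $2} END {for (k in sum) print k "|" sum[k]}' "$f" > "$f.agg" &
   done
   wait
   # level 2: sum the much smaller partial results, then sort the final output
   cat file?.agg \
     | awk -F'|' '{sum[$1] += $2} END {for (k in sum) print k "|" sum[k]}' \
     | sort -t'|' -k1,1 > result.txt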

Posted: Wed Nov 06, 2013 3:01 am
by wuruima
ray.wurlod wrote:You could avoid splitting the file by using the "multiple readers per node" capability. Use Sort stage to sort the individual partitions (partition by the first sort key so that results are correct), ...
I find that using the sort function in the Aggregator, instead of a separate Sort stage, performs better.

Posted: Wed Nov 06, 2013 7:45 am
by ray.wurlod
Did you increase the memory available to the Sort stage?

Posted: Wed Nov 06, 2013 9:14 am
by asorrell
In addressing the "is this normal" part of the question...

What you are doing is quadrupling the number of processes working on the problem by using four aggregators.

For example, assuming you are running an 8-node config file, you'll have 32 processes running the aggregations (8 for each of the aggregations), instead of only 8 processes (for the single aggregation). If you have the CPU, disk and memory to support that additional load, then yes it will run faster.

However, I suspect you'd see the same performance results using a single aggregator, multiple readers, and a config file with four times as many nodes, which would still give four times more processes performing work.
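For reference, a parallel configuration file is just a list of node blocks like the two below (the fastname and resource paths are placeholders); a "four times as many nodes" config repeats the block with distinct node names and, ideally, separate disk and scratchdisk areas:

   {
     node "node0"
     {
       fastname "etl_host"
       pools ""
       resource disk "/data/ds/d0" {pools ""}
       resource scratchdisk "/scratch/ds/s0" {pools ""}
     }
     node "node1"
     {
       fastname "etl_host"
       pools ""
       resource disk "/data/ds/d1" {pools ""}
       resource scratchdisk "/scratch/ds/s1" {pools ""}
     }
   }

Extend the same pattern (node2, node3, ...) for as many partitions as your CPU, memory and disk can support.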

And I agree with Ray (what a surprise... :-)) - increase the memory on the Sort stage and it should always be equal to or better than an "in-line" sort in terms of performance.

Posted: Wed Nov 06, 2013 10:27 am
by chulett
Never mind the fact that the Aggregator stages would then have much less work to do. :wink:

Posted: Thu Nov 07, 2013 8:20 pm
by wuruima
ray.wurlod wrote:Did you increase the memory using the Sort stage? ...
:o I didn't change any stage settings in my jobs; let me have a try...

Posted: Mon Nov 11, 2013 9:57 am
by kwwilliams
How are you building your source sequential file? From your explanation, I am assuming that you are building the file yourself. Are you using DataStage to build it? If so, have you considered using a Data Set instead of a sequential file?