Sorting on huge volume

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

mathewb
Premium Member
Posts: 22
Joined: Tue Jul 17, 2007 10:35 pm

Sorting on huge volume

Post by mathewb »

Hi,

I searched the complete forum and found similar topics, but mine is a little different. This is regarding sorting, and I would like to hear your opinions.
I need to pull about 100 million records from Oracle over a dblink and create a file sorted on one column. All the data resides in one partition.
Currently I am pulling based on ranges of the key column (4 parallel pulls), funneling them to a transformer, and then writing to a flat file. The file is then sorted using unix sort.
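Each pull is essentially a ranged query like this (table, dblink, column names and credentials are all made up):

Code: Select all

# one of the 4 ranged pulls; all names here are placeholders
sqlplus -s scott/tiger@localdb <<'EOF' > pull_1.dat
set pagesize 0 feedback off heading off linesize 200
select key_col || '|' || col2 || '|' || col3
from   big_table@remote_db
where  key_col >= 1 and key_col < 25000000;
EOF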

Is there a faster way to create this final sorted file?

Currently it takes 45 minutes to pull the data and one hour to sort it in unix. Will this be any faster if I do the sorting in DataStage?
What I am seeing is that the next stage has to wait if I try sorting in any stage (a dedicated Sort stage or an on-link sort), since DataStage requires all the data before it can finish the sort and write the file. The output file is in the range of 4-6 GB, so I am a little worried about holding all that data in memory to do the sort on the DataStage server.
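For example, since my four pulls already split the data by key range, would sorting each piece separately and then merging be any faster than one big sort? A sketch, assuming the ranges do not overlap:

Code: Select all

# sort the four range files side by side, then do a streaming merge
for f in pull_1.dat pull_2.dat pull_3.dat pull_4.dat; do
  sort -t'|' -n -k1,1 -T /scratch "$f" -o "$f.sorted" &
done
wait
# -m only merges already-sorted inputs, so it is fast and low-memory
sort -m -t'|' -n -k1,1 pull_?.dat.sorted -o final_sorted.dat

If the ranges really are disjoint, the merge could even be replaced by a simple cat of the sorted pieces in range order.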

Is the ORDER BY clause in Oracle another way to write the file in sorted order? An ORDER BY can take a long time on the Oracle side for such a huge volume, and it also makes the pull sequential.
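For example, would a hinted query like this (the hint and the degree are guesses on my part) push the sort to Oracle at an acceptable speed?

Code: Select all

# let Oracle do the sort; the PARALLEL hint asks for a parallel
# scan/sort, though over a dblink the benefit is uncertain
sqlplus -s scott/tiger@localdb <<'EOF' > sorted_pull.dat
set pagesize 0 feedback off heading off linesize 200
select /*+ parallel(t, 4) */ key_col || '|' || col2
from   big_table@remote_db t
order  by key_col;
EOF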

Please let me know what you think.

Thanks
Mathew
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If the column(s) on which you are sorting are indexed, then it will be faster to sort in the database. This is because index values are stored in sorted order, so to "sort" simply involves scanning along the leaf nodes of the B-tree index (or straight through a bitmap index).
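For example (names invented for illustration):

Code: Select all

sqlplus -s scott/tiger@db <<'EOF'
-- with an index on the sort column, Oracle can return rows in
-- order by walking the index leaf blocks instead of sorting
create index big_table_key_ix on big_table (key_col);
select key_col, col2
from   big_table
where  key_col is not null   -- lets the optimizer use the index
order  by key_col;
EOF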
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mathewb
Premium Member
Posts: 22
Joined: Tue Jul 17, 2007 10:35 pm

Post by mathewb »

No, the table is not indexed. Since the pull from this table is over a dblink, it cannot go faster than about 4000 records/sec. And since all the data resides in one partition of the table, I cannot use any DataStage parallelism. I am currently making 4 parallel pulls with ranges (individual stages funneled to a transformer), which gives me a total of about 15,000 records per second, and the file is written at that rate. If I sort these records in any stage, it will wait for the complete data pull before writing the file. That makes it totally sequential and memory-bound, and it takes 4 times as long.
Since the column is not indexed, Oracle also takes time to sort before it starts streaming the data.
So what I am asking is: does DataStage offer anything in its sort process that I am missing and that could help in the proper design of this function? I am new to DataStage (only one month), which is why I am hoping many of you have designed similar systems. Your experience can tell upfront what a better design would be and how to achieve it.

Thanks
Mathew
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

OK, the data arrive serially. You immediately partition on the sort key (using Hash or Modulus as the partitioning algorithm, the latter only if the sort key is an integer sequence); the Sort stage can then sort in parallel. The Sort stage also has a property that lets you vary the amount of memory per partition allocated to sorting (before the sort spills to scratch disk).
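As a rough shell analogue of what the engine does for you (Modulus partitioning on an integer key, illustrative only):

Code: Select all

# deal rows into 4 partitions by key, sort each partition in
# parallel, then merge-collect the sorted partitions
awk -F'|' '{ print > ("part" ($1 % 4) ".dat") }' input.dat
for p in part0 part1 part2 part3; do
  sort -t'|' -n -k1,1 "$p.dat" -o "$p.sorted" &
done
wait
sort -m -t'|' -n -k1,1 part?.sorted -o output.dat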
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mathewb
Premium Member
Posts: 22
Joined: Tue Jul 17, 2007 10:35 pm

Post by mathewb »

Ray,

In any case, the write to the file has to wait for all the records to be sorted in memory, so the process will be slower than my current design unless the in-memory sort is really fast. Also, the process will exhaust the server memory, since it requires more than 5 GB to hold everything, and other jobs run in parallel. Do you think that if I specify 8 GB for the node and another 5 GB of scratch disk just for this process, it will not impact the other jobs?

In the current design I specified 4 stages to pull from the same table with different ranges. Can I define one stage and invoke it 4 times within a job to pull the different ranges simultaneously? And in a sequence, if I am invoking the same job 4 times, can the same range parameter be assigned a different value each time?
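In other words, can I drive it from the command line something like this (job and parameter names made up)?

Code: Select all

# same multi-instance job run 4 times with a different range each
for i in 1 2 3 4; do
  dsjob -run -param RANGE_ID=$i MyProject ExtractJob.$i &
done
wait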

Please let me know if I haven't explained my questions clearly.
leomauer
Premium Member
Posts: 100
Joined: Mon Nov 03, 2003 1:33 pm

Post by leomauer »

You said previously that the table is not indexed. Does that mean that to create your ranges you invoke a table scan 4 times? I would consider that a waste.
The PX approach that Ray recommends should work much better. My experience tells me that it does not wait for the last record to start sorting. It sorts as it reads, and only waits for the last record to put it in the right place and start returning rows in the right order.
In your design you actually land the file before the sort even begins, but then the sort acts exactly the same, waiting for the last record before it begins returning rows.
As for impacting the other jobs: you need to sort, right? No matter what sort software you invoke, the same work has to happen: ordering the records. The quicker you want it done, the more you impact simultaneous processes. So, given the need to sort and the resources available, it only comes down to how efficiently the sort software manages those resources.
I have found that the PX sort is as quick as CoSort, and the control over resources in PX (config files, node pools) seems more flexible.
But if you have limited resources, it is up to you to find the right settings.
mathewb
Premium Member
Posts: 22
Joined: Tue Jul 17, 2007 10:35 pm

Post by mathewb »

Here is the problem.
If I run the select statement, the database responds at only about 5000 records/sec, since we have to go over the dblink. That means I can get at most 18 million an hour (if I have only one child pulling the data). I am sure PX will be able to sort those records in memory within that same hour; I was just saying it will start writing the output file only after receiving the last record, correct? Writing the 18 million to the file takes another 6-7 minutes (at ~45,000 recs/sec).
My bottleneck is in selecting the data.
What are the different ways to improve this? Right now I am trying to use an External Source stage rather than a database stage, so that I can call external programs to pull the data multiple times if needed and stream it. What other ways are there to gain more performance?
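The idea is roughly this (a sketch, with invented script names and credentials), with each output stream feeding its own External Source instance:

Code: Select all

# kick off 4 ranged pulls at once; each output stream is then
# read by its own External Source stage instance
for i in 1 2 3 4; do
  sqlplus -s scott/tiger@localdb @range_pull_$i.sql > /data/stream_$i.dat &
done
wait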
leomauer
Premium Member
Posts: 100
Joined: Mon Nov 03, 2003 1:33 pm

Post by leomauer »

I do not know what a dblink is. But in a "normal" situation, if the table in Oracle is partitioned and you can use the Oracle Enterprise stage, then you can specify partitioned reads in that stage, and it will send as many SQL statements as there are partitions, correctly adding the partitioning criteria to each.
Then you can align the sorting along the same partitions and specify sort/merge on collection. That is probably the best performance that can be achieved by job design; the rest is the environment.
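That is, behind the scenes the stage issues one query per table partition, roughly like this (partition names are examples):

Code: Select all

sqlplus -s scott/tiger@db <<'EOF'
-- one query per table partition, run in parallel by the stage
select * from big_table partition (p_2007_01);
select * from big_table partition (p_2007_02);
EOF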
mathewb
Premium Member
Premium Member
Joined: Tue Jul 17, 2007 10:35 pm

Post by mathewb »

The table is partitioned by date, and the pull needs to be performed for a particular date. This means all my data is in one partition, so partitioned reads in the stage will not help.
mathewb