sort stage is taking too much of time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
yogeshhd
Participant
Posts: 7
Joined: Wed Jul 21, 2010 6:28 am

sort stage is taking too much of time

Post by yogeshhd »

In my job I am using sort stages in primary and reference links which goes to the join stage. The Primary link and reference link both has huge data(oracle tables), primary link has around 15 million records and reference link has 35-40 million records. The job is running for more than 2 hours but the records does not come out of sort stages. I am using datastage sort option. There is no duplicate data in the primary or reference links.
Please let me know what performance tuning steps I can choose to make the sorting and joining to happen fastly.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

1) Reduce the amount of data being sorted (eliminate unnecessary columns and records, don't sort on keys that aren't really required)
2) Perform the sort somewhere else! Sort the data on the database SELECT clause instead of DataStage - use the horsepower of the database system.

- for that matter - if you are just doing a sort and join why aren't you doing BOTH in the database?
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Re: sort stage is taking too much of time

Post by jwiles »

What degree of parallelism are you running the job with (how many nodes)?

As you are sorting a large number of records, you can try increasing the memory allocation for the sort stage's work area. In the sort stage, you can add the "Restrict Memory" option and then modify the value. The default is 20 (20MB). Try increasing this to 200 or 256 instead.

If you are using inline/inlink sorts (checking the Sort option on the input links to the join stage), you can either replace those with the separate Sort stage and do the above, or you can set the APT_TSORT_STRESS_BLOCKSIZE environment variable to one of the same values as above.

Also, if you are partition and sorting on multiple columns, you can try reducing the number of columns you partition on. For example: Instead of partioning/sorting on columns A,B,C,D and E, partition only on A,B,C or even just A. This will speed up the partitioning process but will affect the spread of the data across partitions. You want to use the partitioning which provides the best spread across the partitions while still meeting the requirements of downstream logic.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
bobyon
Premium Member
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

I hope this is not considered hijacking the topic, but....


what is the latest wisdom on whether to use datastage (8.0.1) sort or the unix (v5.3) sort option?
Bob
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Since the OP hasn't posted again in nearly a week, I wouldn't consider it hijacking. :)

DataStage's tsort operator is still the recommended method. Using the Unix sort option would involve some degree of exporting/importing data to and from the internal virtual records. There are certainly also some internal differences that would give the edge to the DS tsort operator as well in the parallel environment:

1) The ability to group like-keyed records together without sorting
2) The ability to take advantage of pre-existing sort order of records in downstream sorts (potential for big throughput increases on downstream sorts)
3) Add a column to identify key changes/duplicate record keys

These abilities are not available with the Unix sort function.
Last edited by jwiles on Tue Feb 08, 2011 12:16 am, edited 1 time in total.
- james wiles


All generalizations are false, including this one - Mark Twain.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

3) Change your expectations based on more knowledge - specifically that no rows can come out of the Sort stage till all rows have arrived into the stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

And, darn the luck, Unix sort only allows certain datatypes as key columns!

main_program: Fatal Error: Key field e is of type Date.
This is not one of the accepted types: int8, uint8, int16, uint16, int32,
uint32, int64, uint64, sfloat, dfloat, string.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
bobyon
Premium Member
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

Thanks Guys.
Bob
Post Reply