Page 1 of 1

sort stage is taking too much of time

Posted: Tue Feb 01, 2011 11:06 am
by yogeshhd
In my job I am using sort stages in primary and reference links which goes to the join stage. The Primary link and reference link both has huge data(oracle tables), primary link has around 15 million records and reference link has 35-40 million records. The job is running for more than 2 hours but the records does not come out of sort stages. I am using datastage sort option. There is no duplicate data in the primary or reference links.
Please let me know what performance tuning steps I can choose to make the sorting and joining to happen fastly.

Posted: Tue Feb 01, 2011 11:55 am
by asorrell
1) Reduce the amount of data being sorted (eliminate unnecessary columns and records, don't sort on keys that aren't really required)
2) Perform the sort somewhere else! Sort the data on the database SELECT clause instead of DataStage - use the horsepower of the database system.

- for that matter - if you are just doing a sort and join why aren't you doing BOTH in the database?

Re: sort stage is taking too much of time

Posted: Tue Feb 01, 2011 11:57 am
by jwiles
What degree of parallelism are you running the job with (how many nodes)?

As you are sorting a large number of records, you can try increasing the memory allocation for the sort stage's work area. In the sort stage, you can add the "Restrict Memory" option and then modify the value. The default is 20 (20MB). Try increasing this to 200 or 256 instead.

If you are using inline/inlink sorts (checking the Sort option on the input links to the join stage), you can either replace those with the separate Sort stage and do the above, or you can set the APT_TSORT_STRESS_BLOCKSIZE environment variable to one of the same values as above.

Also, if you are partition and sorting on multiple columns, you can try reducing the number of columns you partition on. For example: Instead of partioning/sorting on columns A,B,C,D and E, partition only on A,B,C or even just A. This will speed up the partitioning process but will affect the spread of the data across partitions. You want to use the partitioning which provides the best spread across the partitions while still meeting the requirements of downstream logic.

Regards,

Posted: Mon Feb 07, 2011 1:07 pm
by bobyon
I hope this is not considered hijacking the topic, but....


what is the latest wisdom on whether to use datastage (8.0.1) sort or the unix (v5.3) sort option?

Posted: Mon Feb 07, 2011 8:42 pm
by jwiles
Since the OP hasn't posted again in nearly a week, I wouldn't consider it hijacking. :)

DataStage's tsort operator is still the recommended method. Using the Unix sort option would involve some degree of exporting/importing data to and from the internal virtual records. There are certainly also some internal differences that would give the edge to the DS tsort operator as well in the parallel environment:

1) The ability to group like-keyed records together without sorting
2) The ability to take advantage of pre-existing sort order of records in downstream sorts (potential for big throughput increases on downstream sorts)
3) Add a column to identify key changes/duplicate record keys

These abilities are not available with the Unix sort function.

Posted: Tue Feb 08, 2011 12:10 am
by ray.wurlod
3) Change your expectations based on more knowledge - specifically that no rows can come out of the Sort stage till all rows have arrived into the stage.

Posted: Tue Feb 08, 2011 12:25 am
by jwiles
And, darn the luck, Unix sort only allows certain datatypes as key columns!

main_program: Fatal Error: Key field e is of type Date.
This is not one of the accepted types: int8, uint8, int16, uint16, int32,
uint32, int64, uint64, sfloat, dfloat, string.

Regards,

Posted: Tue Feb 08, 2011 6:19 am
by bobyon
Thanks Guys.