In my job I am using sort stages in primary and reference links which goes to the join stage. The Primary link and reference link both has huge data(oracle tables), primary link has around 15 million records and reference link has 35-40 million records. The job is running for more than 2 hours but the records does not come out of sort stages. I am using datastage sort option. There is no duplicate data in the primary or reference links.
Please let me know what performance tuning steps I can choose to make the sorting and joining to happen fastly.
sort stage is taking too much of time
Moderators: chulett, rschirm, roy
1) Reduce the amount of data being sorted (eliminate unnecessary columns and records, don't sort on keys that aren't really required)
2) Perform the sort somewhere else! Sort the data on the database SELECT clause instead of DataStage - use the horsepower of the database system.
- for that matter - if you are just doing a sort and join why aren't you doing BOTH in the database?
2) Perform the sort somewhere else! Sort the data on the database SELECT clause instead of DataStage - use the horsepower of the database system.
- for that matter - if you are just doing a sort and join why aren't you doing BOTH in the database?
Re: sort stage is taking too much of time
What degree of parallelism are you running the job with (how many nodes)?
As you are sorting a large number of records, you can try increasing the memory allocation for the sort stage's work area. In the sort stage, you can add the "Restrict Memory" option and then modify the value. The default is 20 (20MB). Try increasing this to 200 or 256 instead.
If you are using inline/inlink sorts (checking the Sort option on the input links to the join stage), you can either replace those with the separate Sort stage and do the above, or you can set the APT_TSORT_STRESS_BLOCKSIZE environment variable to one of the same values as above.
Also, if you are partition and sorting on multiple columns, you can try reducing the number of columns you partition on. For example: Instead of partioning/sorting on columns A,B,C,D and E, partition only on A,B,C or even just A. This will speed up the partitioning process but will affect the spread of the data across partitions. You want to use the partitioning which provides the best spread across the partitions while still meeting the requirements of downstream logic.
Regards,
As you are sorting a large number of records, you can try increasing the memory allocation for the sort stage's work area. In the sort stage, you can add the "Restrict Memory" option and then modify the value. The default is 20 (20MB). Try increasing this to 200 or 256 instead.
If you are using inline/inlink sorts (checking the Sort option on the input links to the join stage), you can either replace those with the separate Sort stage and do the above, or you can set the APT_TSORT_STRESS_BLOCKSIZE environment variable to one of the same values as above.
Also, if you are partition and sorting on multiple columns, you can try reducing the number of columns you partition on. For example: Instead of partioning/sorting on columns A,B,C,D and E, partition only on A,B,C or even just A. This will speed up the partitioning process but will affect the spread of the data across partitions. You want to use the partitioning which provides the best spread across the partitions while still meeting the requirements of downstream logic.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
Since the OP hasn't posted again in nearly a week, I wouldn't consider it hijacking. ![Smile :)](./images/smilies/icon_smile.gif)
DataStage's tsort operator is still the recommended method. Using the Unix sort option would involve some degree of exporting/importing data to and from the internal virtual records. There are certainly also some internal differences that would give the edge to the DS tsort operator as well in the parallel environment:
1) The ability to group like-keyed records together without sorting
2) The ability to take advantage of pre-existing sort order of records in downstream sorts (potential for big throughput increases on downstream sorts)
3) Add a column to identify key changes/duplicate record keys
These abilities are not available with the Unix sort function.
![Smile :)](./images/smilies/icon_smile.gif)
DataStage's tsort operator is still the recommended method. Using the Unix sort option would involve some degree of exporting/importing data to and from the internal virtual records. There are certainly also some internal differences that would give the edge to the DS tsort operator as well in the parallel environment:
1) The ability to group like-keyed records together without sorting
2) The ability to take advantage of pre-existing sort order of records in downstream sorts (potential for big throughput increases on downstream sorts)
3) Add a column to identify key changes/duplicate record keys
These abilities are not available with the Unix sort function.
Last edited by jwiles on Tue Feb 08, 2011 12:16 am, edited 1 time in total.
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
And, darn the luck, Unix sort only allows certain datatypes as key columns!
main_program: Fatal Error: Key field e is of type Date.
This is not one of the accepted types: int8, uint8, int16, uint16, int32,
uint32, int64, uint64, sfloat, dfloat, string.
Regards,
main_program: Fatal Error: Key field e is of type Date.
This is not one of the accepted types: int8, uint8, int16, uint16, int32,
uint32, int64, uint64, sfloat, dfloat, string.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.