memory issues with Sort stage

pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

memory issues with Sort stage

Post by pavan_test »

Hi,

I have a Sort stage in my DataStage job, and my input file is 2 TB in size. Can anyone tell me whether I can use the Sort stage to sort input data of this size?

What performance and memory issues should I expect if I use the Sort stage to sort such a huge file on a key column?

Any suggestions are welcome.

Regards,
Mark
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Re: memory issues with Sort stage

Post by sud »

For a 2 TB file I would always go for UNIX sort. If you do use DataStage, use the UNIX sort option of the Sort stage.
shamshad
Premium Member
Posts: 147
Joined: Wed Aug 25, 2004 1:39 pm
Location: Detroit,MI

Post by shamshad »

Would it be possible to run a before-job routine (a shell script) that simply sorts your file and saves the sorted output to another file? Then you can read the sorted file in DataStage as a source.

UNIX handles sorting very efficiently. We saw a considerable improvement when using sort in UNIX compared to DataStage. We were originally sorting a text file and then removing duplicates in a DataStage job; when we did the same thing in UNIX, it took less time.

It's more of a design question: how much logic should one keep within DataStage?
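
For anyone who wants to try the pre-sort approach, here is a rough sketch of a script that a before-job subroutine (ExecSH, for example) could call. It simply hands the work to the system `sort` command. The function name `presort_file`, the comma delimiter, and the choice of the first field as the key are my own assumptions for illustration, and the `-T` option (temporary directory) is available in GNU sort and most UNIX variants, so check your platform's man page.

```python
import os
import subprocess
from pathlib import Path


def presort_file(src, dest, key_field=1, delimiter=",", tmp_dir=None, unique=False):
    """Sort a delimited text file on one field with the system `sort` command.

    Hypothetical helper: the defaults are illustrative and not taken from
    the original poster's job.
    """
    cmd = [
        "sort",
        "-t", delimiter,                    # field separator
        "-k", f"{key_field},{key_field}",   # restrict the key to this one field
        "-o", str(dest),                    # write the sorted result to dest
    ]
    if tmp_dir is not None:
        cmd += ["-T", str(tmp_dir)]         # where sort writes its temporary files
    if unique:
        cmd.append("-u")                    # keep one line per distinct key value
    cmd.append(str(src))

    # LC_ALL=C forces plain byte-order comparison, independent of the locale.
    env = {**os.environ, "LC_ALL": "C"}
    subprocess.run(cmd, check=True, env=env)
    return Path(dest)
```

With a file this large, point `tmp_dir` at a file system with plenty of free space, because `sort` spills intermediate data there. Passing `unique=True` also covers the duplicate removal mentioned above, with the caveat that it keeps one line per key value rather than one line per distinct record.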
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In version 7.5 and later, DataStage Sort stage will outperform UNIX sort.

Make sure that you have PLENTY of scratch disk configured to sort a file of this size. Use multiple file systems per partition for scratch disk, to improve disk I/O throughput. More is better.
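
To see why scratch space matters so much, here is a toy external merge sort. This is only a sketch of the general technique and not how the DataStage Sort stage is implemented; the function name `external_sort` and the `run_size` parameter are made up for the illustration. Data that does not fit in memory is sorted in chunks, each chunk is written out as a sorted run, and the runs are merged at the end. The runs together are about as large as the input, and rotating them across several scratch directories spreads the I/O.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(lines, scratch_dirs, run_size=100_000, key=None):
    """Yield lines in sorted order, spilling sorted runs to scratch_dirs.

    Illustrative only: assumes each line is a string without a newline.
    """
    run_paths = []
    source = iter(lines)
    dirs = itertools.cycle(scratch_dirs)    # rotate runs across scratch areas
    try:
        while True:
            # Hold at most run_size lines in memory at a time.
            chunk = list(itertools.islice(source, run_size))
            if not chunk:
                break
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(dir=next(dirs), suffix=".run")
            with os.fdopen(fd, "w") as f:
                f.writelines(line + "\n" for line in chunk)
            run_paths.append(path)

        files = [open(p) for p in run_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            # Merge the sorted runs while reading them sequentially.
            yield from heapq.merge(*runs, key=key)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary runs once the merge is finished or abandoned.
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort(records, ["/scratch1", "/scratch2"]))` would alternate runs between two scratch directories (the paths here are placeholders). If those directories sit on separate file systems, reads and writes during the merge are not all competing for the same disk.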