Problem with sorting 14M records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pkothana
Participant
Posts: 50
Joined: Tue Oct 14, 2003 6:12 am

Problem with sorting 14M records

Post by pkothana »

Hi All,
I am using DS 6.0 with Parallel Extender and have around 14M records to transform. The stages, in order, are File Set, Transformer, Sort, Lookup and Sequential File. According to the log, after about 40% of the data has been read, the Sort stage fails with the error "Scratch space full". Any idea how to handle this problem?

thanks in advance.

Regards
Pinkesh
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Hi,

There are two places you should look at: the sort temp directory (the one you configured in the Sort stage properties) and the UV temp directory set in your dsenv file.

One more piece of advice: instead of using the Sort stage, let Unix sort your file (using syncsort or whatever). If you are reading from a sequential file, use sort in the filter command.

HTH,
Amos
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Problem with sorting 14M records

Post by bigpoppa »

Scratch Space Full means that the physical disk that the Scratchdisk is pointing to in your config file is filling up. You need to point your scratchdisk to a physical disk with more space.
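
Before repointing anything, it helps to confirm how much room the scratch locations really have. A minimal Python sketch, assuming you substitute the scratchdisk paths from your own configuration file (the helper name `free_space_report` and the demo path are made up for illustration):

```python
import shutil
import tempfile


def free_space_report(paths):
    """Return {path: (free_gib, percent_free)} for each directory in paths."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        free_gib = usage.free / 2**30
        # Guard against pseudo-filesystems that report a total size of zero.
        percent_free = 100.0 * usage.free / usage.total if usage.total else 0.0
        report[path] = (free_gib, percent_free)
    return report


if __name__ == "__main__":
    # Hypothetical path: replace with the scratchdisk entries from your config file.
    scratch_dirs = [tempfile.gettempdir()]
    for path, (free_gib, pct) in free_space_report(scratch_dirs).items():
        print(f"{path}: {free_gib:.1f} GiB free ({pct:.1f}%)")
```

Run it on the machine that hosts the node while the job is running to watch the free space drop as the sort spills to disk.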

- BP
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

If you have the possibility to sort outside DataStage, do it. You will not have this problem.

Pey
jseclen
Participant
Posts: 133
Joined: Wed Mar 05, 2003 4:19 pm
Location: Lima - Peru. Sudamerica

Re: Problem with sorting 14M records

Post by jseclen »

Hi Pinkesh

You can use the Unix sort, which is more efficient than the DataStage sort ...

In the Unix command you can redirect the sort's temporary work area with the -T option:

> sort -T /home/temp ......
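
For anyone scripting this outside the job, here is a rough Python sketch of the same idea. It assumes a Unix `sort` on the PATH that accepts `-T` and `-o` (GNU sort does); the function name `external_sort` and the file names are made up for illustration:

```python
import os
import subprocess
import tempfile


def external_sort(input_path, output_path, temp_dir):
    """Sort a text file with the Unix sort utility, spilling work files to temp_dir."""
    # LC_ALL=C gives plain byte-wise ordering and avoids locale surprises.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        ["sort", "-T", temp_dir, "-o", output_path, input_path],
        check=True,  # raise CalledProcessError if sort fails (e.g. disk full)
        env=env,
    )


if __name__ == "__main__":
    # Hypothetical demo data; in practice temp_dir would be a large filesystem.
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "unsorted.txt")
        dst = os.path.join(work, "sorted.txt")
        with open(src, "w") as fh:
            fh.write("pear\napple\nmango\n")
        external_sort(src, dst, work)
        with open(dst) as fh:
            print(fh.read(), end="")
```

Note this runs a single sort process, so it does not use the parallelism of a PX job.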

:D
Saludos,

Miguel Seclén
Lima - Peru
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Problem with sorting 14M records

Post by Teej »

*sigh*

This is a PX issue. Server solutions do not resolve PX problems.

Now back to the "Scratch space full" error.

You do know that you can point to multiple locations for the same node in your configuration file?

Are you even aware of your own configuration file? You should have something like this:

Code:

{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
        }
}

You can do something like this:

Code:

{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource disk "/mountB/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
                resource scratchdisk "/mountB/Scratch" {pools ""}
        }
}

This file also controls how DataStage does things in parallel. The more nodes it defines, the more processes DataStage starts for the same stages. Set correctly, your job will FLY.

Doing MPP? This is the same file you have to tweak. Doing SMP? Same file.
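
To see at a glance which scratch locations each node has, something like the Python sketch below can list them. It is only a line-by-line regex scan that assumes the layout shown in the examples above (one `node` or `resource` entry per line), not a real parser for the configuration file format, and `scratchdisks_by_node` is a made-up helper name:

```python
import re

# Patterns assume one entry per line, as in the sample configuration above.
NODE_RE = re.compile(r'\s*node\s+"([^"]+)"')
SCRATCH_RE = re.compile(r'\s*resource\s+scratchdisk\s+"([^"]+)"')

SAMPLE_CONFIG = '''
{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource disk "/mountB/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
                resource scratchdisk "/mountB/Scratch" {pools ""}
        }
}
'''


def scratchdisks_by_node(config_text):
    """Map each node name to the scratchdisk paths declared under it."""
    result = {}
    current = None
    for line in config_text.splitlines():
        node = NODE_RE.match(line)
        if node:
            current = node.group(1)
            result[current] = []
            continue
        scratch = SCRATCH_RE.match(line)
        if scratch and current is not None:
            result[current].append(scratch.group(1))
    return result


if __name__ == "__main__":
    for node, disks in scratchdisks_by_node(SAMPLE_CONFIG).items():
        print(node, disks)
```

In practice you would read the text from whatever file your configuration lives in instead of the embedded sample.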

See page 10-1, "The Parallel Extender Configuration File", in the DataStage Manager Guide online documentation, which should be included with your DataStage Client installation.

As for sort efficiency -- 7.0.1 made a major improvement in Sort performance, along with Lookup and other areas. By designing your job around the Unix command line, you limit yourself to one CPU, and you will not benefit from the improvements in the new version when you upgrade.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Agree totally with *sigh*

Folks, this is why new posts require you to indicate whether the post is about server, parallel or mainframe. Please heed what's there!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pkothana
Participant
Posts: 50
Joined: Tue Oct 14, 2003 6:12 am

Post by pkothana »

Hi,

Thanks a lot for your valuable suggestions.

Best Regards

Pinkesh
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Ooo, Ray! You're an inner circle boy! :shock:

Hehe.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).