Problem with sorting 14M records

pkothana · Post by **pkothana** » Thu Nov 13, 2003 5:11 am

Hi All,
I am using DS 6.0 with Parallel Extender and have around 14M records to transform. The stages involved are File Set, Transformer, Sort, Lookup and Sequential File, in that order. According to the log, once the job has read around 40% of the data, the Sort stage fails with the error "Scratch space full". Any idea how to handle this problem?

Thanks in advance.

Regards
Pinkesh

Amos.Rosmarin · Post by **Amos.Rosmarin** » Thu Nov 13, 2003 6:26 am

Hi,

There are two places you should look at: the sort temp directory (the one you configured in the Sort stage properties) and the UV temp dir from your DSENV file.

Another piece of advice: instead of using the Sort stage, let Unix sort your file (using syncsort or whatever). If you are reading from a sequential file, use sort in the filter command.

HTH,
Amos

bigpoppa · Post by **bigpoppa** » Thu Nov 13, 2003 8:57 am

Scratch Space Full means that the physical disk that the Scratchdisk is pointing to in your config file is filling up. You need to point your scratchdisk to a physical disk with more space.
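
One way to check this is to look at the free space behind each scratchdisk entry. The sketch below is only an illustration. It assumes the active configuration file is the one named by the APT_CONFIG_FILE environment variable. `list_scratchdisks` is a hypothetical helper written for this example, not something DataStage ships.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every "resource scratchdisk"
# entry found in the configuration file passed as $1 (duplicates removed).
list_scratchdisks() {
    sed -n 's/^[[:space:]]*resource[[:space:]]\{1,\}scratchdisk[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Assumption: APT_CONFIG_FILE points at the configuration file the job uses.
# If it is unset or unreadable, this part is skipped.
if [ -r "${APT_CONFIG_FILE:-}" ]; then
    list_scratchdisks "$APT_CONFIG_FILE" | while IFS= read -r dir; do
        # Show free space (in KB) on the file system holding this directory.
        df -k "$dir"
    done
fi
```

If several scratchdisk entries turn out to live on the same file system, they share the same free space, so adding more entries on that mount does not add capacity.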

- BP

Peytot · Post by **Peytot** » Thu Nov 13, 2003 9:05 am

If you have the possibility to sort outside DataStage, do it. You will not have this problem.

Pey

jseclen · Post by **jseclen** » Thu Nov 13, 2003 10:31 am

Hi Pinkesh

You can use the Unix sort, which is more efficient than the DataStage sort ...

In the Unix command you can redefine the temporary area used by sort with the -T option:

> sort -T /home/temp ......

:D
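
For illustration, here is a small self-contained run of sort with -T. The file names and directories are made up for the example. The key options (-t and -k) are an assumption about a comma-delimited file with a numeric first column, since the arguments of the command above were not given.

```shell
#!/bin/sh
# Hypothetical working area; in practice -T should point at a directory
# on a file system with enough free space for sort's temporary files.
work_dir=$(mktemp -d)
sort_tmp="$work_dir/sort_tmp"
mkdir -p "$sort_tmp"

# Tiny sample input standing in for the real data file.
printf '%s\n' '3,carol' '1,alice' '2,bob' > "$work_dir/input.csv"

# -T      : directory where sort writes its temporary files
# -t,     : fields are separated by commas
# -k1,1n  : sort numerically on the first field only
sort -T "$sort_tmp" -t, -k1,1n "$work_dir/input.csv" > "$work_dir/sorted.csv"

cat "$work_dir/sorted.csv"
```

Note that this runs as a single Unix process outside the DataStage job, so it does not use the parallel engine's scratchdisk settings at all.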

Teej · Post by **Teej** » Thu Nov 13, 2003 12:47 pm

*sigh*

This is a PX issue. Server solutions do not resolve PX problems.

Now back to the "Scratch space full" error.

You do know that you can point to multiple locations for the same node in your configuration file?

Are you even aware of your own configuration file? You should have something like this:

{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
        }
}

You can do something like this:

{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource disk "/mountB/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
                resource scratchdisk "/mountB/Scratch" {pools ""}
        }
}

This file also controls how DataStage runs things in parallel. The more nodes you define, the more processes DataStage starts for the same stages. Set correctly, your job will FLY.

Doing MPP? This is the same file you have to tweak. Doing SMP? Same file.
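
As a rough sketch of the "more nodes" idea, the script below writes a two-node variant of the configuration above and points the parallel engine at it. Several things here are assumptions rather than anything stated in this thread: the output file name is hypothetical, both logical nodes are placed on the same host "mybigcomputer", each node is given its own mount for scratch space, and the engine is assumed to pick up the file through the APT_CONFIG_FILE environment variable.

```shell
#!/bin/sh
# Hypothetical location for the example configuration file.
cfg_file="${TMPDIR:-/tmp}/two_node_example.apt"

# Two logical nodes on the same host, each with its own disk and
# scratchdisk resources (mount points reused from the example above).
cat > "$cfg_file" <<'EOF'
{
        node "node1"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountA/Dataset" {pools ""}
                resource scratchdisk "/mountA/Scratch" {pools ""}
        }
        node "node2"
        {
                fastname "mybigcomputer"
                pools ""
                resource disk "/mountB/Dataset" {pools ""}
                resource scratchdisk "/mountB/Scratch" {pools ""}
        }
}
EOF

# Assumption: the job's environment reads the configuration from this variable.
export APT_CONFIG_FILE="$cfg_file"
echo "Using configuration: $APT_CONFIG_FILE"
```

With two nodes the data is partitioned, so each Sort process only has to spill its own share to its own scratchdisk. Whether that is enough for 14M records still depends on record width and on how much free space each mount really has.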

See Page 10-1, "The Parallel Extender Configuration File", in the DataStage Manager Guide online documentation, which should be included with your DataStage Client installation.

As for Sort efficiency: 7.0.1 made a major performance improvement here, along with Lookup and other areas. By designing your job to shell out to Unix, you limit yourself to one CPU, and you also give up the benefit of those improvements when you upgrade.

-T.J.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 13, 2003 3:05 pm

Agree totally with *sigh*

Folks, this is why new posts require you to indicate whether the post is about server, parallel or mainframe. Please heed what's there!

pkothana · Post by **pkothana** » Thu Nov 13, 2003 10:17 pm

Hi,

Thanks a lot for your valuable suggestions.

Best Regards

Pinkesh

Teej · Post by **Teej** » Sun Nov 16, 2003 9:12 pm

Ooo, Ray! You're an inner circle boy!

Hehe.

-T.J.

DSXchange