Q. apt file Node setting

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm


Post by wuruima »

Dear all,
May I ask a quick question?
If I have 4 file systems (each one 10 GB) and I set the APT configuration file like this:

Code:

{
        node "node1"
        {
                fastname "xxx"
                pools ""
                resource disk "/Node1/DataSets81" {pools ""}
                resource scratchdisk "/Node1/Scratch81" {pools ""}
        }
        node "node2"
        {
                fastname "xxx"
                pools ""
                resource disk "/Node2/DataSets81" {pools ""}
                resource scratchdisk "/Node2/Scratch81" {pools ""}
        }
        node "node3"
        {
                fastname "xxx"
                pools ""
                resource disk "/Node3/DataSets81" {pools ""}
                resource scratchdisk "/Node3/Scratch81" {pools ""}
        }
        node "node4"
        {
                fastname "xxx"
                pools ""
                resource disk "/Node4/DataSets81" {pools ""}
                resource scratchdisk "/Node4/Scratch81" {pools ""}
        }
}
I have a parallel job with an input sequential file (50 GB). Will the job automatically split the 50 GB file across the 4 nodes for processing, or will it fail with a scratch-space-full error? I don't have enough space on my server to test this myself. Would you please kindly tell me what would happen? Thanks.
wuruimao
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The job will automatically split the data over the four nodes for processing. If you use Round Robin partitioning, which is the default in most cases, the rows will be distributed evenly.

Whether or not your scratch space is consumed depends on whether your job performs operations that require scratch space, such as sorting, lookups, etc. That is not something we can easily determine from here, but you can, by using the Resource Estimation tool in DataStage Designer.
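To illustrate the first point, here is a minimal sketch (in Python, not DataStage) of how Round Robin partitioning spreads rows evenly across nodes; the function name and node labels are illustrative only, not part of any DataStage API:

```python
# Sketch of round robin partitioning: row i goes to node i % n_nodes,
# so rows end up spread as evenly as possible across the nodes.
def round_robin_partition(rows, n_nodes):
    partitions = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        partitions[i % n_nodes].append(row)
    return partitions

# Stand-in for records read from the 50 GB sequential file, over 4 nodes.
parts = round_robin_partition(list(range(12)), 4)
for node, part in enumerate(parts, start=1):
    print(f"node{node}: {len(part)} rows")  # each node gets 3 of 12 rows
```

With 12 rows and 4 nodes, every node receives exactly 3 rows; with real data the counts differ by at most one row per node.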
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

But you are still 10 GB short :roll:
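The arithmetic behind that remark, using the sizes given in the question (a 50 GB input against four 10 GB scratch file systems):

```python
# Scratch shortfall if the whole input had to spill to scratch at once.
input_gb = 50           # size of the sequential file
scratch_gb = 4 * 10     # four file systems of 10 GB each
shortfall = input_gb - scratch_gb
print(shortfall)        # 10
```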
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, if you are creative in your job design, you could read the data in, fork off the column you need to sort into its own data stream, and then after the sort join the data back onto the rows.

That's the only way to avoid dropping 50 GB of data into a scratch area that is only 40 GB total. BTW... that also limits what you can do with other stages that use the scratch resource disk.