Configuration file with multiple scratch/dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Configuration file with multiple scratch/dataset

Post by zaino22 »

Question on configuration file.

In the first example, the single node has one dataset directory and one scratch directory available, but in the second example it has four dataset directories and four scratch directories.

What determines how data is partitioned? I thought it was the number of nodes, and that having more dataset or scratch directories meant redundancy.
Based on that, looking at examples 1 and 2 below, I would say the output from a job using either of these configuration files will always be the same (sequential), and the extra dataset directories in example 2 are just for redundancy.

Have I got it all wrong?

First example:
=========

Code:

{
   node "node0"
   {
           fastname "L8BACK"
           pools ""
           resource disk "/isdataset0/myProjectName/datasets" {pools ""}

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""}
   }
}
Second example:
============

Code:

{
   node "node0"
   {
           fastname "L8BACK"
           pools ""
           resource disk "/isdataset0/myProjectName/datasets" {pools ""}
           resource disk "/isdataset1/myProjectName/datasets" {pools ""}
           resource disk "/isdataset2/myProjectName/datasets" {pools ""}
           resource disk "/isdataset3/myProjectName/datasets" {pools ""}

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""}
   }
}
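(For reference, a job picks up whichever of these files the APT_CONFIG_FILE environment variable points to, so switching between the two layouts is just a matter of repointing that variable; a minimal sketch, with an illustrative path:)

Code:

# Select the configuration file for the next run (path is just an example).
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/one_node.apt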
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not redundancy, just more space and more I/O bandwidth (provided they're on different file systems).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Sorry, wrong choice of word. I understood them as extra space in case one overflows, but wrote "redundancy". I missed my last ESL class :)

So, other than that, am I correct that it's only the number of nodes that determines partitioning, and that both example APT configuration files will process sequentially?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Both configuration files specify execution on a single node, since only one node is defined. It might be interesting to test whether you can get parallelism, for example by specifying multiple readers per node for reading a sequential file and writing to a Data Set (which would presumably be spread across the resource disks, given a sufficiently large volume of data).
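A quick way to confirm how much parallelism you actually get is to set APT_DUMP_SCORE for the run, which writes the parallel "score" (the operators and data sets with their partition counts) to the job log; a minimal sketch, assuming you can export environment variables for the job:

Code:

# With APT_DUMP_SCORE set, the job log includes the parallel "score",
# showing how many partitions each operator actually runs in.
export APT_DUMP_SCORE=1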
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

I tested the one-node configuration writing to multiple file systems (this applies to the Lookup File Set/Data Set only, since sequential files are created in a separate folder and do not use the configuration file).

Data was spread out across all four file systems (/isdataset0/1/2/3) in 32K chunks. DataStage appeared to write the first 32K of data to /isdataset0, then /isdataset1, and so on; once it reached the last file system (/isdataset3), it started again from /isdataset0. In other words, a sequential write pattern was observed.
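(One simple way to watch the spread is to poll the four resource disk directories while the job runs; a sketch using the directories from the configuration file above:)

Code:

# Poll the four resource disk directories while the job runs.
while true; do
    du -sk /isdataset0/myProjectName/datasets /isdataset1/myProjectName/datasets \
           /isdataset2/myProjectName/datasets /isdataset3/myProjectName/datasets
    sleep 1
done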

I wanted to know whether DataStage was writing in parallel because of multiple writers. Our UNIX team claims it is writing sequentially, and that matches what I see.

If somebody knows a way to turn this into a parallel write, please let me know.
So far, the only advantage I see to having multiple file systems for the resource and scratch disks is what Ray already said, "more I/O bandwidth", since they are all on different file systems.

Thanks, Ray, for encouraging me to test this approach.

Hope this helps!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Well, it is sequential in some sense, although the DataStage parallel engine simply doesn't move data in chunks of less than 32KB. So what you're seeing is round-robin allocation over the directories marked as resource disk for your node. This is the only way DataStage does it when there are multiple resource disk directories per node.
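You can see this for yourself: the .ds file the job writes is only a small descriptor, while the actual data segment files sit under the resource disk directories and grow in that round-robin order. A quick look, using the directories from your configuration file (exact data file names vary by version):

Code:

# The descriptor is tiny; the data segment files live on the resource disks.
ls -l /isdataset0/myProjectName/datasets /isdataset1/myProjectName/datasets \
      /isdataset2/myProjectName/datasets /isdataset3/myProjectName/datasets

Which is also why those data files should only be removed through the Data Set Management tool or orchadmin, never with a plain rm on the descriptor.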
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

I am excited to learn how DataStage does certain things "under the hood", so thank you for enlightening everyone here.

Initially, one of the "experts" told me that the writer can write to all four file systems at once, but when I told her my test proved otherwise, she asked me to check with the Unix team; she suspects the Unix file systems are not set up properly, and that if they were set up properly it could write to all four at once.

Her response kind of disappointed me and threw me off the "highway of excitement". I checked with the Unix team and their answer was yes, it writes sequentially, and now your response confirms it.

Could we have achieved the parallelism (in writing files) through multiple file systems, as the "expert" claimed?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In a phrase, "multiple nodes", which is what I'm sure your expert was referring to.

Consider the following configuration file.

Code:

{ 
   node "node01" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
   } 
   node "node02" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
   } 
   node "node03" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
   } 
   node "node04" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
   } 
}
This will write to all four devices in parallel.

But...

The resource disk is only used by Data Sets, File Sets and Lookup File Sets. If you're writing anything else, you have to organize any parallelism by other means. And the operating system will never let you have more than one writer to a simple text file.
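(If flat-file output has to stay parallel, a File Set is the usual compromise: each partition writes its own data file under the resource disks, and the .fs descriptor is a plain-text file that lists them. A quick way to see that, with an illustrative descriptor path:)

Code:

# A File Set descriptor is plain text listing the data files, one or more per node.
cat /home/dsadm/output/customers.fs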
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

"expert" was insisting that even with one node, having four file systems, four writers will be writing to four file systems. I am not that expert at all hence my visits to dsxchange to learn more. I accepted her claim initially but when you asked me to try it myself I did and told her this is now how it was happening. She insisted that's how it should work if it is not Unix team has not set it up correctly.

With my little bit of knowledge I understand having four nodes will write in parallel assuming we are working with Fileset, Lookup fileset, and Dataset. I hope you don't get annoyed when I clarify some details again since I want to pass on correct information to others after fully understanding this myself first. So thank you Ray for clarification. :)

From your response is it fair to assume that having one node, four file system will not write files (dataset, fileset, and lookup fileset) in parallel but it will be writing in Round Robin, and nothing else can be done to achieve what our "expert" is claiming can be done via Unix.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's my understanding, in 32K blocks. But the round robin should be happening so quickly that they appear to be writing in parallel.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Technically, your expert might be able to argue as follows: a file unit is opened on each resource disk before any actual write occurs, so that the file units are "writing" in parallel.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Thank you, Ray. I really appreciate your input on this.