Configuration file with multiple scratch/dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Configuration file with multiple scratch/dataset

Post by zaino22 »

Question on configuration file.

In the first example, the single node has one dataset directory and one scratch directory available, but in the second example it has four dataset directories and four scratch directories.

What determines how data is partitioned? I thought it was the number of nodes, and that having more dataset or scratch directories meant redundancy.
Based on that, looking at examples 1 and 2 below, I would say the output from a job using either of these configuration files will always be the same (sequential), and the extra dataset directories in example 2 are just for redundancy.

Have I got it all wrong?

First example:
=========

Code:

{
   node "node0"
   {
           fastname "L8BACK"
           pools ""
           resource disk "/isdataset0/myProjectName/datasets" {pools ""}

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""}
   }
}
Second example:
============

Code:

{
   node "node0"
   {
           fastname "L8BACK"
           pools ""
           resource disk "/isdataset0/myProjectName/datasets" {pools ""}
           resource disk "/isdataset1/myProjectName/datasets" {pools ""}
           resource disk "/isdataset2/myProjectName/datasets" {pools ""}
           resource disk "/isdataset3/myProjectName/datasets" {pools ""}

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""}
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""}
   }
}
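(For reference, a job picks up whichever of these files the APT_CONFIG_FILE environment variable points to, so switching between the two layouts is just a matter of repointing that variable; a minimal sketch, with an illustrative path:)

Code:

# Select the configuration file for the next run (path is just an example).
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/one_node.apt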
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not redundancy, just more space and more I/O bandwidth (provided they're on different file systems).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Sorry, wrong choice of word. I understood them as extra space in case one overflows, but wrote "redundancy". I missed my last ESL class :)

So, other than that, am I correct that it's only the number of nodes that determines partitioning, and that both example APT configuration files will process sequentially?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Both configuration files specify execution on a single node, since only one node is defined. It might be interesting to test whether you can get parallelism, for example by specifying multiple readers per node for reading a sequential file and writing to a Data Set (which would presumably be spread across the resource disks, given a sufficiently large volume of data).
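A quick way to confirm how much parallelism you actually get is to set APT_DUMP_SCORE for the run, which writes the parallel "score" (the operators and data sets with their partition counts) to the job log; a minimal sketch, assuming you can export environment variables for the job:

Code:

# With APT_DUMP_SCORE set, the job log includes the parallel "score",
# showing how many partitions each operator actually runs in.
export APT_DUMP_SCORE=1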
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

I tested the one-node configuration writing to multiple file systems (this applies to the Lookup File Set/Data Set only, since sequential files are created in a separate folder and do not use the configuration file).

Data was spread out across all four file systems (/isdataset0/1/2/3) in 32K chunks. DataStage appeared to write the first 32K of data to /isdataset0, then /isdataset1, and so on; once it reached the last file system (/isdataset3), it started again from /isdataset0. In other words, a sequential write pattern was observed.
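(One simple way to watch the spread is to poll the four resource disk directories while the job runs; a sketch using the directories from the configuration file above:)

Code:

# Poll the four resource disk directories while the job runs.
while true; do
    du -sk /isdataset0/myProjectName/datasets /isdataset1/myProjectName/datasets \
           /isdataset2/myProjectName/datasets /isdataset3/myProjectName/datasets
    sleep 1
done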

I wanted to know whether DataStage was writing in parallel because of multiple writers. Our UNIX team claims it is writing sequentially, and that matches what I see.

If somebody knows a way to turn this into a parallel write, please let me know.
So far, the only advantage I see to having multiple file systems for the resource and scratch disks is what Ray already said, "more I/O bandwidth", since they are all on different file systems.

Thanks, Ray, for encouraging me to test this approach.

Hope this helps!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Well, it is sequential in some sense, although the DataStage parallel engine simply doesn't move data in chunks of less than 32KB. So what you're seeing is round-robin allocation over the directories marked as resource disk for your node. This is the only way DataStage does it when there are multiple resource disk directories per node.
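You can see this for yourself: the .ds file the job writes is only a small descriptor, while the actual data segment files sit under the resource disk directories and grow in that round-robin order. A quick look, using the directories from your configuration file (exact data file names vary by version):

Code:

# The descriptor is tiny; the data segment files live on the resource disks.
ls -l /isdataset0/myProjectName/datasets /isdataset1/myProjectName/datasets \
      /isdataset2/myProjectName/datasets /isdataset3/myProjectName/datasets

Which is also why those data files should only be removed through the Data Set Management tool or orchadmin, never with a plain rm on the descriptor.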
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

I am excited to learn how DataStage does certain things "under the hood", so thank you for enlightening everyone here.

Initially, one of the "experts" told me that the writer can write to all four file systems at once, but when I told her my test proved otherwise, she asked me to check with the Unix team; she suspects the Unix file systems are not set up properly, and that if they were set up properly it could write to all four at once.

Her response kind of disappointed me and threw me off the "highway of excitement". I checked with the Unix team and their answer was yes, it writes sequentially, and now your response confirms it.

Could we have achieved the parallelism (in writing files) through multiple file systems, as the "expert" claimed?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In a phrase, "multiple nodes", which is what I'm sure your expert was referring to.

Consider the following configuration file.

Code:

{ 
   node "node01" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
   } 
   node "node02" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
   } 
   node "node03" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
   } 
   node "node04" 
   { 
           fastname "L8BACK" 
           pools "" 
           resource disk "/isdataset3/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset0/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset1/myProjectName/datasets" {pools ""} 
           resource disk "/isdataset2/myProjectName/datasets" {pools ""} 

           resource scratchdisk "/isscdisk3/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk0/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk1/myProjectName/scratch" {pools ""} 
           resource scratchdisk "/isscdisk2/myProjectName/scratch" {pools ""} 
   } 
}
This will write to all four devices in parallel.

But...

The resource disk is only used by Data Sets, File Sets and Lookup File Sets. If you're writing anything else, you have to organize any parallelism by other means. And the operating system will never let you have more than one writer to a simple text file.
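(If flat-file output has to stay parallel, a File Set is the usual compromise: each partition writes its own data file under the resource disks, and the .fs descriptor is a plain-text file that lists them. A quick way to see that, with an illustrative descriptor path:)

Code:

# A File Set descriptor is plain text listing the data files, one or more per node.
cat /home/dsadm/output/customers.fs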
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

"expert" was insisting that even with one node, having four file systems, four writers will be writing to four file systems. I am not that expert at all hence my visits to dsxchange to learn more. I accepted her claim initially but when you asked me to try it myself I did and told her this is now how it was happening. She insisted that's how it should work if it is not Unix team has not set it up correctly.

With my little bit of knowledge I understand having four nodes will write in parallel assuming we are working with Fileset, Lookup fileset, and Dataset. I hope you don't get annoyed when I clarify some details again since I want to pass on correct information to others after fully understanding this myself first. So thank you Ray for clarification. :)

From your response is it fair to assume that having one node, four file system will not write files (dataset, fileset, and lookup fileset) in parallel but it will be writing in Round Robin, and nothing else can be done to achieve what our "expert" is claiming can be done via Unix.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's my understanding, in 32K blocks. But the round robin should be happening so quickly that they appear to be writing in parallel.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Technically, your expert might be able to argue as follows: a file unit is opened on each resource disk before any actual write occurs, so that the file units are "writing" in parallel.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Thank you, Ray. I really appreciate your input on this.