Data set descriptor location

Posted: Fri Jan 16, 2009 12:59 am
by jxblack
Hi there,

I have a question regarding the set-up of data sets on a project. A data set has a descriptor and one or more data files (the actual number depends on how many nodes/partitions are specified).

Now the data files of a data set will be stored by the Parallel Engine on the resource disk, e.g. /disk1/Ascential/DataStage/DataSets, but the location of the data set descriptor is determined by the path name specified in the Data Set stage in each job.

Can I just confirm what the best practice is (if one exists) about where the data set descriptors should be located - should they be in the same area as the data files, i.e. in the resource disk directory, or should they be located in a completely separate directory independent of the configuration file area?

Many thanks,

James

Posted: Fri Jan 16, 2009 1:10 am
by ray.wurlod
Completely separate, on a separate file system for preference. I usually create a subdirectory called ControlFiles in the project directory on the server.

Posted: Fri Jan 16, 2009 3:56 pm
by jxblack
Thanks Ray.

What would be the reasons for this specifically?

Is it for ease of maintenance of these files, or is it that, as a general rule, we shouldn't be writing directly to the resource/scratch areas because these are internal to DataStage?

The reason I'm asking is that the proposed directory and file system organisation at the site I'm working at is not differentiating between where the descriptor and the data files of the data sets should be located.

Posted: Fri Jan 16, 2009 8:53 pm
by Alokby
I create a folder for data sets and, under it, two subfolders: one for the descriptors and one for the data,
e.g.
dataset
-data
-desc
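
A minimal sketch of setting up that layout, in case it helps. The function name create_dataset_layout and the base directory argument are made up for illustration; only the folder names dataset, data and desc come from the post above. Note that creating the data folder does not by itself make the Parallel Engine write there: the data files still go wherever the resource disk in the configuration file points.

```python
from pathlib import Path


def create_dataset_layout(base_dir):
    """Create dataset/data and dataset/desc under base_dir (hypothetical helper)."""
    root = Path(base_dir) / "dataset"
    data_dir = root / "data"   # intended target for the resource disk entry
    desc_dir = root / "desc"   # path to use for descriptors in the Data Set stage
    for directory in (data_dir, desc_dir):
        # parents=True creates "dataset" as well; exist_ok=True makes reruns safe
        directory.mkdir(parents=True, exist_ok=True)
    return data_dir, desc_dir
```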

Posted: Fri Jan 16, 2009 9:50 pm
by ray.wurlod
My main reason for keeping them on separate file systems is that if you lose one you don't lose the other, and may therefore be able to reconstruct at least the structure (maybe even restore from backups).
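
As a rough way to check that two directories really are on separate file systems, one could compare their device IDs. This is only a sketch: the function name is hypothetical, both paths must already exist, and it assumes a Unix-like server where st_dev distinguishes mounted file systems.

```python
import os


def on_separate_file_systems(path_a, path_b):
    """Return True if the two existing paths report different device IDs."""
    # st_dev identifies the device (file system) that holds each path
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev
```

For example, it could be called with the descriptor directory and the resource disk directory named in the configuration file; a result of False would mean that losing that one file system loses both descriptors and data.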

My reason for keeping the control files in a subdirectory in the project directory is mainly "keeping everything together", with a secondary reason that I can compare between, say, dev and test to verify that they're behaving similarly.
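
A small sketch of the dev/test comparison mentioned above, listing which descriptor files exist in one environment but not the other. The function name, the directory arguments and the *.ds file pattern are assumptions for illustration (the .ds suffix is only a common naming convention); it compares file names, not descriptor contents.

```python
from pathlib import Path


def compare_descriptor_names(dev_dir, test_dir, pattern="*.ds"):
    """Compare descriptor file names in two control-file directories."""
    dev_names = {p.name for p in Path(dev_dir).glob(pattern)}
    test_names = {p.name for p in Path(test_dir).glob(pattern)}
    return {
        "only_in_dev": sorted(dev_names - test_names),
        "only_in_test": sorted(test_names - dev_names),
        "in_both": sorted(dev_names & test_names),
    }
```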