
Pattern read in the DataSet Stage.

Posted: Tue Mar 18, 2008 4:00 pm
by kishorenvkb
I need to read multiple files with a similar structure in one DataSet stage. Since I can't specify a pattern in the DataSet stage, how would I achieve this?

For example, I have four files: Item_A.DS, Item_B.DS, Item_C.DS and Item_D.ds (all with similar column layouts). I have to read in all four files. If they were text files, I would have used the Sequential File stage and read them with the pattern Item_*.Txt.

How would I do the same with DataSets?

Thanks for your responses in advance.
Kishore Nagururu

Posted: Tue Mar 18, 2008 5:21 pm
by ray.wurlod
This cannot be done with a Data Set stage. You might have some luck with a File Set. "Similar" structure is not good enough; "identical" structure is required.

If you are trying to read four Data Sets, however, this is an entirely different ball game. There is no multiple reader. Use a Funnel stage. However, the Data Sets must have identical parallelism.

Posted: Tue Mar 18, 2008 5:46 pm
by kishorenvkb
:-) Yes, they are exactly identical in layout. How do you use the File Set?

Posted: Tue Mar 18, 2008 9:20 pm
by ray.wurlod
Are they Data Sets or files?

You cannot do the "File Set" thing if they are Data Set descriptor files, at least not sensibly, because you lack proper metadata for Data Set descriptor files.

For files there is an option in the Sequential File stage to read multiple files as a File Set.

Posted: Wed Mar 19, 2008 3:19 pm
by kishorenvkb
They are datasets. We are planning to move from sequential files to Datasets for obvious performance reasons. Any help is greatly appreciated.

Posted: Wed Mar 19, 2008 5:04 pm
by ray.wurlod
Four Data Set stages, one Funnel stage. The Data Set stage can only read one Data Set (which is, itself, a parallel structure, so has at least as many data files as there are processing nodes).
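
As a rough illustration of what "several inputs, one Funnel" means, here is a minimal Python sketch. It is not DataStage code and says nothing about how the parallel engine works internally; the function name `funnel` and the (columns, rows) representation are assumptions made for this example only. It simply refuses to combine inputs whose column layouts are not identical, then passes the rows through as one stream.

```python
from itertools import chain


def funnel(*inputs):
    """Combine several (columns, rows) inputs into one stream.

    Hypothetical helper for illustration: each input is a tuple of
    (list of column names, iterable of row tuples).
    """
    if not inputs:
        raise ValueError("funnel needs at least one input")

    columns = list(inputs[0][0])
    # "Similar" is not enough: every input must have the identical layout.
    for position, (cols, _rows) in enumerate(inputs[1:], start=2):
        if list(cols) != columns:
            raise ValueError(
                f"input {position} has columns {list(cols)}, expected {columns}"
            )

    # Rows from each input are passed through in turn, one input after another.
    return columns, chain.from_iterable(rows for _cols, rows in inputs)


if __name__ == "__main__":
    item_a = (["item_id", "name"], [(1, "bolt"), (2, "nut")])
    item_b = (["item_id", "name"], [(3, "washer")])
    cols, rows = funnel(item_a, item_b)
    print(cols)
    for row in rows:
        print(row)
```

The point of the sketch is only the shape of the design: one reader per input, a strict layout check, and a single combined output.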

Posted: Thu Mar 20, 2008 3:08 pm
by kishorenvkb
Thanks Ray.

The situation we are in is... the number of dataset files is not fixed. We are in the pilot and as we mass... we may get more dataset files as input. We cannot afford to open up the code to add a new Data Set stage every time.

That is why I was exploring pattern reads for datasets. Since I can't do pattern reads for datasets, what are my other options?

Thanks again.

Posted: Thu Mar 20, 2008 4:56 pm
by ray.wurlod
As they say in the classics, "tough bikkies". You don't have any alternative. You could always create ten jobs that handle one through ten Data Sets, and use a job sequence to decide which one to run.
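
The "decide which one to run" step could be sketched outside DataStage roughly as follows. This is only an illustration in Python, under several assumptions: the descriptor files sit in one directory, they match the Item_* naming from the original question, and the ten jobs follow a made-up naming scheme (`ReadItems_01` through `ReadItems_10`). None of these names come from DataStage itself, and the sketch does not start any job; it only counts the descriptors and picks a name.

```python
import fnmatch
import os

MAX_DATASETS = 10  # the reply suggests jobs handling one through ten Data Sets


def choose_job(directory, pattern="Item_*.ds", prefix="ReadItems_"):
    """Return (job_name, descriptor_paths) for the descriptors found.

    `pattern` and `prefix` are hypothetical defaults for this example.
    """
    # Compare in lower case, because the example names mix .DS and .ds.
    wanted = pattern.lower()
    names = sorted(
        name
        for name in os.listdir(directory)
        if fnmatch.fnmatchcase(name.lower(), wanted)
    )

    count = len(names)
    if not 1 <= count <= MAX_DATASETS:
        raise ValueError(
            f"found {count} Data Set descriptors; expected 1 to {MAX_DATASETS}"
        )

    paths = [os.path.join(directory, name) for name in names]
    # One pre-built job per possible count, e.g. ReadItems_04 for four inputs.
    return f"{prefix}{count:02d}", paths


if __name__ == "__main__":
    job_name, descriptor_paths = choose_job(".")
    print(job_name)
    for path in descriptor_paths:
        print(path)
```

With the four files from the original question in the directory, `choose_job` would return the hypothetical job name `ReadItems_04` together with the four descriptor paths, which could then be handed to the chosen job as parameters. An eleventh Data Set would still mean building another job, which is the limitation described above.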