Pattern read in the DataSet Stage.

kishorenvkb · Post by **kishorenvkb** » Tue Mar 18, 2008 4:00 pm

I need to read multiple files with a similar structure in one DataSet stage. Since I can't specify a pattern in the DataSet stage, how would I achieve this?

For example, I have four files: Item_A.DS, Item_B.DS, Item_C.DS and Item_D.ds (all with similar column layouts). I have to read in all four files. If they were text files, I would have used the Sequential File stage and read them with the pattern Item_*.Txt.

How would I do the same with DataSets?

Thanks for your responses in advance.
Kishore Nagururu

ray.wurlod · Post by **ray.wurlod** » Tue Mar 18, 2008 5:21 pm

This cannot be done with a Data Set stage. You might have some luck with a File Set. "Similar" structure is not good enough - "identical" structure is required.

If you are trying to read four Data Sets, however, this is an entirely different ball game. There is no multiple reader. Use a Funnel stage. However, the Data Sets must have identical parallelism.

kishorenvkb · Post by **kishorenvkb** » Tue Mar 18, 2008 5:46 pm

Yes, they are exactly identical in layout. How do you use the File Set?

ray.wurlod · Post by **ray.wurlod** » Tue Mar 18, 2008 9:20 pm

Are they Data Sets or files?

You cannot do the "File Set" thing if they are Data Set descriptor files, at least not sensibly, because you lack proper metadata for Data Set descriptor files.

For files there is an option in the Sequential File stage to read multiple files as a File Set.

kishorenvkb · Post by **kishorenvkb** » Wed Mar 19, 2008 3:19 pm

They are datasets. We are planning to move from sequential files to Datasets for obvious performance reasons. Any help is greatly appreciated.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 19, 2008 5:04 pm

Four Data Set stages, one Funnel stage. The Data Set stage can only read one Data Set (which is, itself, a parallel structure, so has at least as many data files as there are processing nodes).

kishorenvkb · Post by **kishorenvkb** » Thu Mar 20, 2008 3:08 pm

Thanks Ray.

The situation we are in is... the number of dataset files is not fixed. We are in the pilot and as we mass... we may get more dataset files as input. We cannot afford to open up the code to add a new dataset stage every time.

That is why I was exploring the pattern read for datasets. Since I can't do pattern reads for datasets, what are my other options?

Thanks again.

ray.wurlod · Post by **ray.wurlod** » Thu Mar 20, 2008 4:56 pm

As they say in the classics, "tough bikkies". You don't have any alternative. You could always create ten jobs that handle one through ten Data Sets, and use a job sequence to decide which one to run.
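
The "decide which one to run" step might look something like the rough sketch below. It only counts the Data Set descriptor files that match a name pattern and picks a job name from that count; it does not call DataStage at all. The job names (`ReadItems_1` through `ReadItems_10`), the `Item_*.ds` pattern, and both function names are assumptions made up for illustration, and starting the chosen job is left to the job sequence.

```python
import fnmatch
import os

# Assumption: ten jobs exist, one per Data Set count, as suggested in the reply.
MAX_DATA_SETS = 10


def find_descriptors(directory, pattern="Item_*.ds"):
    """Return sorted paths of descriptor files whose names match the pattern.

    Matching ignores case, because the example names mix .DS and .ds.
    """
    names = [
        name
        for name in os.listdir(directory)
        if fnmatch.fnmatchcase(name.lower(), pattern.lower())
    ]
    return sorted(os.path.join(directory, name) for name in names)


def choose_job(descriptors, prefix="ReadItems_"):
    """Map the number of descriptors to a hypothetical job name."""
    count = len(descriptors)
    if not 1 <= count <= MAX_DATA_SETS:
        raise ValueError(f"no job defined for {count} Data Sets")
    return f"{prefix}{count}"
```

With the four files from the first post in one directory, `choose_job(find_descriptors(directory))` would select the hypothetical job `ReadItems_4`. An eleventh Data Set raises an error rather than silently dropping input, which is the point at which a new job would have to be built.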