Consuming split files in a job
-
- Participant
- Posts: 158
- Joined: Tue Mar 15, 2005 3:16 am
Hi
We have designed a job which has the following structure
SeqFile -> Tfm -> OraBulk
If I want to load more than one file (for example, 100 files) into the same database table using the same job, is there a best way to do this?
In Oracle there is an external table concept with which we can get the data from more than one file.
Is there any concept like this in DataStage?
I have the following options:
1. Parameterise the file name in the Sequential File stage and call this job with the invocation id on.
2. Create a sequence and call this job in a loop with the invocation id.
The problem with this is that the invocations will run sequentially. It would be great if we could run them in parallel.
Any inputs would be great.
Thanks
Sai
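A minimal shell sketch of option 1 launched concurrently rather than in a loop. `run_invocation` here is a stand-in for something like `dsjob -run -param SourceFile=... <project> <job>.<invocation-id>`; the directory and file names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: run_invocation stands in for the real per-file job invocation.
run_invocation() {
    printf 'loaded %s\n' "$1" > "$1.done"   # marker file instead of a real load
}

mkdir -p in
: > in/part_1.dat; : > in/part_2.dat; : > in/part_3.dat

for f in in/part_*.dat; do
    run_invocation "$f" &    # & launches each invocation in the background
done
wait                         # block until every background invocation finishes
ls in/*.done | wc -l         # prints 3
```

As noted later in the thread, firing everything off in parallel like this can trade the sequential-runtime problem for locking or resource contention on the target table.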
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Investigate the Folder stage.
Investigate using a Filter command in your Sequential File stage that uses cat to spool all the files into the job as if they were one large data stream.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
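At the shell level, the Filter-command suggestion amounts to the following. The file names are made up for illustration; in the actual job you would put the `cat` command in the Sequential File stage's Filter property rather than writing a combined file:

```shell
#!/bin/sh
# Create a couple of split files to stand in for the 100 real ones.
mkdir -p demo
printf 'a\nb\n' > demo/part_1.txt
printf 'c\n'    > demo/part_2.txt

# What the Filter command effectively does: spool every file into one stream.
cat demo/part_*.txt > demo/combined.txt

wc -l < demo/combined.txt    # prints 3
```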
-
- Participant
- Posts: 158
- Joined: Tue Mar 15, 2005 3:16 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Me, I'd just cat all the files together then bulk load. Once.
Depending on your skill level, you could build a looping Sequence that doesn't wait for each job so that (eventually) they will all be running in parallel, but then you may end up with locking and/or resource problems.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 158
- Joined: Tue Mar 15, 2005 3:16 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The cat command will use hardly any resources at all. The files are already on disk. Output from the cat command is not written to disk; it becomes the input to the Sequential File stage. If you like, the effect is that of
Code:
cat files* | DataStage job
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
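The same no-intermediate-file effect can be sketched with a named pipe, which also touches on the named-pipe idea raised later in the thread. All paths here are illustrative, and the second `cat` stands in for the DataStage job reading the stream:

```shell
#!/bin/sh
mkdir -p fifo_demo
printf 'x\n' > fifo_demo/f1.txt
printf 'y\n' > fifo_demo/f2.txt

mkfifo fifo_demo/stream                      # the FIFO; no combined file hits disk
cat fifo_demo/f*.txt > fifo_demo/stream &    # writer blocks until a reader opens
cat fifo_demo/stream > fifo_demo/out.txt     # stand-in for the job reading the pipe
wait

wc -l < fifo_demo/out.txt    # prints 2
```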
-
- Participant
- Posts: 158
- Joined: Tue Mar 15, 2005 3:16 am
Thanks Ray, chulett.
Whatever you said is right for the cat option; I will try it out in practice.
For the Folder stage, I went through the documentation; it can have only two output columns, i.e. file name and file content. Maybe it is ideally suited to reading XML documents. I wanted to know whether the Folder stage can be used here or not.
chulett, you said using named pipes is also possible? If you have any ideas, can you please share them?
Thanks
Sai
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The Folder stage is not an option when the file size is too large. I don't have exact figures on that ready to hand - if, indeed, the limit is documented at all. It has to put the entire contents of each file into a single field.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.