Looping Logic - How to read multiple files from a Directory

DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Looping Logic - How to read multiple files from a Directory

Post by DSFreddie »

Hi All,

I have a requirement in my project where input files (fixed width) need to be read from the input directory and combined for further processing. (It is a weekly job.)

Let's assume this job didn't run on the 7th day (say it runs 2 days later). In that case, we need to read 9 days' worth of files and combine them into one single file for further processing.

Can you please shed some light on how we can accomplish this in DataStage?

Thanks,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Assuming they are moved from the directory once processed, I'd use cat as the filter command.
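For example, a minimal Filter command on the Sequential File stage (the directory path here is an assumption):

Code: Select all

cat /data/InputDir/*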
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
elias.shaik
Participant
Posts: 51
Joined: Sat Dec 09, 2006 3:32 am

Post by elias.shaik »

Can you read all the files as a file pattern? Or do you need to process only one file in a single run?
------------
Elias
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Ray/Elias,

Thanks for your inputs.

The flow should be like this,

Let's assume the batch ran today (19th Dec). The same batch is supposed to run on 26th Dec, since it is weekly. Suppose it didn't run due to some issue and is only ready to run on 29th Dec. In this case, the process needs to read 10 days' worth of files (same layout) from the archive directory (this directory will hold 20 days of data files), combine them, and do the transformations before loading to the target.

Thanks,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A better approach would be not to move the files to the archive directory until they have been processed. Given what you have, you can still use cat in the Filter command - two cat commands, in fact:

Code: Select all

cat ArchiveDir/*;cat InputDir/*
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

You can do the following (see the sketch after this list):
1. Using the ls command on that directory, create FileList.lst. This records which files were considered.
2. From this file list, concatenate the files using xargs and cat.
3. In the DS job, read the concatenated file and process it.
4. Once all processing is done, use FileList.lst with xargs and gzip to compress the files and move them to the archive.
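A minimal shell sketch of those four steps (the directory names and the .dat extension are assumptions):

Code: Select all

# 1. record which files are being picked up in this run
ls /data/InputDir/*.dat > FileList.lst
# 2. concatenate them into a single file for the DS job
xargs cat < FileList.lst > /data/work/Combined.dat
# 3. the DS job reads /data/work/Combined.dat and processes it
# 4. compress the processed files, then move the .gz files to the archive
xargs gzip < FileList.lst
sed 's/$/.gz/' FileList.lst | xargs -I{} mv {} /data/ArchiveDir/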
Regards,
S. Kirtikumar.
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Ray/Kirti,

Thanks for the inputs.

To answer Ray's point: these files are created by another application, and we don't have the ability to control their archive process (so the files need to be read from the archive).

Kirti, regarding your suggestion:
1. Using the ls command on that directory, create FileList.lst. This records which files were considered.
Ans: The archive directory will contain 20 days of files. This is a weekly process; assume it ran after 9 days. In this case, we need to read the files for the past 9 days, counting from the day the batch last ran. Can you please explain how the ls command will work in this scenario, and how to create the file list? Is it a form of text file?
2. From this file list, concatenate the files using xargs and cat.
Can you please tell me what the xargs command does?
3. In the DS job, read the concatenated file and process it.
4. Once all processing is done, use FileList.lst with xargs and gzip to compress the files and move them to the archive.

I am wondering whether there is a way we can accomplish this using Execute Command/Start Loop-End Loop/User Variables activities. I am stuck at the point where I extract the last run date: what should I do to read the past N days' files and concatenate them for further processing?
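For illustration, something along these lines is what I have in mind (the marker file and paths are placeholders, not our real ones):

Code: Select all

# LastRun.marker is touched at the end of each successful run
find /data/ArchiveDir -type f -newer /data/ctl/LastRun.marker > FileList.lst
# combine everything that has arrived since the last run
xargs cat < FileList.lst > /data/work/Combined.dat
# reset the marker once the run finishes cleanly
touch /data/ctl/LastRun.marker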

Thanks
Freddie
Mandy23
Participant
Posts: 8
Joined: Wed Nov 09, 2011 4:04 pm
Location: USA

Post by Mandy23 »

Hi Freddie,

First of all, if there is a possibility to store the last run date of the job/file name, then the approach below will work.

1. Prepare a simple shell script which lists the files and filters out the file names whose dates are greater than the last run date of the job/file. Write this list into a delimited file (ls -m gives a comma-separated listing). A sketch follows this list.

2. Use a sequence loop activity to read these files one by one and process them.
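A sketch of step 1, assuming the file names embed a date such as feed_20111219.dat and the last run date is passed in as YYYYMMDD (both assumptions):

Code: Select all

#!/bin/sh
LAST_RUN=$1                      # last run date, e.g. 20111219
cd /data/ArchiveDir || exit 1
for f in feed_*.dat; do
    d=`echo "$f" | sed 's/feed_\([0-9]*\)\.dat/\1/'`
    [ "$d" -gt "$LAST_RUN" ] && echo "$f"
done | paste -sd, - > /data/work/NewFiles.csv   # comma-delimited, like ls -m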
Mandy
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

I would say yes to Ray's approach, or something similar to it.

What I mean is: create two folders, tmpland and tmparch.

Keep a process that copies the files from the source landing directory to your temp landing directory (you can do it in a before-job routine), and use those files as your source.

Keep them there until they are processed. Once processed, move them to your archive directory.

That way you have the freedom to handle them as you wish.

Every 8th day you can delete the oldest files.

If your environment permits, doing it this way is easy; a rough sketch follows.
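Sketch of that flow (the tmpland and tmparch names follow the post; the source landing path and extension are assumptions):

Code: Select all

# before-job: stage the new files into your own landing area
cp /data/SourceLanding/*.dat /data/tmpland/
# ... the DS job reads its source from /data/tmpland ...
# after-job: move the processed files into your own archive
mv /data/tmpland/*.dat /data/tmparch/
# housekeeping: remove archived files older than a week
find /data/tmparch -type f -mtime +7 -exec rm {} \;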

DS User
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

OK. So what you mean is: the archival is not in our control. So let us say the job runs after 10 days; it is possible that some files would be in InputDir and some in ArchDir. Right?

Also, is ArchDir a plain directory, or does it have date-wise folders? The answers to these questions will decide what logic we use.

One option is to maintain a FileLoaded table. Every time our process starts, look into ArchDir and InputDir for the file list. Compare it with the table and load only the files which are not already in the table. Once processing is done, add these new files to the FileLoaded table. A sketch of the comparison follows.
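A sketch of that comparison using a plain file in place of the table (FileLoaded.lst and the paths are assumptions):

Code: Select all

# file names present now, versus file names already loaded
( ls /data/ArchDir; ls /data/InputDir ) | sort > Current.lst
sort FileLoaded.lst > Loaded.lst
comm -23 Current.lst Loaded.lst > ToProcess.lst   # only the new files
# after a successful load, record them as loaded
cat ToProcess.lst >> FileLoaded.lst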

Answering your other question about xargs: the command takes parameters from standard input and runs a Unix command on them. Google it and you will find a lot of articles. E.g. ls Account*.csv | xargs grep "Pattern" would run the grep command on all the file names reported by the ls command. It is the same as grep "Pattern" Account*.csv.

In some cases, when you need the file names later as well, you can create the file list with ls Account*.csv > FileList.lst and then use it for processing such as combining data, e.g. cat FileList.lst | xargs cat. Both examples are shown below.
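The two examples above, as they would be typed at the shell (the file names are illustrative):

Code: Select all

# run grep on every file that ls reports; same as: grep "Pattern" Account*.csv
ls Account*.csv | xargs grep "Pattern"

# keep the file list for later, then use it to combine the data
ls Account*.csv > FileList.lst
cat FileList.lst | xargs cat > Combined.csv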
Regards,
S. Kirtikumar.