Looping Logic - How to read multiple files from a Directory

DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Looping Logic - How to read multiple files from a Directory

Post by DSFreddie »

Hi All,

I have a requirement in my project where input files (fixed width) need to be read from the input directory and combined for further processing. (It is a weekly job.)

Let's assume this job didn't run on the 7th day (say it runs 2 days later). In that case, we need to read 9 days' worth of files and combine them into one single file for further processing.

Can you please shed some light on how we can accomplish this in DataStage?

Thanks,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Assuming they are moved from the directory once processed, I'd use cat as the filter command.
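For example, a minimal Filter command on the Sequential File stage (the directory path here is an assumption):

Code: Select all

cat /data/InputDir/*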
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
elias.shaik
Participant
Posts: 51
Joined: Sat Dec 09, 2006 3:32 am

Post by elias.shaik »

Can you read all the files as a file pattern? Or do you need to process only one file in a single run?
------------
Elias
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Ray/Elias,

Thanks for your inputs.

The flow should be like this,

Let's assume the batch ran today (19th Dec). The same batch is supposed to run on 26th Dec, since it is weekly. Suppose it didn't run due to some issue and is only ready to run on 29th Dec. In this case, the process needs to read 10 days' worth of files (same layout) from the archive directory (this directory will hold 20 days of data files), combine them, and do the transformations before loading to the target.

Thanks,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A better approach would be not to move the files to the archive directory until they have been processed. Given what you have, you can still use cat in the Filter command - two cat commands, in fact:

Code: Select all

cat ArchiveDir/*;cat InputDir/*
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

You can do the following (see the sketch after this list):
1. Using the ls command on that directory, create FileList.lst. This records which files were considered.
2. From this file list, concatenate the files using xargs and cat.
3. In the DS job, read the concatenated file and process it.
4. Once all processing is done, use FileList.lst with xargs and gzip to compress the files and move them to the archive.
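A minimal shell sketch of those four steps (the directory names and the .dat extension are assumptions):

Code: Select all

# 1. record which files are being picked up in this run
ls /data/InputDir/*.dat > FileList.lst
# 2. concatenate them into a single file for the DS job
xargs cat < FileList.lst > /data/work/Combined.dat
# 3. the DS job reads /data/work/Combined.dat and processes it
# 4. compress the processed files, then move the .gz files to the archive
xargs gzip < FileList.lst
sed 's/$/.gz/' FileList.lst | xargs -I{} mv {} /data/ArchiveDir/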
Regards,
S. Kirtikumar.
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Ray/Kirti,

Thanks for the inputs.

To answer Ray's point: these files are created by another application, and we don't have the ability to control their archive process (so the files need to be read from the archive).

Kirti, regarding your suggestion:
1. Using the ls command on that directory, create FileList.lst. This records which files were considered.
Ans: The archive directory will contain 20 days of files. This is a weekly process; assume it ran after 9 days. In this case, we need to read the files for the past 9 days, counting from the day the batch last ran. Can you please explain how the ls command will work in this scenario, and how to create the file list? Is it a form of text file?
2. From this file list, concatenate the files using xargs and cat.
Can you please tell me what the xargs command does?
3. In the DS job, read the concatenated file and process it.
4. Once all processing is done, use FileList.lst with xargs and gzip to compress the files and move them to the archive.

I am wondering whether there is a way we can accomplish this using Execute Command/Start Loop-End Loop/User Variables activities. I am stuck at the point where I extract the last run date: what should I do to read the past N days' files and concatenate them for further processing?
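For illustration, something along these lines is what I have in mind (the marker file and paths are placeholders, not our real ones):

Code: Select all

# LastRun.marker is touched at the end of each successful run
find /data/ArchiveDir -type f -newer /data/ctl/LastRun.marker > FileList.lst
# combine everything that has arrived since the last run
xargs cat < FileList.lst > /data/work/Combined.dat
# reset the marker once the run finishes cleanly
touch /data/ctl/LastRun.marker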

Thanks
Freddie
Mandy23
Participant
Posts: 8
Joined: Wed Nov 09, 2011 4:04 pm
Location: USA

Post by Mandy23 »

Hi Freddie,

First of all, if there is a possibility to store the last run date of the job/file name, then the approach below will work.

1. Prepare a simple shell script which lists the files and filters out the file names whose dates are greater than the last run date of the job/file. Write this list into a delimited file (ls -m gives a comma-separated listing). A sketch follows this list.

2. Use a sequence loop activity to read these files one by one and process them.
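A sketch of step 1, assuming the file names embed a date such as feed_20111219.dat and the last run date is passed in as YYYYMMDD (both assumptions):

Code: Select all

#!/bin/sh
LAST_RUN=$1                      # last run date, e.g. 20111219
cd /data/ArchiveDir || exit 1
for f in feed_*.dat; do
    d=`echo "$f" | sed 's/feed_\([0-9]*\)\.dat/\1/'`
    [ "$d" -gt "$LAST_RUN" ] && echo "$f"
done | paste -sd, - > /data/work/NewFiles.csv   # comma-delimited, like ls -m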
Mandy
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

I would say yes to Ray's approach, or something similar to it.

What I mean is: create two folders, tmpland and tmparch.

Keep a process that copies the files from the source landing directory to your temp landing directory (you can do it in a before-job routine), and use those files as your source.

Keep them there until they are processed. Once processed, move them to your archive directory.

That way you have the freedom to handle them as you wish.

Every 8th day you can delete the oldest files.

If your environment permits, doing it this way is easy; a rough sketch follows.
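Sketch of that flow (the tmpland and tmparch names follow the post; the source landing path and extension are assumptions):

Code: Select all

# before-job: stage the new files into your own landing area
cp /data/SourceLanding/*.dat /data/tmpland/
# ... the DS job reads its source from /data/tmpland ...
# after-job: move the processed files into your own archive
mv /data/tmpland/*.dat /data/tmparch/
# housekeeping: remove archived files older than a week
find /data/tmparch -type f -mtime +7 -exec rm {} \;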

DS User
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

OK. So what you mean is: the archival is not in our control. So let us say the job runs after 10 days; it is possible that some files would be in InputDir and some in ArchDir. Right?

Also, is ArchDir a plain directory, or does it have date-wise folders? The answers to these questions will decide what logic we use.

One option is to maintain a FileLoaded table. Every time our process starts, look into ArchDir and InputDir for the file list. Compare it with the table and load only the files which are not already in the table. Once processing is done, add these new files to the FileLoaded table. A sketch of the comparison follows.
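A sketch of that comparison using a plain file in place of the table (FileLoaded.lst and the paths are assumptions):

Code: Select all

# file names present now, versus file names already loaded
( ls /data/ArchDir; ls /data/InputDir ) | sort > Current.lst
sort FileLoaded.lst > Loaded.lst
comm -23 Current.lst Loaded.lst > ToProcess.lst   # only the new files
# after a successful load, record them as loaded
cat ToProcess.lst >> FileLoaded.lst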

Answering your other question about xargs: the command takes parameters from standard input and runs a Unix command on them. Google it and you will find a lot of articles. E.g. ls Account*.csv | xargs grep "Pattern" would run the grep command on all the file names reported by the ls command. It is the same as grep "Pattern" Account*.csv.

In some cases, when you need the file names later as well, you can create the file list with ls Account*.csv > FileList.lst and then use it for processing such as combining data, e.g. cat FileList.lst | xargs cat. Both examples are shown below.
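The two examples above, as they would be typed at the shell (the file names are illustrative):

Code: Select all

# run grep on every file that ls reports; same as: grep "Pattern" Account*.csv
ls Account*.csv | xargs grep "Pattern"

# keep the file list for later, then use it to combine the data
ls Account*.csv > FileList.lst
cat FileList.lst | xargs cat > Combined.csv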
Regards,
S. Kirtikumar.