Waiting for multiple files

Posted: Tue Apr 17, 2012 10:38 am
by ravindras83
Hi All

I have a situation where we get about 350 different source files (with different metadata). These files have date suffix in the name but the file name is distinct.

e.g. A_02122012, B_02122012, etc.

We need to wait for the files to appear before starting processing.

I checked the Wait For File stage, but it accepts neither multiple file names nor wildcards.

Is there a way to do this without using 350 WTF stages?

I tried putting the WTF stage between the Start Loop and End Loop stages and calculating the file name with an if-then-else expression, based on the loop counter, in a user activity stage. Although the if-then-else statement is very large, it works.

But I cannot terminate the loop if a file is still not available after the wait time.

Is it possible to terminate the loop with a Terminator activity partway through (i.e. before all repetitions are complete)?

Right now I have written a routine to do this. I want to know if there is a better way to achieve it.

Note: the requirement is that all files be present before processing starts.

Thanks
ravindras83

Posted: Tue Apr 17, 2012 11:13 am
by chulett
Sorry, but it is abbreviated as the WFF stage; the WTF stage would be something else entirely. :wink:

And you're correct in that the WFF stage does not support wildcards, as has been discussed here a number of times. You are better off writing a script or routine to ensure that "all files" have arrived, and using it in a Sequence job to run the load job only when that condition has been met.

Posted: Tue Apr 17, 2012 11:48 am
by FranklinE
With 350 source files on your plate, I suggest you have a different sort of problem that could use a different approach. Not knowing what sort of job scheduling automation you have, that would be my first choice for a solution: wait for the process (job) that creates or sends the file(s) to finish rather than wait for the files themselves.

A hybrid of that would be a job that is your file watcher. It looks like you would want several such jobs so you don't have so many files being watched by just one job.

Posted: Tue Apr 17, 2012 5:32 pm
by qt_ky
WTF stage... too funny!

Create a file having the list of file names or file name prefixes. Create a Unix shell script that reads the file list, appends date info using the Unix date command and format options, and checks if each file exists. If any file does not exist, fail; otherwise pass. The script can return different codes for pass and fail, such as 0 and 1. A DataStage sequence job can call the script and act according to its return code.
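
The approach described above could be sketched roughly as follows. This is only a minimal illustration, not a tested production script: the function name check_files, its arguments, and the PREFIX_MMDDYYYY naming pattern are assumptions based on the A_02122012 example in the first post.

```shell
# check_files: succeed (return 0) only if every expected file exists.
# All names here are hypothetical:
#   $1  list file with one file-name prefix per line (e.g. A, B)
#   $2  landing directory where the source files arrive
#   $3  date suffix; defaults to today's date in MMDDYYYY format
check_files() {
    list_file=$1
    landing_dir=$2
    suffix=${3:-$(date +%m%d%Y)}
    missing=0

    # The "|| [ -n ... ]" part handles a last line without a newline.
    while IFS= read -r prefix || [ -n "$prefix" ]; do
        # Skip blank lines in the list file.
        [ -z "$prefix" ] && continue
        if [ ! -f "$landing_dir/${prefix}_${suffix}" ]; then
            echo "missing: ${prefix}_${suffix}" >&2
            missing=1
        fi
    done < "$list_file"

    # 0 = all files present, 1 = at least one file missing.
    return "$missing"
}
```

If a wrapper script ends with a call such as check_files expected_files.txt /data/landing (both paths hypothetical), the script's exit code is 0 or 1, which the sequence job can test.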

Posted: Tue Apr 17, 2012 9:54 pm
by kandyshandy
Craig, that's a good one.. :)

This may not cover all negative scenarios/error handling/exceptions, but can be a simple check.

Assumption: You will not have any other files in that folder other than the expected 350 files ;)

Instead of having a WFF stage in the loop, just have a command stage with ls -1 | wc -l (or something similar to that), which will give you the file count. Once the count reaches 350, trigger your load jobs!!
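
A rough sketch of that count-based check, done as a polling loop inside the script itself rather than in the sequence loop. It relies on the same assumption as above (nothing but the expected files is in the folder), and the function name, arguments and default values are made up for illustration.

```shell
# wait_for_count: poll a directory until it holds at least the expected
# number of entries, or give up after a maximum number of attempts.
# Hypothetical arguments:
#   $1  directory to watch
#   $2  expected number of files (350 in this topic)
#   $3  maximum number of checks (default 60)
#   $4  seconds to sleep between checks (default 60)
wait_for_count() {
    dir=$1
    expected=$2
    max_tries=${3:-60}
    sleep_secs=${4:-60}
    try=0

    while [ "$try" -lt "$max_tries" ]; do
        # tr strips the padding that some wc implementations add.
        count=$(ls -1 "$dir" | wc -l | tr -d ' ')
        if [ "$count" -ge "$expected" ]; then
            return 0    # enough files have arrived
        fi
        try=$((try + 1))
        sleep "$sleep_secs"
    done

    return 1            # timed out before all files arrived
}
```

A non-zero return here stands in for the "file not available after the wait time" case, so the sequence job can branch on it instead of looping itself.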

Posted: Tue Apr 17, 2012 10:01 pm
by SURA
If I were in your place, I would ask for two folders to be created.

One is LANDING and the other is STAT. All the real data files are placed in LANDING. Once all the files have arrived, a zero-byte loaded.ok file is placed in the STAT folder. Your Wait For File stage looks for this loaded.ok file. At the end of the process, delete the .ok file so that the same check works on the next run.

Let me know if my understanding is not correct.
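
For illustration, the two ends of that marker-file handshake might look something like this in shell. The STAT_DIR default path and the function names are assumptions; only the loaded.ok name comes from the description above.

```shell
# Hypothetical location of the STAT folder; override via the environment.
STAT_DIR=${STAT_DIR:-/data/stat}
FLAG_NAME=loaded.ok

# Delivery side: call this only after the last data file has landed.
# Creates (or truncates) a zero-byte marker file.
mark_loaded() {
    : > "$STAT_DIR/$FLAG_NAME"
}

# Load side: call this at the end of processing so that the next run
# has to wait for a fresh marker file.
clear_loaded() {
    rm -f "$STAT_DIR/$FLAG_NAME"
}
```

The waiting itself stays with the Wait For File stage, which only has to watch for the single marker file.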

Posted: Tue Apr 17, 2012 10:27 pm
by vamsi.4a6
qt_ky wrote:Create a file having the list of file names or file name prefixes. Create a Unix shell script that reads the file list, appends date info using the Unix date command and format options, and checks if each file exists.
Can anybody explain why the date info needs to be appended using the Unix date command and format options?

I think that if we create the file with the complete list of file names, this step is not required. Please correct me if I am wrong.

Posted: Tue Apr 17, 2012 11:16 pm
by kandyshandy
ravindras83 wrote:I have a situation where we get about 350 different source files (with different metadata). These files have date suffix in the name but the file name is distinct.
OP mentioned the suffix and that's why Eric suggested the date appending logic ;)

Posted: Wed Apr 18, 2012 6:23 am
by qt_ky
The reason is given in the first 2 lines of this topic, but here it is again:
ravindras83 wrote:These files have date suffix in the name but the file name is distinct.

e.g. A_02122012, B_02122012, etc.
And one can easily imagine that the date suffix in each file name will change every day.

The Unix date command is one way to generate the required date format. For example (MMDDYYYY format for April 18, 2012):

date +%m%d%Y
04182012

Posted: Wed Apr 18, 2012 6:46 am
by chulett
I tend to shy away from date-specific solutions like that, as you have no ability to 'catch up' if, for whatever reason, you miss a day or files arrive late with a different date on them. Better to have a mechanism to take whatever is there regardless of date and then ensure the files are archived / moved / recorded so they aren't processed again.
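
As a very rough sketch of that idea: pick up whatever is in the landing directory regardless of date, and move each file to an archive once it has been handled. The function name, the directory arguments and the handler command are all hypothetical, and the eight-digit suffix pattern is assumed from the A_02122012 example.

```shell
# process_available: handle every file currently in the landing
# directory, whatever its date suffix, then archive it so that it is
# not picked up again. Hypothetical arguments:
#   $1  landing directory
#   $2  archive directory
#   $3  command that is given one file path and loads it
process_available() {
    landing_dir=$1
    archive_dir=$2
    handler=$3

    for f in "$landing_dir"/*_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -f "$f" ] || continue
        if "$handler" "$f"; then
            mv "$f" "$archive_dir/"
        else
            # Leave failed files in place so a later run can retry them.
            echo "load failed, left in place: $f" >&2
        fi
    done
}
```

Because the date is never compared with today's date, a late file is simply picked up on the next run.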

To a point made earlier about tying into the delivery system: that would always be the best solution if available. Most of the time you are at the mercy of whoever is delivering the files, so you need to do your best to make sure you have everything. An enterprise scheduler that could kick off the load job after the delivery process is complete would be ideal. A semaphore file that is delivered last from the source system, and that you wait for, would be nice as well. Most of the time you're on your own, however.

Posted: Wed Apr 18, 2012 10:34 am
by ravindras83
Thanks all for your replies.

Sorry for the wrong abbreviation.

I am doing what qt_ky suggested, but through a DS routine.

Could anyone kindly help me with the other question?

Can I terminate a loop (using a Terminator activity) before the loop repetitions are complete?


thanks
ravindras83

Posted: Wed Apr 18, 2012 10:41 am
by FranklinE
You can (not) code for an implied end to the loop before it completes its maximum iterations.

I have a loop that runs up to 20 times based on the contents of up to 20 files. If the next file exists but has nothing in it, I have an abort link. If it has more than one row in it, I have a processing link. A file that has exactly one row in it has no link for it, and the loop ends on that condition with an Info message that the loop did not complete its defined number of iterations.

Posted: Wed Apr 18, 2012 11:24 am
by chulett
You can exit the loop early if you need to: simply branch from inside the loop to a Terminator (if you want things to stop abruptly), or branch to a stage past the End Loop stage if you just want to sneak out early. :wink:

Posted: Thu Apr 19, 2012 10:25 am
by ravindras83
thanks all

closing the topic