Wait for file routine - handling incomplete files?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Wait for file routine - handling incomplete files?

Post by sjordery »

Hello All,

I have a sequence that runs sub-sequences once files are in place in a specified directory. I am using a routine for this, as specified by Ray Wurlod at this link viewtopic.php?t=115198&highlight=wait+file+wildcard - thanks Ray! I cannot use Wait For File because the file name is not always the same.

It is all working well, but I stumbled across a problem: in one case, when the routine executed, a large file was midway through being FTP'd to the directory. The routine saw that the file was there and kicked off the sub-sequence too early.

Can anyone suggest a way around this please? I presume that the Wait For File Activity must do something internally to only kick off jobs once the whole file is in place?

Many thanks as ever.

S
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, the WFF stage would have the same problem - see file, get file.

The typical solution would be to utilize a semaphore file, a small usually zero-byte file that is sent after the main file. You poll for the semaphore and, when it arrives, go get the main file.
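The semaphore pattern can be sketched as a small polling loop. The directory layout and the ".done" suffix below are illustrative assumptions, not anyone's actual naming convention:

```shell
#!/bin/sh
# wait_for_semaphore DIR NAME INPUT: block until NAME.done appears in
# DIR, then move NAME into INPUT and clean up the semaphore.
# The sender must write NAME.done only AFTER NAME is fully transferred.
wait_for_semaphore() {
    dir=$1; name=$2; input=$3
    until [ -f "$dir/$name.done" ]; do
        sleep 30                      # poll interval - tune to your SLA
    done
    mv "$dir/$name" "$input/$name"    # semaphore present, file is complete
    rm -f "$dir/$name.done"
}
```

A zero-byte semaphore transfers almost instantly, so there is no window in which it can itself be seen half-written.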

Either that, or you need to build logic into your routine to see if the file is complete. I've seen people grep for the presence of known trailer information, check multiple times to see if the file is growing, or use something like 'fuser' to know if the file is open by another user.
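The "check whether the file is still growing" idea can be sketched as a size-stability test. This is a heuristic, not a guarantee - a stalled transfer also looks stable:

```shell
#!/bin/sh
# file_is_stable FILE [PAUSE]: succeed if FILE's size is unchanged
# across a short interval (default 10 seconds).
file_is_stable() {
    file=$1; pause=${2:-10}
    size1=$(wc -c < "$file" | tr -d ' ')
    sleep "$pause"
    size2=$(wc -c < "$file" | tr -d ' ')
    [ "$size1" -eq "$size2" ]
}
```

In practice you would call this in a loop and only hand the file to the sub-sequence once it reports stable, possibly combined with a `fuser` check to confirm no process still holds it open.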

I prefer the semaphore approach. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

chulett wrote:I prefer the semaphore approach. :wink:
Thanks for the quick reply and your time Craig - I'll try and work the semaphore approach into my plan. :D

Cheers
S
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I like to FTP the file using a script, giving it a suffix such as ".in_process", and then, once the FTP is successful, rename the file to its correct name.
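A minimal sketch of this rename trick, using a local copy to stand in for the slow transfer (with real FTP you would `put` to the temporary name and then issue the client's rename command). Paths and the suffix are assumptions:

```shell
#!/bin/sh
# deliver SRC DEST: write the file under a temporary ".in_process"
# suffix, then rename it into place only after the write completes.
# Watchers polling for DEST never see a partially written file,
# because rename is atomic on the same filesystem.
deliver() {
    src=$1; dest=$2
    cp "$src" "$dest.in_process"      # the slow transfer happens here
    mv "$dest.in_process" "$dest"     # instant, atomic rename
}
```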
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

Thanks very much for that ArndW - another top tip!

Cheers,
S
stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

I use a different approach (on Linux and Unix platforms): any file is FTP'd or written to a specific directory, say /arrival. On completion, the file(s) are moved (using mv) to another directory where DataStage is listening for files, say /complete.

DataStage then moves any files to be used to another directory, say /inprogress, and, when finished processing, to an archive directory (if needed), say /archive.

This way I can ensure restartability and integrity, not only in delivery but also when (if) the data integration job fails. On restart it may add new files to the process and/or reprocess any from the previous run. Meanwhile, new files can be delivered without disturbing the data integration process.
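The staged-directory handoff described above boils down to moving everything from one stage directory to the next between steps. A minimal sketch, with the directory names taken from the post:

```shell
#!/bin/sh
# advance_stage FROM TO: move every regular file from one stage
# directory to the next, e.g. /complete -> /inprogress before the
# job runs, then /inprogress -> /archive after it succeeds.
advance_stage() {
    from=$1; to=$2
    for f in "$from"/*; do
        [ -f "$f" ] && mv "$f" "$to/"
    done
    return 0
}
```

Because mv within a filesystem is atomic, a file is always wholly in exactly one stage, which is what makes the restart behaviour safe.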
-------------------------------------
http://it.toolbox.com/blogs/bi-aj
my blog on delivering business intelligence using agile principles
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sure, excellent point, we do the same thing when multiple files are involved - especially if they trickle in over the course of the day. Didn't mention that as we were talking about a single file. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

stefanfrost1 wrote:I use a different approach (on Linux and Unix platforms): any file is FTP'd or written to a specific directory, say /arrival. On completion, the file(s) are moved (using mv) to another directory where DataStage is listening for files, say /complete.

DataStage then moves any files to be used to another directory, say /inprogress, and, when finished processing, to an archive directory (if needed), say /archive.

This way I can ensure restartability and integrity, not only in delivery but also when (if) the data integration job fails. On restart it may add new files to the process and/or reprocess any from the previous run. Meanwhile, new files can be delivered without disturbing the data integration process.
Thanks Stefan. Can I ask what triggers the 'mv' from /arrival to /complete?

My original process executed a job that waited for a file to arrive and moved it, once present, from a landing directory to an input directory - but the mv fired as soon as the file first hit the landing directory, so it moved an incomplete file...

Thanks
S
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Whoever was transferring the file would need to do that, post transfer. You can't, unless you first check to ensure the file is completely transferred - and if you do that, we're back where we started. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

chulett wrote:Whoever was transferring the file would need to do that, post transfer. You can't, unless you first check to ensure the file is completely transferred - and if you do that, we're back where we started. :wink:
Ah, got you - so the 'mv' would be issued by whatever application posted the file. I'm there now! :lol:

Cheers,
S
Post Reply