
Monitor network file location for new files

Posted: Tue May 17, 2016 7:42 am
by alex999
Hi all

We are developing an application that manages some Excel files. The application will produce Excel files and save them in a specific location.
We now need to import these files into a database with DataStage as soon as someone saves a new file in that location. So we need DataStage to monitor this location all the time and, whenever a new file appears, start a job and import the file. Is that possible somehow?
How would such a job be configured?

Thank you for any responses

Posted: Tue May 17, 2016 11:48 am
by rkashyap
(Assuming that the application creating the excel file cannot drop it on the DataStage/Unix box) Mount/share the source directory on DataStage box. Subsequently use a filewatcher to trigger other activities of your job. Filewatcher can be implemented using your scheduling tool or Wait For File Activity stage.
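A filewatcher is essentially just a loop that polls the shared directory until the expected file shows up, then kicks off the job. A minimal sketch of that idea in Python (the directory, filename pattern and dsjob project/job names here are placeholders, not your actual configuration):

import glob
import subprocess
import time

WATCH_DIR = "/mnt/shared/reports"   # hypothetical mount of the source directory
PATTERN = "report_hr_*.csv"         # hypothetical filename pattern
POLL_SECONDS = 60

def wait_for_file():
    """Poll the watched directory until at least one matching file appears."""
    while True:
        matches = glob.glob(f"{WATCH_DIR}/{PATTERN}")
        if matches:
            return matches
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    for path in wait_for_file():
        # Placeholder: trigger the DataStage job via dsjob, passing the file as a parameter.
        subprocess.run(["dsjob", "-run", "-param", f"SourceFile={path}",
                        "MyProject", "ImportExcelJob"], check=True)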

Posted: Tue May 17, 2016 1:55 pm
by alex999
That is a great tip! I haven't used the Wait For File activity yet.

Is it possible to configure it with a path so it picks up ANY new file, or does it have to be a specific file name? In the source directory there will be files like report_hr_20160501 and report_hr_20160601, and one day, for example, report_hr_20160701 and report_hr_20160601_v2 will appear; DataStage should take only those two new reports. We probably won't know the names of these files in advance.

Posted: Tue May 17, 2016 2:27 pm
by rkashyap
Wait For File Activity stage looks for a specific filename and does not support wildcards. See this discussion.

Posted: Tue May 17, 2016 5:33 pm
by chulett
And failing the ability to mount them on your ETL server (can't remember the last place I worked where that was allowed to happen) you'll need to look into a script of some kind to do the remote wildcard file existence check. Or if your version supports it, use an FTP source in a job to do basically an "mget" on files that you schedule to run X times a day.
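As a hedged sketch of what such a remote check-and-pull script could look like, here is the idea in Python using ftplib as a stand-in for whatever transfer mechanism you actually have (host, credentials, paths and the wildcard are all placeholders):

import fnmatch
import ftplib
import os

REMOTE_HOST = "app-server.example.com"   # placeholder source system
REMOTE_DIR = "/exports/reports"          # placeholder remote directory
LOCAL_DIR = "/data/landing"              # placeholder landing area on the ETL server
PATTERN = "report_hr_*"                  # the wildcard Wait For File can't do

def fetch_new_files():
    """List the remote directory, match on a wildcard, and pull anything found (an 'mget')."""
    pulled = []
    with ftplib.FTP(REMOTE_HOST) as ftp:
        ftp.login("etl_user", "secret")  # placeholder credentials
        ftp.cwd(REMOTE_DIR)
        for name in ftp.nlst():
            if fnmatch.fnmatch(name, PATTERN):
                local_path = os.path.join(LOCAL_DIR, name)
                with open(local_path, "wb") as out:
                    ftp.retrbinary(f"RETR {name}", out.write)
                pulled.append(local_path)
    return pulled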

Next question in my mind - how are you planning on processing the native xls / xlsx files? ODBC? Or will they be exporting them as .csv files for you?

You'll also need to manage the whole "only new files" part yourself, which can be fun. Are you allowed to rename the source files? Move them elsewhere on the source system? Delete them when you are done with them? Failing that post-processing ability, you might need to fall back on keeping a list of filenames you've already seen and only allow "new" names through. Can be handy if they accidentally push the same file to you more than once and you only want to process it the first time.
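A minimal sketch of that "only new names" bookkeeping, keeping a plain text control file of filenames already processed (the file location is just a placeholder):

import os

SEEN_LIST = "/data/control/processed_files.txt"   # placeholder control file

def load_seen():
    """Return the set of filenames already processed."""
    if not os.path.exists(SEEN_LIST):
        return set()
    with open(SEEN_LIST) as f:
        return {line.strip() for line in f if line.strip()}

def filter_new(candidates):
    """Keep only filenames we have not seen before, and record them."""
    seen = load_seen()
    new_files = [name for name in candidates if name not in seen]
    with open(SEEN_LIST, "a") as f:
        for name in new_files:
            f.write(name + "\n")
    return new_files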

Posted: Wed May 18, 2016 12:06 am
by alex999
The application can convert the files into *.csv. I think that would be even better for DS.

I should be allowed to rename the source files. I don't want to rename the source xls files, but once they are converted into csv they could be renamed.
Files shouldn't be removed from that location; when they are out of date they should be renamed as an old version or something like that - that is the web application's part. So it is also impossible to push the same file twice - that would just be treated as a new version. The application should handle the moving, converting and renaming.

As there is a kind of versioning, I will certainly need some additional columns in the database: version, date from, date to and a status flag saying whether it is the current version.

So... taking this versioning into consideration... DataStage would somehow have to find the file with the newest version in its name, but I don't think that would be possible...

Or maybe something like this: every new file wouldn't have a version in its name, for example it would always be Report1.csv. When a new one comes, Report1 becomes Report1_v1 and the new one becomes Report1; when the next one comes we have Report1_v1, Report1_v2 and the current Report1...
DataStage would then always check only for Report1...

There should be only one current Report source file.
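A minimal sketch of that rotation scheme, assuming a local directory where the current file is always Report1.csv and older versions get a _vN suffix (paths and helper names are illustrative only):

import os

REPORT_DIR = "/data/reports"     # placeholder directory
CURRENT = "Report1.csv"          # the one file DataStage always looks for

def archive_current_and_publish(new_file_path):
    """Rename the existing Report1.csv to the next free _vN name, then publish the new file as Report1.csv."""
    current_path = os.path.join(REPORT_DIR, CURRENT)
    if os.path.exists(current_path):
        # Find the next free version number: Report1_v1.csv, Report1_v2.csv, ...
        n = 1
        while os.path.exists(os.path.join(REPORT_DIR, f"Report1_v{n}.csv")):
            n += 1
        os.rename(current_path, os.path.join(REPORT_DIR, f"Report1_v{n}.csv"))
    os.rename(new_file_path, current_path)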

Posted: Wed May 18, 2016 4:54 am
by alex999
We have developed an alternative idea:
if there is a problem with processing xls/csv, we add one more stage where the application saves the raw data,
and DataStage then extracts the data from the application database and imports it into the destination.

Posted: Wed May 18, 2016 6:20 am
by chulett
There's no "problem" with processing those files, per se. CSV files should be trivial, native Excel files on a UNIX machine much less so. And the file management could be quite extensive but all of that is perfectly doable.

Your alternative idea sounds much... easier. :wink:

Posted: Wed May 18, 2016 7:16 am
by johnboy3
At the risk of looking bad, can someone explain the alternative solution he ended up with? I'm fairly new to DS.

Thanks,
john3

Posted: Wed May 18, 2016 7:33 am
by chulett
He just means have the app changed to push the data to staging / interface tables specific to this processing and then have DS source from that database directly, turning it from an Excel Files Goat Rodeo to a much more standard "ETL" process.
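As a hedged illustration of those staging / interface tables (table and column names are made up here; the real design depends on the application), the app would simply insert raw rows and DataStage would select the current ones to load downstream:

import sqlite3

# Stand-in for whatever database the application actually writes to.
conn = sqlite3.connect("app_staging.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stg_hr_report (
        report_version INTEGER,
        valid_from     TEXT,
        valid_to       TEXT,
        is_current     INTEGER,
        payload        TEXT
    )
""")

# The application inserts raw rows here instead of writing Excel files...
conn.execute(
    "INSERT INTO stg_hr_report VALUES (?, ?, ?, ?, ?)",
    (1, "2016-05-01", None, 1, "raw report data"),
)
conn.commit()

# ...and DataStage (or any client) simply selects the current rows to load downstream.
rows = conn.execute("SELECT * FROM stg_hr_report WHERE is_current = 1").fetchall()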

Most times you won't have the option of getting the source system changed to be more ETL-friendly; usually you are stuck with whatever they do (or do not) provide and have to suck it up and ride the bull. :wink: