I wanted to share this private email with everyone so that we all benefit from the discussion:
I just noted your response on the DSXchange to a request for a way to transport files by a method other than FTP.
You say never to use the FTP stage as it is a gimmick. We use this stage fairly extensively throughout our system (DS 5.2 on Unix) without any problems. Granted, most file sizes are relatively small, but we very rarely have any hitches with the jobs.
I am curious to understand why we should not be using it when we have had very few problems with it.
I would very much appreciate your advice on this, as we are looking at an upgrade to version 7.1 very shortly, and if there is good reason to replace the FTP stages then now would be the time.
Please let me know.
The FTP stage "reads" the file, while a command-line FTP preserves the file without prejudice in the transfer. So, using the FTP stage just to "move" a file involves "reading" and "writing" rows and columns, whereas command-line FTP doesn't care about the content.
For small-volume files, an NFS or Samba mount is really elegant. Network performance is not so critical because the volume is low, and it means you can use the Sequential stage, which has a lot more features and is easier to work with.
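As an illustration only (syntax varies by Unix flavor; host, export, and paths here are made up), on Linux the mount side of that might look roughly like this:

    # Hypothetical host, export and mount point -- mount the remote
    # directory once (as root), then point the Sequential stage at it.
    mkdir -p /mnt/extracts
    mount -t nfs sourcehost:/export/extracts /mnt/extracts
    # The job now reads /mnt/extracts/orders.dat as if it were local.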
For large-volume sources, you must consider a multiple-job-instance design to process the source data in parallel. In order to process a source data set in parallel, you need to be able to "partition" or "cut" the data into equal groups. You can't do this if the data is remote via FTP, and it's really difficult if it's in a table. With a local sequential file, you have many options and it's really easy.
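As a rough sketch of what "cutting" a local file can look like (the file name and instance count are invented), the Unix split command gives each job instance its own roughly equal chunk:

    # Hypothetical file and instance count -- split a local extract into
    # roughly equal chunks, one per job instance.
    FILE=/data/landing/orders.dat
    INSTANCES=4
    TOTAL=`wc -l < $FILE`
    LINES=`expr $TOTAL / $INSTANCES + 1`
    split -l $LINES $FILE /data/landing/orders.part.
    # Produces orders.part.aa, orders.part.ab, ... each read by one instance.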
Basically, if high volumes are your concern, parallel job instances are your solution. Having the data local is required for maximum throughput, both inbound and outbound. Once the data is produced, moving it to the remote system is optimally done with command-line FTP, where compression and a dedicated transfer can take place.
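A minimal sketch of that outbound move (paths, host, and login are invented for illustration): compress the file, then push it with a scripted, binary-mode FTP session.

    # Hypothetical paths, host and account.
    gzip -c /data/out/extract.dat > /tmp/extract.dat.gz
    echo "user etluser etlpassword" >  /tmp/ftp.cmds
    echo "binary"                   >> /tmp/ftp.cmds
    echo "put /tmp/extract.dat.gz extract.dat.gz" >> /tmp/ftp.cmds
    echo "bye"                      >> /tmp/ftp.cmds
    ftp -n target.host < /tmp/ftp.cmds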
If the FTP stage is working for you, great. But you may want to benchmark the stage on a 30 GB, 50-million-row remote file. Just write a simple job that parses the file, eliminates some columns, and writes the output. Benchmark the FTP stage as the reader versus a command-line transfer followed by a Sequential stage as the reader in the transform job.
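One rough way to time the comparison, assuming the standard dsjob command-line interface and made-up project and job names (transfer_file.sh here stands for a scripted ftp get like the put example above):

    # Time the design that reads the remote file through the FTP stage...
    time dsjob -run -wait MyProject ReadViaFtpStage
    # ...then time the command-line transfer plus the local Sequential read.
    time sh transfer_file.sh
    time dsjob -run -wait MyProject ReadViaSequentialStage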
Don't even mention restart capability. If your process dies halfway through, you will re-incur the full transfer, whereas the local file lets the job skip rows it has already processed if you build in a restart check, such as a constraint comparing @INROWNUM to a job parameter holding the last committed row. That same check run against the FTP stage versus a Sequential stage is light-years apart in performance.