How to Write a Job to Handle a Thousand Files in Several Runs

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

How to Write a Job to Handle a Thousand Files in Several Runs

Post by olgc »

Hi there,

How can I write a job to handle, say, a thousand files in a folder? We don't want to process all of these files in one batch; instead, each run should be limited to processing about 10 million records from these files. Suppose the total number of records in these files exceeds 100 million.

With this job, we need to solve two problems:

1. Count the records in each file.
2. Pick the files for each run so that they contain around 10 million records in total, and each file is processed exactly once.

Thanks,
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Since the Sequential File stages are designed to process a whole file at a time, I would write a script (outside of DataStage) which splits your input data into files of the correct line counts. This is easily done in UNIX scripting with tools such as cat and split.
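
A minimal sketch of that idea, assuming the inputs are newline-delimited files with no header rows and no embedded newlines; the directory names and the 10-million-line chunk size are illustrative:

#!/bin/sh
# Concatenate every input file and re-split the combined stream into
# chunks of at most 10 million lines each. "input" and "chunks" are
# illustrative names; adjust -l to your records-per-run limit.
mkdir -p chunks
cat input/*.csv | split -l 10000000 - chunks/run_

split names the pieces chunks/run_aa, chunks/run_ab, and so on, and each piece can then be fed to one DataStage run.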
olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

Post by olgc »

Thanks, ArndW. Here is another approach with better performance:

For issue 1: somebody suggested using wc to count the number of rows, such as:

wc -l ./*.csv | awk 'BEGIN {print "count\tfilename"} {printf "%d\t|%s\n", $1, $2}'

I don't like this approach because it has to read each file in full just to count its rows. Below is an approach that estimates the number of rows in a file instead:

Use command "ls -l *.csv" to dump the directory to a file. In this file, we have each file name and its size. After examining each files serveral times, we can figure out a ratio of row count with its size for each file, and producing a file of list of all files with its estimate row count. By processing this file, we can control what files need to process for a run. No need to read file to count its number of rows, so has better performance.

Thanks