How to Write a Job to Handle a Thousand Files in Several Runs

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

How to Write a Job to Handle a Thousand Files in Several Runs

Post by olgc »

Hi there,

How can I write a job to handle, say, a thousand files in a folder? We don't want to process all of these files in one batch; instead, each run should be limited to processing about 10 million records from these files. Suppose the total number of records in these files exceeds 100 million.

With this job, we need to solve two problems:

1. Count the records in each file.
2. Pick the files for each run so that they contain around 10 million records in total, and each file is processed exactly once.

Thanks,
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Since the Sequential File stages are designed to process a whole file at a time, I would write a script (outside of DataStage) which splits your input data into files of the correct line counts. This is easily done in UNIX scripting with tools such as cat and split.
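
A minimal sketch of that idea, assuming the inputs are newline-delimited files with no header rows and no embedded newlines; the directory names and the 10-million-line chunk size are illustrative:

#!/bin/sh
# Concatenate every input file and re-split the combined stream into
# chunks of at most 10 million lines each. "input" and "chunks" are
# illustrative names; adjust -l to your records-per-run limit.
mkdir -p chunks
cat input/*.csv | split -l 10000000 - chunks/run_

split names the pieces chunks/run_aa, chunks/run_ab, and so on, and each piece can then be fed to one DataStage run.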
olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

Post by olgc »

Thanks, ArndW. Here is another approach with better performance:

For issue 1: somebody suggested using wc to count the number of rows, such as:

wc -l ./*.csv | awk 'BEGIN {print "count\tfilename"} {printf "%d\t|%s\n", $1, $2}'

I don't like this approach because it has to read each file in full just to count its rows. Below is an approach that estimates the number of rows in a file instead:

Use command "ls -l *.csv" to dump the directory to a file. In this file, we have each file name and its size. After examining each files serveral times, we can figure out a ratio of row count with its size for each file, and producing a file of list of all files with its estimate row count. By processing this file, we can control what files need to process for a run. No need to read file to count its number of rows, so has better performance.

Thanks