Hi All,
I have big CSV files and I have to divide each one into many small CSV files, depending on the size of the input file. Every output file should have the column names in its first row.
As an example, I have one input file sample.CSV which has 250 rows. The ETL job should process the file and create 3 output files: sample1.CSV, sample2.CSV and sample3.CSV. The first two files should have 100 rows each and the 3rd file should contain the last 50 rows.
Is this possible in DataStage?
Single big CSV file to many small CSV files
Is the number of lines in the large CSV file known in advance, or arbitrary? Is there a maximum value that it may have? The answers to those questions will affect any suggested technique.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I would recommend using the UNIX split command and splitting the file into n-line or n-byte chunks. It is fast, efficient and built into every UNIX flavor.
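For illustration, here is a minimal command-line sketch of that approach, assuming an input named sample.CSV and 100 data rows per output file; the file names and the header handling are my own additions and not part of split itself:

Code:
# keep the column-name row aside
head -1 sample.CSV > header.csv
# split the remaining data rows into 100-line chunks (sample_aa, sample_ab, ...)
tail -n +2 sample.CSV | split -l 100 - sample_
# prepend the header to each chunk to produce complete CSV files
for f in sample_??; do
    cat header.csv "$f" > "$f.csv"
    rm "$f"
done

Newer GNU versions of split also offer numeric suffixes (-d) and an --additional-suffix option for nicer names, but those are not portable to every UNIX flavor.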
If you're on UNIX, then you can start multiple concurrent readers of the same input CSV file; one reader for each output file you want (e.g. 4). Use AWK to filter the set of rows in each reader based on the modulus of the input row number, setting a different modulus value in each reader. For example, to break the input CSV file into 4 output streams, use the AWK commands shown below.
Place the AWK statements in the Filter command box of the reader sequential file stage (be sure to select Stage uses Filter commands), and use a different AWK statement in each. This way you can be certain that every input row goes through one, and only one reader.
Code:
awk 'NR%4 == 0'
awk 'NR%4 == 1'
awk 'NR%4 == 2'
awk 'NR%4 == 3'
Once you have filtered the input file into separate streams, you can save them as multiple CSV files, with or without headers, as you want.
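If you do want the column-name row in every output file, one possible variant (my own sketch, not something the stage does for you) is to let row 1 pass through every filter and offset the modulus test by one for the data rows:

Code:
awk 'NR == 1 || (NR - 1) % 4 == 0'
awk 'NR == 1 || (NR - 1) % 4 == 1'
awk 'NR == 1 || (NR - 1) % 4 == 2'
awk 'NR == 1 || (NR - 1) % 4 == 3'

Each data row still satisfies exactly one of the four tests, while row 1 satisfies all of them, so every stream starts with the header.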
If you're on Windows, then there are versions of AWK available, but I don't have any experience with them so cannot advise on how well they work. On UNIX I have used the above technique to read a large sequential file in parallel.