Single big CSV file to many small CSV files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
tsamui
Participant
Posts: 23
Joined: Fri May 04, 2007 3:05 am
Location: Kolkata

Single big CSV file to many small CSV files

Post by tsamui »

Hi All,
I have a big CSV file and I have to divide it into many small CSV files, depending on the size of the input file. Every output file must have the column names as its first row.
As an example, suppose the input file sample.CSV has 250 rows. The ETL should process the file and produce 3 output files: sample1.CSV, sample2.CSV and sample3.CSV. The first two files should have 100 rows each and the 3rd file should contain the last 50 rows.
Is this possible in DataStage?
Hemant_Kulkarni
Premium Member
Posts: 50
Joined: Tue Jan 02, 2007 1:40 am

Post by Hemant_Kulkarni »

I think you are on Unix.

I would use Unix scripting to do this.

You can write a shell script to check the number of lines in the file (wc -l)
and, for every 100 lines (or whatever the number of rows should be), write to a new file (using the head and tail commands, for example)
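A rough sketch of such a script (the input name, output names and 100-row chunk size are just examples, not taken from the original question; a small demo input is generated first so the script is self-contained):

```shell
#!/bin/sh
# Build a small demo input: a header row plus 250 data rows.
{ echo "id,name"; seq 1 250 | sed 's/$/,x/'; } > sample.CSV

IN=sample.CSV
ROWS=100

header=$(head -n 1 "$IN")
total=$(($(wc -l < "$IN") - 1))   # data rows, excluding the header
part=1
start=2                           # first data row is line 2
while [ "$start" -le $((total + 1)) ]; do
    out="sample${part}.CSV"
    printf '%s\n' "$header" > "$out"              # repeat the header
    tail -n +"$start" "$IN" | head -n "$ROWS" >> "$out"
    start=$((start + ROWS))
    part=$((part + 1))
done
```

For the 250-row demo this produces sample1.CSV and sample2.CSV with 100 data rows each and sample3.CSV with the last 50, each carrying the header.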
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is the number of lines in the large CSV file known in advance, or arbitrary? Is there a maximum value that it may have? The answers to those questions will affect any suggested technique.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tsamui
Participant
Posts: 23
Joined: Fri May 04, 2007 3:05 am
Location: Kolkata

Post by tsamui »

Thanks for the reply.

The number of lines is not known and maximum may be anything.

Is there any way I can divide one CSV file to four output CSV files?
If this is possible, that will be a good solution for me.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I would recommend using the UNIX split command and splitting the file into n-line or n-byte chunks. It is fast, efficient and built in to every UNIX flavor.
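For example, a sketch using split in line mode (file names are illustrative; a demo input is generated first, and the header is stripped before splitting and re-attached afterwards, since split itself does not repeat it):

```shell
#!/bin/sh
# Demo input: a header row plus 250 data rows.
{ echo "id,name"; seq 1 250 | sed 's/$/,x/'; } > sample.CSV

tail -n +2 sample.CSV > body.CSV       # drop the header
split -l 100 body.CSV sample_          # -> sample_aa, sample_ab, sample_ac
for f in sample_aa sample_ab sample_ac; do
    { head -n 1 sample.CSV; cat "$f"; } > "$f.CSV"   # re-attach header
    rm "$f"
done
```

Each output file then carries the header plus up to 100 data rows.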
katz
Charter Member
Posts: 52
Joined: Thu Jan 20, 2005 8:13 am

Post by katz »

If you're on UNIX, then you can start multiple concurrent readers of the same input csv file; one reader for each output file you want (e.g. 4). Use AWK to filter the rows in each reader based on the row number modulo the number of readers, setting a different modulus value in each reader. For example, to break the input csv file into 4 output streams use these AWK commands:

Code: Select all

awk 'NR%4 == 0'
awk 'NR%4 == 1'
awk 'NR%4 == 2'
awk 'NR%4 == 3'
Place the AWK statements in the Filter command box of each reader Sequential File stage (be sure to select "Stage uses filter commands"), using a different AWK statement in each. This way you can be certain that every input row goes through one, and only one, reader.

Once you have filtered the input file into separate streams you can save them as multiple csv files, with or without headers, as you want.

If you're on Windows, then there are versions of AWK available, but I don't have any experience with any of them so cannot advise on how well they work. In UNIX I have used the above technique to read a large sequential file in parallel.
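The round-robin idea can be demonstrated outside DataStage as well (file names are illustrative and a demo input is generated first). Note that with this simple form the header row is line 1, so it lands in the NR%4 == 1 stream; in the DataStage job you would handle the header separately:

```shell
#!/bin/sh
# Demo input: a header row plus 250 data rows (251 lines total).
{ echo "id,name"; seq 1 250 | sed 's/$/,x/'; } > sample.CSV

# Each line number NR satisfies exactly one of the four conditions,
# so every row goes to one, and only one, output stream.
awk 'NR % 4 == 0' sample.CSV > part0.CSV
awk 'NR % 4 == 1' sample.CSV > part1.CSV
awk 'NR % 4 == 2' sample.CSV > part2.CSV
awk 'NR % 4 == 3' sample.CSV > part3.CSV
```

The four parts together contain every line of the input exactly once.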
Post Reply