Hi All,
I have big CSV files and I have to divide each one into many small CSV files, depending on the size of the input file. Every output file should have the column names in its first row.
As an example, I have one input file sample.CSV which has 250 rows. The ETL job should process the file and create 3 output files: sample1.CSV, sample2.CSV and sample3.CSV. The first two files should have 100 rows each and the 3rd file should contain the last 50 rows.
Is this possible in DataStage?
Single big CSV file to many small CSV files
Is the number of lines in the large CSV file known in advance, or arbitrary? Is there a maximum value that it may have? The answers to those questions will affect any suggested technique.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I would recommend using the UNIX split command and splitting the file into n-line or n-byte chunks. It is fast, efficient and built into every UNIX flavor.
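For illustration, here is a minimal command-line sketch of that approach, assuming an input named sample.CSV and 100 data rows per output file; the file names and the header handling are my own additions and not part of split itself:

Code:
# keep the column-name row aside
head -1 sample.CSV > header.csv
# split the remaining data rows into 100-line chunks (sample_aa, sample_ab, ...)
tail -n +2 sample.CSV | split -l 100 - sample_
# prepend the header to each chunk to produce complete CSV files
for f in sample_??; do
    cat header.csv "$f" > "$f.csv"
    rm "$f"
done

Newer GNU versions of split also offer numeric suffixes (-d) and an --additional-suffix option for nicer names, but those are not portable to every UNIX flavor.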
If you're on UNIX, then you can start multiple concurrent readers of the same input CSV file; one reader for each output file you want (e.g. 4). Use AWK to filter the set of rows in each reader based on the modulus of the input row number, setting a different modulus value in each reader. For example, to break the input CSV file into 4 output streams, use the AWK commands shown below.
Place the AWK statements in the Filter command box of the reader sequential file stage (be sure to select Stage uses Filter commands), and use a different AWK statement in each. This way you can be certain that every input row goes through one, and only one reader.
Code:
awk 'NR%4 == 0'
awk 'NR%4 == 1'
awk 'NR%4 == 2'
awk 'NR%4 == 3'
Once you have filtered the input file into separate streams, you can save them as multiple CSV files, with or without headers, as you want.
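If you do want the column-name row in every output file, one possible variant (my own sketch, not something the stage does for you) is to let row 1 pass through every filter and offset the modulus test by one for the data rows:

Code:
awk 'NR == 1 || (NR - 1) % 4 == 0'
awk 'NR == 1 || (NR - 1) % 4 == 1'
awk 'NR == 1 || (NR - 1) % 4 == 2'
awk 'NR == 1 || (NR - 1) % 4 == 3'

Each data row still satisfies exactly one of the four tests, while row 1 satisfies all of them, so every stream starts with the header.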
If you're on Windows, then there are versions of AWK available, but I don't have any experience with them so cannot advise on how well they work. On UNIX I have used the above technique to read a large sequential file in parallel.