How to divide a file into 8 parts
Hi All,
I need to divide data into 8 distinct files, based on the file size.
Example:
1 - Data source: a flat file named "TOTO.txt"
2 - Many transforms, checks, and other operations...
3 - Data target: TOTO1.txt, TOTO2.txt... TOTO8.txt
(It would be great if each file had the same size, but I don't know if that's possible.)
To solve the problem, I wanted to use the "File Set" stage, where you can specify the "Maximum File Size" option.
My problem is that I don't know how to create a File Set.
If anyone has an example, please share it!
Maybe the split can be done in a Flat File stage via a UNIX command.
If anyone knows a suitable command, I'd welcome it.
If there are other ways to solve my problem, I'd welcome those too!
++
--------
GaNoU
--------
-
- Participant
- Posts: 407
- Joined: Mon Jun 27, 2005 8:54 am
- Location: Walker, Michigan
- Contact:
Re: How to divide a file into 8 parts
Will your records be fixed width? If not, do you require that each file contain a full record?
Re: How to divide a file into 8 parts
Yes, the records I have to generate are fixed width.
Ultramundane wrote: Will your records be fixed width? If not, do you require that each file contain a full record?
--------
GaNoU
--------
Re: How to divide a file into 8 parts
You might be able to use the External Target stage, like the example I gave in the wrapped stage thread. You can use the following code fragment to split the file into as many files as you specify, up to 99. The default record separator for awk is a line feed. If you need to change it, you can set RS and/or ORS to the value needed, in the same way that FS and OFS are set.
Example:
example_awk.ksh "myfile.txt" "8" "?"
The script reads its records from standard input and would produce up to 8 files. The suffix is zero-padded to two digits:
myfile.txt.01
myfile.txt.02
myfile.txt.03
myfile.txt.04
myfile.txt.05
myfile.txt.06
myfile.txt.07
myfile.txt.08
These files each get records in a round-robin fashion.
That is, for record number n (counting from 1), records where
( n - 1 ) % 8 = 0 would go to file 1
( n - 1 ) % 8 = 1 would go to file 2
( n - 1 ) % 8 = 2 would go to file 3
( n - 1 ) % 8 = 3 would go to file 4
( n - 1 ) % 8 = 4 would go to file 5
( n - 1 ) % 8 = 5 would go to file 6
( n - 1 ) % 8 = 6 would go to file 7
( n - 1 ) % 8 = 7 would go to file 8
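Code:
# example_awk.ksh: read records on standard input and deal them out
# round-robin to FILE.01, FILE.02, ... (up to 99 files).
# $1 = output file name prefix, $2 = number of files, $3 = field separator
awk -v FILE="${1}" -v SC="${2}" -v AFS="${3}" 'BEGIN { FC=0; FS=AFS; OFS=AFS; }
{
    # Wrap around to the first file after the last one.
    if ( FC >= SC )
    {
        FC=0;
    }
    FC=FC+1;
    # Zero-pad the suffix to two digits so the files sort in order.
    OFILE=sprintf("%s.%02d", FILE, FC);
    print $0 > OFILE;
}'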
Such is the beauty of UNIX: there are many ways to skin this cat.
The csplit command will also divide the file, on the premise that you want the first 1/8th of the rows in the first file, the next 1/8th in the second file, etc. The number of lines per file can easily be computed with command substitution, perhaps using "wc -l".
Try "man" on "csplit", "ksh" (if you are unsure about command substitution, and presupposing you are using ksh) and "wc".
David
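A minimal sketch of the csplit approach described above, assuming a POSIX shell and a csplit that accepts the -s, -f and -n options. The function name split_contiguous and the ".00", ".01", ... output names are illustrative choices, not something from this thread. It counts the lines with "wc -l" through command substitution, works out the line numbers where each part should start, and hands them to csplit. When the line count is not a multiple of the part count, the last part is shorter and there may be fewer parts than requested.

```shell
#!/bin/sh
# Sketch: cut a file into at most $2 contiguous, near-equal parts with csplit.
# The function name and the ".NN" output suffix are illustrative choices.
split_contiguous() {
    input=$1
    parts=$2

    # Command substitution: capture the line count reported by wc -l.
    # The arithmetic expansion strips any padding some wc versions add.
    total=$(( $(wc -l < "$input") ))

    # Lines per part, rounded up so no lines are left over.
    per=$(( (total + parts - 1) / parts ))

    # Collect the line numbers at which parts 2, 3, ... start,
    # skipping any that would fall beyond the end of the file.
    points=""
    i=1
    while [ "$i" -lt "$parts" ]; do
        start=$(( i * per + 1 ))
        if [ "$start" -le "$total" ]; then
            points="$points $start"
        fi
        i=$(( i + 1 ))
    done

    if [ -z "$points" ]; then
        # Too few lines to split: keep everything in a single part.
        cp "$input" "$input.00"
    else
        # -s: quiet, -f: output prefix, -n 2: two-digit suffixes (00, 01, ...).
        # $points is unquoted on purpose so each number is its own argument.
        csplit -s -f "$input." -n 2 "$input" $points
    fi
}
```

For example, split_contiguous TOTO.txt 8 on a 24-line file would write TOTO.txt.00 through TOTO.txt.07 with three lines each. With fixed-width records, equal line counts also mean equal file sizes.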
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There's no other syntax. The File Set control file has a name that conventionally ends in ".fs". The individual data files reside on the disk resource specified for each processing node in the configuration file.
Yes, I get up early - usually about 5:00 (too many years in the military!). 6:45 isn't early, though - I leave for work at 07:05.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Okay, understood.
I think the File Set stage doesn't fit my needs.
I thought it was something I could use to create a group of flat files in a particular directory, whereas it seems to be similar to the Data Set stage (a control file pointing to one or more data files).
It seems I have two ways to solve my problem:
1 - Creating a flat file and using the awk command above to divide it into parts.
2 - Using the Surrogate Key / Mod() function combination to load eight distinct files in a job.
Thanks
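For comparison, the routing arithmetic behind option 2 can be sketched outside DataStage. This is only a shell/awk analogue, not the DataStage job itself: it assumes each record starts with a positive integer surrogate key as its first whitespace-separated field, and the function name split_by_key_mod and the ".1" to ".8" output names are hypothetical.

```shell
#!/bin/sh
# Sketch: route each record to one of N files by (key mod N).
# Assumes the first whitespace-separated field is a positive integer key;
# the function name and the output naming are hypothetical.
split_by_key_mod() {
    input=$1
    parts=$2
    awk -v out="$input" -v n="$parts" '
    {
        # Keys 1, 2, 3, ... map to targets 1..n, i.e. Mod(key - 1, n) + 1.
        target = (($1 - 1) % n) + 1
        print > (out "." target)
    }' "$input"
}
```

Calling split_by_key_mod TOTO.txt 8 would send keys 1, 9, 17, ... to TOTO.txt.1, keys 2, 10, 18, ... to TOTO.txt.2, and so on. If the keys are consecutive, the eight files differ by at most one record.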
ray.wurlod wrote: There's no other syntax. The File Set control file has a name that conventionally ends in ".fs". The individual data files reside on the disk resource specified for each processing node in the configuration file.
Yes, I get up early - usually about 5:00 (too many years in the military!). 6:45 isn't early, though - I leave for work at 07:05.
--------
GaNoU
--------