How to divide a file into 8 parts

Posted: Mon Dec 05, 2005 9:41 am
by ganive
Hi All,

I need to divide data into 8 distinct files, based on the file size.
Example :
1 - Data Source : A Flat File named "TOTO.txt"
2 - Many transforms, checks, and other operations...
3 - Data Target : TOTO1.txt, TOTO2.txt... TOTO8.txt
(It would be great if each file had the same size, but I don't know if that's possible).

To solve the problem, I wanted to use the "File Set" stage, where you can specify the "Maximum File Size" option.
My problem is that I don't know how to create a File Set.
If anyone has an example, it would be welcome !!

Maybe the split can be done in a Flat File stage via a UNIX command.
If anyone knows a suitable command, that would be welcome.

If there are other possibilities to resolve my problem... you're welcome too !! ;)

++

Re: How to divide a file into 8 parts

Posted: Mon Dec 05, 2005 9:58 am
by Ultramundane
Will your records be fixed width? If not, do you require that each file contain a full record?

Re: How to divide a file into 8 parts

Posted: Mon Dec 05, 2005 10:04 am
by ganive
Yes, the records I have to generate are fixed width.
Ultramundane wrote:Will your records be fixed width? If not, do you require that each file contain a full record?

Re: How to divide a file into 8 parts

Posted: Mon Dec 05, 2005 11:51 am
by Ultramundane
You might be able to use the External Target stage, like the example I gave in the wrapped stage thread. You can use the following code fragment to split the file into as many files as you specify, up to 99. The default record separator for awk is a newline. If you need to change this, you can set RS and/or ORS to the value needed, in the same way as the FS and OFS settings.


Code:

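# example_awk.ksh: distribute records read from stdin round-robin across files.
# Arguments: $1 = base file name, $2 = number of files (up to 99), $3 = field separator
awk -v FILE="${1}" -v SC="${2}" -v AFS="${3}" 'BEGIN { FC = 0; FS = AFS; OFS = AFS }
{
    # Wrap the file counter once every output file has received a record.
    if (FC >= SC) {
        FC = 0
    }
    FC = FC + 1
    # Zero-pad the suffix to two digits: FILE.01, FILE.02, ...
    OFILE = sprintf("%s.%02d", FILE, FC)
    print $0 > OFILE
}'

Example:
example_awk.ksh "myfile.txt" "8" "?"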

This would produce up to 8 files. The suffix is zero-padded to two digits:

myfile.txt.01
myfile.txt.02
myfile.txt.03
myfile.txt.04
myfile.txt.05
myfile.txt.06
myfile.txt.07
myfile.txt.08

Records are written to these files in round-robin fashion.
That is, for record number n (counting from 1):
( n - 1 ) % 8 = 0 goes to file 1
( n - 1 ) % 8 = 1 goes to file 2
( n - 1 ) % 8 = 2 goes to file 3
( n - 1 ) % 8 = 3 goes to file 4
( n - 1 ) % 8 = 4 goes to file 5
( n - 1 ) % 8 = 5 goes to file 6
( n - 1 ) % 8 = 6 goes to file 7
( n - 1 ) % 8 = 7 goes to file 8
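
To see the round-robin assignment without writing any files, here is a small sketch that only prints which file each record number would land in. The function name route_demo is made up for illustration and is not part of the script above; it assumes record n goes to file ((n - 1) % SC) + 1, which is what the FC counter in the awk code works out to.

```shell
# route_demo COUNT FILES: hypothetical helper, for illustration only.
# Prints the target file number for record numbers 1..COUNT.
route_demo() {
    awk -v n="$1" -v sc="$2" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "record %d -> file %02d\n", i, ((i - 1) % sc) + 1
    }'
}

route_demo 10 8
```

With 10 records and 8 files, record 9 wraps around to file 01 and record 10 goes to file 02.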

Posted: Mon Dec 05, 2005 1:29 pm
by djm
The beauty of Unix is that there are many ways to skin this cat.

The csplit command will achieve a similar result, provided you want the first eighth of the rows in the first file, the next eighth in the second file, and so on. The number of lines per file can easily be computed with command substitution, perhaps using "wc -l".

Try "man" on "csplit", "ksh" (if you are unsure about command substitution, and presupposing you are using ksh) and "wc".

David
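
The line-count idea can be sketched roughly as below. Note that this is not exactly the csplit approach: it swaps in "split -l", because split takes a lines-per-file count directly. The function name split_into_parts is made up for illustration, and the sketch assumes one record per line.

```shell
# split_into_parts FILE PARTS: hypothetical helper, for illustration only.
# Splits FILE into at most PARTS contiguous pieces named FILE.aa, FILE.ab, ...
split_into_parts() {
    file=$1
    parts=$2
    # Command substitution with wc -l; $(( )) strips any padding wc may add.
    total=$(( $(wc -l < "$file") ))
    # Nothing to split in an empty file.
    [ "$total" -gt 0 ] || return 1
    # Lines per piece, rounded up so that no more than PARTS files are created.
    per=$(( (total + parts - 1) / parts ))
    split -l "$per" "$file" "$file."
}
```

Calling split_into_parts TOTO.txt 8 would write TOTO.txt.aa through TOTO.txt.ah when the line count divides evenly by 8. Otherwise the last file is shorter, and for small inputs fewer than 8 files may be produced.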

Posted: Mon Dec 05, 2005 1:46 pm
by ray.wurlod
You create a File Set by writing to a File Set stage.

Specify eight-way partitioning in a configuration file.

Run job.

Done.

Posted: Mon Dec 05, 2005 2:24 pm
by djm
Okay, so that serves me right for dipping my toe into the DSEE forum!

I hope you aren't back in Oz Ray because if you are, you posted that reply no later than 6:45 a.m.!

Posted: Mon Dec 05, 2005 3:33 pm
by ganive
What's the syntax used in a File Set configuration file?
I only know that the configuration file's name should end in .fs :s
ray.wurlod wrote:You create a File Set by writing to a File Set stage.

Specify eight-way partitioning in a configuration file.

Run job.

Done.

Posted: Mon Dec 05, 2005 4:20 pm
by pavankvk
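Generate a unique sequence number using the Surrogate Key stage, then in a Transformer apply the Mod() function to it and route each row according to the result, one output per value (1 through 8, or 0 through 7 if you use the remainder of a division by 8 directly). You should have an 8-way configuration file.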

Posted: Mon Dec 05, 2005 7:49 pm
by ray.wurlod
There's no other syntax. The File Set control file has a name that conventionally ends in ".fs". The individual data files reside on the disk resource specified for each processing node in the configuration file.

Yes, I get up early - usually about 5:00 (too many years in the military!). 6:45 isn't early, though - I leave for work at 07:05.

Posted: Tue Dec 06, 2005 4:17 am
by ganive
Okay, understood.
I think the File Set stage doesn't fit my needs.
I thought it was something I could use to create a group of flat files in a particular directory, whereas it seems to be similar to the Data Set stage (a control file pointing to one or more data files).

It seems I have 2 ways to solve my problem :

1 - Creating a flat file and using the awk command above to divide it into parts.
2 - Using the Surrogate Key / Mod() function combination to load eight distinct files in one job.

ThX

ray.wurlod wrote:There's no other syntax. The File Set control file has a name that conventionally ends in ".fs". The individual data files reside on the disk resource specified for each processing node in the configuration file.

Yes, I get up early - usually about 5:00 (too many years in the military!). 6:45 isn't early, though - I leave for work at 07:05.