How to divide a file into 8 parts

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

ganive
Participant
Posts: 18
Joined: Wed Sep 28, 2005 7:06 am

How to divide a file into 8 parts

Post by ganive »

Hi All,

I need to divide data into 8 distinct files, based on the file size.
Example :
1 - Data Source : A Flat File named "TOTO.txt"
2 - Many Transform, check, and other operations...
3 - Data Target : TOTO1.txt, TOTO2.txt... TOTO8.txt
(It would be great if each file had the same size, but I don't know if that's possible.)

To solve the problem, I wanted to use the "File Set" stage, where you can specify the "Maximum File Size" option.
My problem here is that I don't know how to create a File Set.
If anyone has an example, it would be welcome!

Maybe the split could be done in a Flat File stage via a UNIX command.
If anyone knows a suitable command, please share it.

If there are other ways to solve my problem, those are welcome too! ;)

++
--------
GaNoU
--------
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan

Re: How to divide a file into 8 parts

Post by Ultramundane »

Will your records be fixed width? If not, do you require that each file contain a full record?
ganive
Participant
Posts: 18
Joined: Wed Sep 28, 2005 7:06 am

Re: How to divide a file into 8 parts

Post by ganive »

Yes, the records I have to generate are fixed width.
--------
GaNoU
--------
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan

Re: How to divide a file into 8 parts

Post by Ultramundane »

You might be able to use the External Target stage, like the example I gave in the wrapped stage thread. You can use the following code fragment to split the file into as many files as you specify, up to 99. The default record separator for awk is a newline. If you need to change this, you can set RS and/or ORS to the value needed, in the same way as the FS and OFS settings.


Code:

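# example_awk.ksh: distribute records from stdin round-robin across SC files.
# $1 = base output file name, $2 = number of files (up to 99), $3 = field separator
awk -v FILE="${1}" -v SC="${2}" -v AFS="${3}" 'BEGIN { FC = 0; FS = AFS; OFS = AFS }
{
    # Wrap back to the first file after the last one has received a record
    if (FC >= SC) {
        FC = 0
    }
    FC = FC + 1

    # Zero-pad the suffix to two digits: FILE.01 ... FILE.99
    OFILE = sprintf("%s.%02d", FILE, FC)

    print $0 > OFILE
}'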
Example:
example_awk.ksh "myfile.txt" "8" "?"

Would produce up to 8 files. The script zero-pads suffixes below 10, so the files are named:

myfile.txt.01
myfile.txt.02
myfile.txt.03
myfile.txt.04
myfile.txt.05
myfile.txt.06
myfile.txt.07
myfile.txt.08

These files each get a record in round-robin fashion.
That is, for record number n (counting from 1), records where
( n - 1 ) % 8 = 0 would go to file 1
( n - 1 ) % 8 = 1 would go to file 2
( n - 1 ) % 8 = 2 would go to file 3
( n - 1 ) % 8 = 3 would go to file 4
( n - 1 ) % 8 = 4 would go to file 5
( n - 1 ) % 8 = 5 would go to file 6
( n - 1 ) % 8 = 6 would go to file 7
( n - 1 ) % 8 = 7 would go to file 8
djm
Participant
Posts: 68
Joined: Wed Mar 02, 2005 3:42 am
Location: N.Z.

Post by djm »

The beauty of Unix is that there are many ways to skin this cat.

The csplit command will achieve the same effect, on the premise that you want the first 1/8th of the rows in the first file, the next 1/8th in the second file, etc. The number of lines per file can easily be computed with command substitution, perhaps using "wc -l".

Try "man" on "csplit", "ksh" (if you are unsure about command substitution, and presupposing you are using ksh) and "wc".

David
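A minimal sketch of the contiguous-chunk idea described above, in case it helps. Note that it uses split(1) with a line count rather than csplit, because split takes the chunk size directly. The function name, the temporary directory and the "TOTO.part." prefix are illustrative assumptions, not something from this thread, and the input is assumed to be non-empty.

```shell
#!/bin/sh
# Sketch: split a file into N contiguous pieces of (nearly) equal line count.
# Uses split(1) instead of csplit; all names below are illustrative.
split_into_parts() {
    infile=$1    # input file (assumed non-empty)
    parts=$2     # number of output files wanted
    prefix=$3    # prefix for the output file names

    # Count the lines via command substitution; reading from stdin makes
    # wc print only the number, without the file name.
    total=$(wc -l < "$infile")

    # Round up so that no more than $parts files are produced.
    per_file=$(( (total + parts - 1) / parts ))

    # split appends aa, ab, ac, ... to the prefix.
    split -l "$per_file" "$infile" "$prefix"
}

# Demo on a throwaway 16-line file.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 16; i++) print "record " i }' > "$workdir/TOTO.txt"
split_into_parts "$workdir/TOTO.txt" 8 "$workdir/TOTO.part."
ls "$workdir"
```

With fixed-width records, equal line counts also mean equal file sizes, except possibly for the last file when the record count is not a multiple of 8.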
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You create a File Set by writing to a File Set stage.

Specify eight-way partitioning in a configuration file.

Run job.

Done.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djm
Participant
Posts: 68
Joined: Wed Mar 02, 2005 3:42 am
Location: N.Z.

Post by djm »

Okay, so that serves me right for dipping my toe into the DSEE forum!

I hope you aren't back in Oz Ray because if you are, you posted that reply no later than 6:45 a.m.!
ganive
Participant
Posts: 18
Joined: Wed Sep 28, 2005 7:06 am

Post by ganive »

What's the syntax used in a File Set configuration file?
I only know that the configuration file name should end in .fs :s
--------
GaNoU
--------
pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Post by pavankvk »

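Generate a unique sequence number using the Surrogate Key stage, then in a Transformer apply the Mod() function to it and route each row to one of eight outputs according to the result (Mod(n, 8) yields 0 through 7; add 1 if you want 1 through 8). You should have an 8-way configuration file.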
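For anyone who wants to see the sequence-number-plus-Mod() logic outside DataStage, here is a rough sketch in awk. It is not Transformer syntax: awk's built-in record counter NR stands in for the surrogate key, and the temporary directory and sample data are assumptions made for the demo. The TOTO1.txt to TOTO8.txt names follow the original question.

```shell
#!/bin/sh
# Sketch: route each record to one of 8 files using a modulus on its
# sequence number. NR (awk's record counter) plays the surrogate key role.
workdir=$(mktemp -d)

# Throwaway input: 20 records.
awk 'BEGIN { for (i = 1; i <= 20; i++) print "record " i }' > "$workdir/TOTO.txt"

awk -v parts=8 -v dir="$workdir" '{
    # (NR - 1) % parts yields 0..7; adding 1 gives file numbers 1..8.
    target = sprintf("%s/TOTO%d.txt", dir, (NR - 1) % parts + 1)
    print > target
}' "$workdir/TOTO.txt"

ls "$workdir"
```

Because rows are dealt out one at a time, the eight files differ by at most one record, which for fixed-width records keeps the sizes nearly equal.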
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There's no other syntax. The File Set control file has a name that conventionally ends in ".fs". The individual data files reside on the disk resource specified for each processing node in the configuration file.

Yes, I get up early - usually about 5:00 (too many years in the military!). 6:45 isn't early, though - I leave for work at 07:05.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ganive
Participant
Posts: 18
Joined: Wed Sep 28, 2005 7:06 am

Post by ganive »

Okay, understood.
I think the File Set stage doesn't fit my needs.
I thought it was something I could use to create a group of flat files in a particular directory, whereas it seems to be similar to the Data Set stage (a control file pointing to one or more data files).

It seems I have two ways to solve my problem:

1 - Creating a flat file and using the awk command above to divide it into parts.
2 - Using the Surrogate Key / Mod() function combination to load eight distinct files in one job.

Thanks

--------
GaNoU
--------