Split a source file into multiple files dynamically

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
its_jkm
Premium Member
Premium Member
Posts: 4
Joined: Mon Aug 08, 2005 9:26 am

Split a source file into multiple files dynamically

Post by its_jkm »

Hi ,

I have following requirement:

I have source file with 10 columns.
I need to split the file into target files to be used downstream dynamically based on the selective columns specified in a metadata table for the target files

Can you please let me know how this can be done in datastage.

Example:

Code: Select all

Source file: A.csv

empno, empname,age,telephone number,salary

Metadata Table 1:

Source file Source column 
A.csv         empno
              empname
              age
              telephone 
              Salary

Metadata Table 2:

Source file Target File Target Column 
A.csv        X.csv         empno
                           empname
A.csv        Y.csv         empno
                           age 
                           salary 
So in this scenario I need the job to split source file a.csv into x.csv and y.csv which can increase to n number of files with different set of columns.

It need to be a generic job.
its_jkm
Premium Member
Premium Member
Posts: 4
Joined: Mon Aug 08, 2005 9:26 am

Post by its_jkm »

to clarify:

x.csv should have data for empno,empname
y.csv should have data for empno,age,salary
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Define what you mean by 'generic' here, please.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mouthou
Participant
Posts: 208
Joined: Sun Jul 04, 2004 11:57 pm

Re: Split a source file into multiple files dynamically

Post by mouthou »

As far as my understanding goes, I guess this may be achieved by using MODIFY stage which handles the dynamic selection & dropping of columns. Please check the documentation for the stage usage.

If the DataStage implementation gets complicated, as another resort unix scripting will also help as the source is the file and its the matter of picking the column wanted. Just pass the field# to Cut command -f option or awk command

source file
-bash-4.3$ cat a.txt
col1,col2,col3,col4
1,2,3,4
11,22,33,44


-bash-4.3$ cut -d',' -f1,4 a.txt
col1,col4
1,4
11,44
-bash-4.3$
its_jkm
Premium Member
Premium Member
Posts: 4
Joined: Mon Aug 08, 2005 9:26 am

Re: Split a source file into multiple files dynamically

Post by its_jkm »

Generic mean...

We need a single job that read input file once and splits into target files.

There will be different set of input files with different metadata so the job should read the input file based on the metadata defined and split into target files as defined in the meteadata table.

The example I gave is just one instance.
Example 2:

Input file
B.csv
company name, turnover, location, profilt,COMPANY CODE

Output files:

xx.csv
company code,company name

yy.csv
company code, location,turnover

zz.csv
companycode, company name, profit



In awk as well, can we read the file ones and split into multiple target files depending on the requirement defined in the metadata table?

please let me know if you need any further info
naveenkumar.ssn
Participant
Posts: 36
Joined: Thu Dec 03, 2009 9:11 pm
Location: Malaysia

Re: Split a source file into multiple files dynamically

Post by naveenkumar.ssn »

Hi,

This can be performed using schema files in datastage.

Create the schema file for the target as per your requirement.

Regards
Naveen
Naveen Kumar
Datastage Consultant
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

You could also try using a combination of a peek stage and a lookup then outputting the values to an external target stage.

Eg: Set a peek to return all records, all columns, all partitions, and the output mode to Output (Rather than log), Show column names should be true. This will output the data in the format
ColumnName:Value|ColumnName:Value etc (where | is the delimter as set in the peek stage (under options)). You should then be able to split the values into column name and value columns in a transformer stage, and perform a subsequent lookup on the column name, keeping only those where there is a match. Also return the target file name.

If you then concat all the required output columns and put the fully qualified target file name as the first column, then you should be able to then use an external target stage with a simple awk command to split the records into as many files you require.
Post Reply