Processing a file which has more than 1500 columns

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Shamanth_vk
Premium Member
Posts: 9
Joined: Sat Jul 14, 2007 10:47 am

Processing a file which has more than 1500 columns

Post by Shamanth_vk »

Hi,

We are processing a file which has more than 1500 columns in it.

Please let me know what complications we might face while processing this type of file.

Planned approach for processing:

1) Split the file into 5 pieces (for example, columns 1-300 in the first file, columns 301-600 in the second file, and so on).
2) Add a row number to each split file, then after processing join the pieces back together and load them to the target/staging table.

Please let me know if our approach will cause any issues later.

We could process the file without splitting it (I believe DataStage does not impose any limitation here), but we want to make our lives easier while testing, so we thought of this approach.

Please advise if there is a better alternative.

Thanks for your help.

Thanks & Regards,
Shamanth VK
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Adding a ROWNUM in a parallel job will not work as desired.
You need to use database sequences, or run the stages in sequential mode.

Regards
Sreeni
nikhilanshuman
Participant
Posts: 58
Joined: Tue Nov 17, 2009 3:38 am

Post by nikhilanshuman »

Adding the record number after splitting the file is not a good idea.

There are many ways in which the order of the records can change.

For example:
a) Using the multiple readers option in the Sequential File stage will change the order in which the records are read.

b) What if a record is dropped by a Transformer stage?

Even repartitioning may change the order of records.

To avoid all these problems, it is always better to add a serial number to every record before splitting the file.

Here is the algorithm:

a) Add a column at the end of the original file, say "Serial_No".
b) Assign an incremental number to every record of the original file (the one containing 1500 columns).
c) Once the serial numbers are assigned, split the file into files of 300 columns each, making sure that the "Serial_No" column from the original file is appended to each piece.

This way, even if the order of records changes or a record is dropped, the pieces can still be matched back together on the serial number.

Steps a) and b) can be easily achieved using a simple shell script.
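Something along these lines could be a starting point (a rough sketch only, not tested; it assumes a comma-delimited file named input.csv with no embedded commas in the data, and the file names and column ranges are placeholders to adjust to your actual layout):

    # Steps a) and b): append an incremental serial number as the last column.
    # After this, the file has 1501 fields and Serial_No is field 1501.
    awk 'BEGIN { FS = OFS = "," } { print $0, NR }' input.csv > numbered.csv

    # Step c): split into column ranges, carrying Serial_No into every piece.
    cut -d',' -f1-300,1501     numbered.csv > part1.csv
    cut -d',' -f301-600,1501   numbered.csv > part2.csv
    cut -d',' -f601-900,1501   numbered.csv > part3.csv
    cut -d',' -f901-1200,1501  numbered.csv > part4.csv
    cut -d',' -f1201-1500,1501 numbered.csv > part5.csv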

And yes, @INROWNUM and @OUTROWNUM produce duplicate values when the job runs on more than one node, since they are per-partition counters.
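(As an aside, a common workaround for generating unique values in a parallel Transformer is a partition-aware derivation such as @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS. That gives values that are unique across partitions, but they are not reproducible between runs, so it is no substitute for numbering the records before the split as described above.)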
Nikhil