Hi,
We are processing a file that has more than 1,500 columns in it.
Please let me know what complications we might face while processing these types of files.
Our planned approach for processing:
1) Split the file into 5 pieces (for example, columns 1-300 in the first file, columns 301-600 in the second, and so on).
2) Add a row number to each split file, then after processing join them back together and load the result into the target/staging table.
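The column-split step could be sketched with a plain shell one-liner per piece; this toy example uses a 6-column sample and a '|' delimiter (the delimiter, column ranges, and file names are assumptions, the real job would cut columns 1-300, 301-600, etc.):

```shell
#!/bin/sh
# Build a tiny 6-column sample file (stand-in for the 1,500-column file).
printf 'a|b|c|d|e|f\n1|2|3|4|5|6\n' > input.dat

# Split it into column ranges with cut; the real job would use -f1-300, -f301-600, ...
cut -d'|' -f1-3 input.dat > part1.dat   # first column range
cut -d'|' -f4-6 input.dat > part2.dat   # second column range
```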
Please let me know if our approach will cause any issues later.
We can process the file without splitting it (I believe DataStage will not impose any limitations), but we want to make our lives easier while testing, so we thought of this approach.
Please advise if there is a better alternative.
Thanks for your help.
Thanks & Regards,
Shamanth VK
Processing a file which has more than 1500 columns
Adding the record number after splitting the file is not a good idea.
There are many ways the order of the records might change. For example:
a) Using the multiple readers option in the Sequential File stage will change the order in which records are read.
b) What if a record is dropped by a Transformer stage?
Even repartitioning may change the order of records.
To avoid all these problems, it is better to add a serial number to all records before splitting the file. Here is the algorithm:
a) Add a column, say "Serial_No", at the end of the original file.
b) Assign an incremental number to every record of the original file (the one containing 1,500 columns).
c) Once the serial numbers are assigned, split the file into files of 300 columns, making sure the "Serial_No" column from the original file is appended to each file.
This ensures that even if the record order changes or a record is dropped, the serial numbers still line up when you join the pieces back.
Steps a) and b) can easily be achieved with a simple shell script.
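Steps a) and b) could look something like this with awk; the '|' delimiter and file names are assumptions, adjust them to the real layout:

```shell
#!/bin/sh
# Build a tiny 2-column sample file (stand-in for the 1,500-column file).
printf 'x|y\nu|v\n' > original.dat

# Append a Serial_No column: NR is awk's running record number,
# and OFS='|' joins it to the original record with the same delimiter.
awk -F'|' -v OFS='|' '{ print $0, NR }' original.dat > numbered.dat
```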
And yes, @INROWNUM and @OUTROWNUM produce duplicate values when the job runs on more than one node.
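If you do need a unique number generated inside a parallel Transformer, one commonly suggested workaround (a sketch using the standard DataStage system variables; verify the behavior on your version and partition count) is to interleave the per-partition row number with the partition number so values never collide across nodes:

```
(@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1
```

Each partition then produces a distinct arithmetic sequence (partition 0 gives 1, 1+N, 1+2N, ...; partition 1 gives 2, 2+N, ...), so the combined output is unique, though not contiguous in file order.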
Nikhil