Hi,
We are processing a file that has more than 1,500 columns in it.
Please let me know what complications we might face while processing these types of files.
Our planned approach for processing:
1) Split the file into 5 pieces (for example, columns 1-300 in the first file, columns 301-600 in the second, and so on).
2) Add a row number to each split file, then after processing join them back together and load the result into the target/staging table.
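The column-split step could be sketched with a plain shell one-liner per piece; this toy example uses a 6-column sample and a '|' delimiter (the delimiter, column ranges, and file names are assumptions, the real job would cut columns 1-300, 301-600, etc.):

```shell
#!/bin/sh
# Build a tiny 6-column sample file (stand-in for the 1,500-column file).
printf 'a|b|c|d|e|f\n1|2|3|4|5|6\n' > input.dat

# Split it into column ranges with cut; the real job would use -f1-300, -f301-600, ...
cut -d'|' -f1-3 input.dat > part1.dat   # first column range
cut -d'|' -f4-6 input.dat > part2.dat   # second column range
```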
Please let me know if our approach will cause any issues later.
We can process the file without splitting it (I believe DataStage will not impose any limitations), but we want to make our lives easier while testing, so we thought of this approach.
Please advise if there is a better alternative.
Thanks for your help.
Thanks & Regards,
Shamanth VK
Processing a file which has more than 1500 columns
Adding the record number after splitting the file is not a good idea.
There are many ways the order of the records might change. For example:
a) Using the multiple readers option in the Sequential File stage will change the order in which records are read.
b) What if a record is dropped by a Transformer stage?
Even repartitioning may change the order of records.
To avoid all these problems, it is better to add a serial number to all records before splitting the file. Here is the algorithm:
a) Add a column, say "Serial_No", at the end of the original file.
b) Assign an incremental number to every record of the original file (the one containing 1,500 columns).
c) Once the serial numbers are assigned, split the file into files of 300 columns, making sure the "Serial_No" column from the original file is appended to each file.
This ensures that even if the record order changes or a record is dropped, the serial numbers still line up when you join the pieces back.
Steps a) and b) can easily be achieved with a simple shell script.
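Steps a) and b) could look something like this with awk; the '|' delimiter and file names are assumptions, adjust them to the real layout:

```shell
#!/bin/sh
# Build a tiny 2-column sample file (stand-in for the 1,500-column file).
printf 'x|y\nu|v\n' > original.dat

# Append a Serial_No column: NR is awk's running record number,
# and OFS='|' joins it to the original record with the same delimiter.
awk -F'|' -v OFS='|' '{ print $0, NR }' original.dat > numbered.dat
```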
And yes, @INROWNUM and @OUTROWNUM produce duplicate values when the job runs on more than one node.
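If you do need a unique number generated inside a parallel Transformer, one commonly suggested workaround (a sketch using the standard DataStage system variables; verify the behavior on your version and partition count) is to interleave the per-partition row number with the partition number so values never collide across nodes:

```
(@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1
```

Each partition then produces a distinct arithmetic sequence (partition 0 gives 1, 1+N, 1+2N, ...; partition 1 gives 2, 2+N, ...), so the combined output is unique, though not contiguous in file order.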
Nikhil