Issues in selecting the last occurrence in a DataStage job
Moderators: chulett, rschirm, roy
Hi,
I have a DataStage job that uses a dataset as its source (the dataset is in turn created from a file that comes from the mainframe). The job has to remove duplicates based on a column and retain the last occurrence.
In the mainframe file there are 3 occurrences of the same column value; some of the other columns differ across these occurrences. When the DataStage job completes, it loads the unique records to a dataset and then inserts them into a table.
The issue is that the last-occurrence record I see in the mainframe file is different from the one I see in the table.
Sometimes the job picks the first occurrence and sometimes the last. The configuration file I use has 4 nodes.
Can someone please explain why the job is not picking the last occurrence correctly?
thanks,
Vij
-
- Premium Member
- Posts: 278
- Joined: Wed Oct 03, 2007 8:45 am
The job flow is like this:
Dataset1->copy stage->remove duplicate ->dataset2
Dataset1 has records with duplicate key columns. The Copy stage hash-partitions on the key column; the Remove Duplicates stage uses the same partitioning, sorts on the key column, and is set to retain the last occurrence before writing to Dataset2. Dataset2 is where the issue shows up: sometimes it retains the last occurrence and sometimes the first.
thanks,
Vij
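To see why this can happen, here is a minimal Python sketch (outside DataStage, with hypothetical data and function names) of what hash partitioning plus "retain last" does: hash partitioning guarantees that all records with the same key land on the same node, but it says nothing about the order in which they arrive there, so "last occurrence" can change from run to run.

```python
import hashlib

def node_for(key, nodes=4):
    # Simulate hash partitioning on a 4-node configuration:
    # the same key value always maps to the same node.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % nodes

def keep_last(records, key_field):
    # Remove Duplicates with "duplicate to retain = Last": for each key,
    # keep whichever record the stage happens to see last.
    seen = {}
    for rec in records:
        seen[rec[key_field]] = rec
    return list(seen.values())

# Three records sharing one key but differing in other columns
# (hypothetical data standing in for the mainframe file).
recs = [
    {"id": 100, "val": "A"},
    {"id": 100, "val": "B"},
    {"id": 100, "val": "C"},
]

# All three land on the same node, so co-location is guaranteed...
assert len({node_for(r["id"]) for r in recs}) == 1

# ...but arrival order on that node is not. Two runs that deliver the
# same records in different orders retain different "last" records.
run1 = keep_last(recs, "id")
run2 = keep_last(list(reversed(recs)), "id")
print(run1[0]["val"], run2[0]["val"])  # C A
```

Sorting on the key column alone does not help here: all three records have the same key, so the sort leaves their relative order wherever the parallel framework put it.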
Dataset1->copy stage->remove duplicate ->dataset2
Is Dataset1 a DataStage dataset or a sequential file FTP'd from the mainframe?
If it is a DataStage dataset, how is it partitioned?
Try using some Peek stages to see the data, or add an output from the Copy stage to a file to see whether it indicates the order of the data.
Since this is a parallel job using a 4-node configuration file, you will need to control the sort and partitioning as the data flows through the job stream to guarantee the order and get the results you are looking for.
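Controlling the sort in practice means sorting on the key plus a column that actually defines record order (a sequence number or timestamp from the mainframe file, for example). A minimal sketch of that fix, assuming a hypothetical `seq` column carries the original file order:

```python
def keep_last_deterministic(records, key_field, order_field):
    # Sort on the key plus an explicit ordering column before
    # deduplicating, so "last occurrence" has a defined meaning
    # regardless of the order records arrive on the node.
    ordered = sorted(records, key=lambda r: (r[key_field], r[order_field]))
    seen = {}
    for rec in ordered:
        seen[rec[key_field]] = rec
    return list(seen.values())

# Records arrive out of order, but the seq column records their
# original position in the mainframe file (hypothetical data).
recs = [
    {"id": 100, "seq": 2, "val": "B"},
    {"id": 100, "seq": 3, "val": "C"},
    {"id": 100, "seq": 1, "val": "A"},
]

# Whatever order the records arrive in, the record with the highest
# sequence number per key is retained.
print(keep_last_deterministic(recs, "id", "seq")[0]["val"])  # C
```

In DataStage terms, that corresponds to adding the ordering column as a secondary sort key on the input to the Remove Duplicates stage, while still hash-partitioning on the key column alone.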