
Issues in selecting the last occurrence in the DataStage job

Posted: Thu Apr 30, 2009 11:25 am
by vij
Hi,

I have a DataStage job which uses a dataset as the source (this in turn is created from a file that comes from the mainframe), and the job has to remove duplicates based on a column, retaining the last occurrence.

In the mainframe file there are 3 occurrences of the same key value; some of the other columns differ across these occurrences. Once complete, the job loads the unique records to a dataset and then inserts them into a table.

The issue here is that the last-occurrence record I see in the mainframe file is different from the one I see in the table.

Sometimes the job picks the first occurrence and sometimes the last occurrence. The configuration file I use has 4 nodes.

Can someone please explain why the job is not picking the last occurrence correctly?

thanks,
Vij

Posted: Thu Apr 30, 2009 12:40 pm
by mikegohl
What are your partition and sort keys?

Posted: Thu Apr 30, 2009 1:19 pm
by vij
mikegohl, thanks for replying.

The key column is hash partitioned and there is no sorting.

Posted: Thu Apr 30, 2009 1:34 pm
by sjfearnside
Can you provide some more input on your job stream, stages, sorts and partitions?

Posted: Thu Apr 30, 2009 1:59 pm
by vij
the job flow is like this:

Dataset1->copy stage->remove duplicate ->dataset2

Dataset1 has records with duplicate key columns. The copy stage hash partitions on the key column; the Remove Duplicates stage uses the same partitioning, sorts on the key column, and is set to retain the last occurrence. The output is written to dataset2, and that is where the issue shows up: sometimes it removes duplicates and retains the last occurrence, and sometimes it retains the first occurrence.
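The behaviour described above can be reproduced outside DataStage: when a keep-last dedup runs over records whose order within a key group is not pinned down by a sort key, "last" just means "whatever happened to arrive last". A minimal Python sketch (the record layout and values are invented for illustration); note that sorting on the key alone does not help, because all duplicates compare equal on the key and a stable sort preserves their arbitrary arrival order:

```python
# Keep-last dedup: for each key, the last record seen wins.
def dedupe_keep_last(records):
    survivors = {}
    for rec in records:
        survivors[rec["key"]] = rec  # a later arrival overwrites an earlier one
    return survivors

# Three occurrences of key "A", distinguished only by a payload column.
recs = [
    {"key": "A", "payload": 1},
    {"key": "A", "payload": 2},
    {"key": "A", "payload": 3},
]

# If partitioning/collection changes the arrival order, "last" changes too.
original_order = recs
shuffled_order = [recs[2], recs[0], recs[1]]

# Sorting on the key alone is stable but leaves equal keys in arrival order.
print(dedupe_keep_last(sorted(original_order, key=lambda r: r["key"]))["A"]["payload"])  # 3
print(dedupe_keep_last(sorted(shuffled_order, key=lambda r: r["key"]))["A"]["payload"])  # 2
```

With a 4-node hash partition and no secondary sort criterion, the records for one key land on the same node but in no guaranteed order, which matches the first-sometimes/last-sometimes symptom.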

thanks,
Vij

Posted: Thu Apr 30, 2009 2:07 pm
by mikegohl
If you are talking about the original order on the mainframe, you are going to have to work harder to keep the original order.

Posted: Thu Apr 30, 2009 2:21 pm
by vij
OK, so this means that I can't preserve the same ordering in DataStage?

Posted: Thu Apr 30, 2009 2:27 pm
by mikegohl
You can, but you have to think very carefully about it.

Posted: Thu Apr 30, 2009 2:33 pm
by sjfearnside
Dataset1->copy stage->remove duplicate ->dataset2

Is Dataset1 a DataStage dataset or a sequential file FTP'd from the mainframe?

If it is a DataStage dataset, then how is it partitioned?

Try using some Peek stages to see the data, or add an output from the copy stage to a file to check the order of the data.

Since this is a parallel job using a 4-node configuration file, you will need to control the sort and partitioning to guarantee the order of the data as it flows through the job stream and ensure you get the results you are looking for.
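One common way to make "last occurrence" deterministic (this is the general technique, not specific DataStage syntax) is to tag each record with its position in the original file before any repartitioning, then sort on the key plus that sequence number, so the true last record always sorts last within its key group. A Python sketch, with invented field names:

```python
# Tag each record with its position in the original mainframe file ("seq"),
# then sort on (key, seq): the genuine last occurrence is now last within
# its key group no matter how records were redistributed across nodes.
def dedupe_keep_true_last(records):
    ordered = sorted(records, key=lambda r: (r["key"], r["seq"]))
    survivors = {}
    for rec in ordered:
        survivors[rec["key"]] = rec  # last in (key, seq) order wins
    return list(survivors.values())

# Simulate records arriving in a scrambled order after repartitioning.
scrambled = [
    {"key": "A", "seq": 2, "payload": "third"},
    {"key": "A", "seq": 0, "payload": "first"},
    {"key": "A", "seq": 1, "payload": "second"},
]

print(dedupe_keep_true_last(scrambled)[0]["payload"])  # third
```

In the job itself the equivalent would be generating the sequence number while the file is still read sequentially (before the 4-node partitioning takes effect) and including it as a secondary sort key on the Remove Duplicates input.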

Posted: Thu Apr 30, 2009 2:38 pm
by sjfearnside
Also check to see if DataStage is inserting sorts. Use a dump score to see what DataStage is doing, i.e. is it inserting sorts or repartitioning in the job stream?
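For reference, the dump score comes from the $APT_DUMP_SCORE environment variable. It is usually enabled as a job or project parameter through Administrator/Designer, but for a quick test run from the command line something like the following might work (the project and job names here are placeholders):

```shell
# APT_DUMP_SCORE=True makes the parallel engine write the job "score"
# (inserted tsort operators, repartitioning, operator-to-node mapping)
# into the job log, where you can see what DataStage added on its own.
# "MyProject" and "MyJob" are placeholder names.
export APT_DUMP_SCORE=True
dsjob -run -jobstatus MyProject MyJob
```

Look in the score for any `tsort` operators DataStage inserted that you did not define; those are where the engine is re-ordering your data.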