Issues in selecting the last occurrence in a DataStage job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Issues in selecting the last occurrence in a DataStage job

Post by vij »

Hi,

I have a DataStage job that uses a dataset as its source (this dataset is in turn created from a file that comes from the mainframe). The job has to remove duplicates based on a column and retain the last occurrence.

In the mainframe file there are 3 occurrences of the same key value; some of the other columns differ between these occurrences. Once the deduplication is done, the job loads the unique records to a dataset and then inserts them into a table.

The issue is that the last-occurrence record I see in the mainframe file is different from the one I see in the table.

Sometimes the job picks the first occurrence, sometimes the last. The configuration file I use has 4 nodes.

Can someone please explain why the job is not picking the last occurrence correctly?

thanks,
Vij
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

What are your partition and sort keys?
Michael Gohl
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

mikegohl, thanks for replying.

It is hash partitioned on the key column and there is no sorting.
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Can you provide some more input on your job stream, stages, sorts and partitions?
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

the job flow is like this:

Dataset1->copy stage->remove duplicate ->dataset2

Dataset1 has records with duplicate key values. The Copy stage hash-partitions on the key column; the Remove Duplicates stage uses the same partitioning, sorts on the key column, and is set to retain the last occurrence; the output is written to Dataset2. Dataset2 is where the issue shows up: sometimes it removes the duplicates and retains the last occurrence, and sometimes it retains the first.

thanks,
Vij
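[Editor's note] The job flow vij describes can be sketched in plain Python. This is not DataStage code; the column names (`id`, `ver`), the `NODES` count, and the helper functions are all illustrative. The point it shows: hash partitioning guarantees that rows with the same key meet on the same node, but "retain last" is only well defined relative to some arrival order, which a parallel job does not guarantee without a sort.

```python
from collections import defaultdict

NODES = 4  # matches the 4-node configuration file mentioned above

def hash_partition(records, key):
    """Toy stand-in for hash partitioning: same key value -> same node."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % NODES].append(rec)
    return parts

def dedup_keep_last(parts, key):
    """Per-partition Remove Duplicates with 'retain last': later arrivals win."""
    out = {}
    for node in sorted(parts):
        for rec in parts[node]:
            out[rec[key]] = rec  # whichever record arrives last overwrites
    return list(out.values())

# Three occurrences of key 'A' differing in another column, as in the post.
rows = [{"id": "A", "ver": 1}, {"id": "A", "ver": 2}, {"id": "A", "ver": 3}]
kept = dedup_keep_last(hash_partition(rows, "id"), "id")
# All 'A' rows land on one node, but which of them is "last" there depends
# purely on arrival order -- in a real parallel job that order can vary from
# run to run, which is why the result flips between first and last occurrence.
```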
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

If by "last" you mean the original order in the mainframe file, you are going to have to work harder to preserve that order.
Michael Gohl
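[Editor's note] A common way to "work harder", sketched here in Python rather than as DataStage stages: tag each row with its position in the source file before any partitioning happens, then sort on (key, position) so "last" means "last in the file" on every node. The column names `id` and `seq` are illustrative assumptions, not from the original job.

```python
def dedup_last_by_file_order(records, key):
    """Keep, per key, the record that appeared last in the original file.

    Sketch of the usual fix: capture the file position as an explicit column
    before partitioning, then sort on (key, position) so 'last' is well
    defined regardless of how partitioning scrambled arrival order.
    """
    # 1. Tag each record with its original row number.
    tagged = [dict(rec, seq=i) for i, rec in enumerate(records)]
    # 2. Sort on key, then sequence.
    tagged.sort(key=lambda r: (r[key], r["seq"]))
    # 3. Keep the last record per key (highest seq wins).
    out = {}
    for rec in tagged:
        out[rec[key]] = rec
    # 4. Drop the helper column before writing the result.
    return [{k: v for k, v in rec.items() if k != "seq"} for rec in out.values()]

rows = [{"id": "A", "ver": 1}, {"id": "B", "ver": 9},
        {"id": "A", "ver": 2}, {"id": "A", "ver": 3}]
deduped = dedup_last_by_file_order(rows, "id")
```

In a DataStage job the equivalent idea is to generate the sequence column while the data is still in its original, sequential order (e.g. at the source stage), then sort on key plus sequence ahead of the Remove Duplicates stage.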
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

OK, so this means I can't preserve the same ordering in DataStage?
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

You can, but you have to think very carefully about it.
Michael Gohl
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Dataset1->copy stage->remove duplicate ->dataset2

Is Dataset1 a DataStage dataset or a sequential file FTP'd from the mainframe?

If it is a DataStage dataset, how is it partitioned?

Try using some Peek stages to see the data, or add an output from the Copy stage to a file, to check the order of the data.

Since this is a parallel job using a 4-node configuration file, you will need to control the sorting and partitioning explicitly to guarantee the order of the data as it flows through the job stream and get the results you are looking for.
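[Editor's note] The Peek-stage suggestion can be mimicked in a few lines of Python. Everything here is a toy model, not DataStage itself: `simulate_partitions` stands in for the 4-node hash partitioner and `peek` for a Peek stage, so you can see which rows land on which node and in what arrival order.

```python
from collections import defaultdict

def simulate_partitions(records, key, nodes=4):
    """Toy stand-in for a 4-node hash partitioner (not DataStage itself)."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % nodes].append(rec)
    return parts

def peek(parts, n=10):
    """Rough analogue of a Peek stage: first n rows per node, arrival order."""
    return [f"node {node}: {rec}" for node in sorted(parts)
            for rec in parts[node][:n]]

rows = [{"id": "A", "ver": 1}, {"id": "B", "ver": 9}, {"id": "A", "ver": 2}]
report = peek(simulate_partitions(rows, "id"))
# Rows sharing a key always report the same node; the order within a node is
# whatever order they arrived in, which is exactly what a real Peek exposes.
```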
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Also check whether DataStage is inserting sorts of its own. Dump the score (set the APT_DUMP_SCORE environment variable) to see what DataStage is actually doing, i.e. whether it is inserting sorts or repartitioning in the job stream.
Post Reply