Issues in selecting the last occurrence in a DataStage job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Issues in selecting the last occurrence in a DataStage job

Post by vij »

Hi,

I have a DataStage job that uses a dataset as its source (this dataset is in turn created from a file that comes from the mainframe). The job has to remove duplicates based on a column and retain the last occurrence.

In the mainframe file there are 3 occurrences of the same key value; some of the other columns differ between these occurrences. Once the deduplication is done, the job loads the unique records to a dataset and then inserts them into a table.

The issue is that the last-occurrence record I see in the mainframe file is different from the one I see in the table.

Sometimes the job picks the first occurrence, sometimes the last. The configuration file I use has 4 nodes.

Can someone please explain why the job is not picking the last occurrence correctly?

thanks,
Vij
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

What are your partition and sort keys?
Michael Gohl
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

mikegohl, thanks for replying.

It is hash partitioned on the key column and there is no sorting.
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Can you provide some more input on your job stream, stages, sorts and partitions?
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

the job flow is like this:

Dataset1->copy stage->remove duplicate ->dataset2

Dataset1 has records with duplicate key values. The Copy stage hash-partitions on the key column; the Remove Duplicates stage uses the same partitioning, sorts on the key column, and is set to retain the last occurrence; the output is written to Dataset2. Dataset2 is where the issue shows up: sometimes it removes the duplicates and retains the last occurrence, and sometimes it retains the first.

thanks,
Vij
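[Editor's note] The job flow vij describes can be sketched in plain Python. This is not DataStage code; the column names (`id`, `ver`), the `NODES` count, and the helper functions are all illustrative. The point it shows: hash partitioning guarantees that rows with the same key meet on the same node, but "retain last" is only well defined relative to some arrival order, which a parallel job does not guarantee without a sort.

```python
from collections import defaultdict

NODES = 4  # matches the 4-node configuration file mentioned above

def hash_partition(records, key):
    """Toy stand-in for hash partitioning: same key value -> same node."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % NODES].append(rec)
    return parts

def dedup_keep_last(parts, key):
    """Per-partition Remove Duplicates with 'retain last': later arrivals win."""
    out = {}
    for node in sorted(parts):
        for rec in parts[node]:
            out[rec[key]] = rec  # whichever record arrives last overwrites
    return list(out.values())

# Three occurrences of key 'A' differing in another column, as in the post.
rows = [{"id": "A", "ver": 1}, {"id": "A", "ver": 2}, {"id": "A", "ver": 3}]
kept = dedup_keep_last(hash_partition(rows, "id"), "id")
# All 'A' rows land on one node, but which of them is "last" there depends
# purely on arrival order -- in a real parallel job that order can vary from
# run to run, which is why the result flips between first and last occurrence.
```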
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

If by "last" you mean the original order in the mainframe file, you are going to have to work harder to preserve that order.
Michael Gohl
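[Editor's note] A common way to "work harder", sketched here in Python rather than as DataStage stages: tag each row with its position in the source file before any partitioning happens, then sort on (key, position) so "last" means "last in the file" on every node. The column names `id` and `seq` are illustrative assumptions, not from the original job.

```python
def dedup_last_by_file_order(records, key):
    """Keep, per key, the record that appeared last in the original file.

    Sketch of the usual fix: capture the file position as an explicit column
    before partitioning, then sort on (key, position) so 'last' is well
    defined regardless of how partitioning scrambled arrival order.
    """
    # 1. Tag each record with its original row number.
    tagged = [dict(rec, seq=i) for i, rec in enumerate(records)]
    # 2. Sort on key, then sequence.
    tagged.sort(key=lambda r: (r[key], r["seq"]))
    # 3. Keep the last record per key (highest seq wins).
    out = {}
    for rec in tagged:
        out[rec[key]] = rec
    # 4. Drop the helper column before writing the result.
    return [{k: v for k, v in rec.items() if k != "seq"} for rec in out.values()]

rows = [{"id": "A", "ver": 1}, {"id": "B", "ver": 9},
        {"id": "A", "ver": 2}, {"id": "A", "ver": 3}]
deduped = dedup_last_by_file_order(rows, "id")
```

In a DataStage job the equivalent idea is to generate the sequence column while the data is still in its original, sequential order (e.g. at the source stage), then sort on key plus sequence ahead of the Remove Duplicates stage.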
vij
Participant
Posts: 131
Joined: Fri Nov 17, 2006 12:43 am

Post by vij »

OK, so this means I can't preserve the same ordering in DataStage?
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

You can, but you have to think very carefully about it.
Michael Gohl
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Dataset1->copy stage->remove duplicate ->dataset2

Is Dataset1 a DataStage dataset or a sequential file FTP'd from the mainframe?

If it is a DataStage dataset, how is it partitioned?

Try using some Peek stages to see the data, or add an output from the Copy stage to a file, to check the order of the data.

Since this is a parallel job using a 4-node configuration file, you will need to control the sorting and partitioning explicitly to guarantee the order of the data as it flows through the job stream and get the results you are looking for.
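[Editor's note] The Peek-stage suggestion can be mimicked in a few lines of Python. Everything here is a toy model, not DataStage itself: `simulate_partitions` stands in for the 4-node hash partitioner and `peek` for a Peek stage, so you can see which rows land on which node and in what arrival order.

```python
from collections import defaultdict

def simulate_partitions(records, key, nodes=4):
    """Toy stand-in for a 4-node hash partitioner (not DataStage itself)."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % nodes].append(rec)
    return parts

def peek(parts, n=10):
    """Rough analogue of a Peek stage: first n rows per node, arrival order."""
    return [f"node {node}: {rec}" for node in sorted(parts)
            for rec in parts[node][:n]]

rows = [{"id": "A", "ver": 1}, {"id": "B", "ver": 9}, {"id": "A", "ver": 2}]
report = peek(simulate_partitions(rows, "id"))
# Rows sharing a key always report the same node; the order within a node is
# whatever order they arrived in, which is exactly what a real Peek exposes.
```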
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Also check whether DataStage is inserting sorts of its own. Dump the score (set the APT_DUMP_SCORE environment variable) to see what DataStage is actually doing, i.e. whether it is inserting sorts or repartitioning in the job stream.
Post Reply