Route duplicate and non-duplicate records from a flat file.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Route duplicate and non-duplicate records from a flat file.

Post by shaonli »

I have a flat file that contains duplicate and non-duplicate records.
I want to route all the duplicate records to one table and all the non-duplicate records to another table in a single job.
E.g.:
SOURCE FILE:
ID Name Marks
1 A 10
1 A 20
2 B 10
2 B 30
2 B 10
3 C 10
4 D 10

NON-DUPLICATE TABLE:
ID Name Marks
3 C 10
4 D 10

DUPLICATE TABLE:
ID Name Marks
1 A 10
1 A 20
2 B 10
2 B 30
2 B 10

Please suggest.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Load them into a UV table. Take two links out of the UV table each grouped by the first two columns; one constrained on HAVING COUNT(*) = 1 and the other constrained on HAVING COUNT(*) > 1. Constraints here are in the SELECT statements that extract rows from the UV table, not in DataStage.
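For illustration, the two extraction queries might look like this in generic SQL (the table name STG_RECS and the column names are my assumptions; use whatever you defined for the UV table):

    -- groups occurring exactly once feed the non-duplicates link
    SELECT ID, NAME FROM STG_RECS
    GROUP BY ID, NAME
    HAVING COUNT(*) = 1;

    -- groups occurring more than once feed the duplicates link
    SELECT ID, NAME FROM STG_RECS
    GROUP BY ID, NAME
    HAVING COUNT(*) > 1;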
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Re: Route duplicate and non-duplicate records from a flat file

Post by Abu@0403 »

Set up a job like this:

1) Dummy File ----------> Hashed File (just create the empty hashed file H1)
                            ^
                            |
                            | lookup
                            |
2) I/P File ------------> Trns ----> H1 (same hashed file)
                                     ^
                                     |
                                     | lookup
                                     |
3) I/P File ---------------------> Trns --------> Duplicates
                                    |
                                    +-----------> Non-Duplicates


In 1), just create an empty hashed file using a dummy input; the input needs only a single dummy field. Give the hashed file two fields (Id, Flag), with Id as the key.

In 2), pass your input sequential file's records into the hashed file (H1). But before writing each record, look up whether its Id is already present in the hashed file: if not, set Flag to "Y", else "N".
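As a sketch, the column derivations in that transformer could look like this (SeqIn and LkpH1 are assumed names for the stream input link and the reference link from H1; NOTFOUND is the link variable Server jobs provide on reference links):

    Id column derivation:   SeqIn.Id
    Flag column derivation: If LkpH1.NOTFOUND Then "Y" Else "N"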

So at the end of 2) your hashed file will contain:

Id Flag
1 N
2 N
3 Y
4 Y

In 3), pass the records which have "Y" to non-duplicates and "N" to duplicates.


Please check whether this will work. If you are not able to understand any part of my explanation, mention it.

The diagram is only rough; it is just meant to give a picture of the job.
Last edited by Abu@0403 on Mon Aug 27, 2007 1:46 am, edited 1 time in total.
----------------
Abu
shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Post by shaonli »

Hi Abu,

Thanks for your suggestion.
But with this option I am not able to track the first record of each duplicate group.
If I go with your option, then
DUPLICATE TABLE WILL HAVE

1 A 20
2 B 30
2 B 10

NON-DUPLICATE TABLE WILL HAVE
1 A 10 ------------------It should not be populated in this table.
2 B 10 ------------------It should not be populated in this table.
3 C 10
4 D 10

The first record of a duplicate group will not be present in the hashed file, so it will be flagged as N (non-duplicate).
The second record will be flagged as Y (duplicate), as that combination (the first record of the duplicate group) will already be present in the hashed file.

So this option is not working properly.
Please suggest.

Thanks
Shaonli
bkumar103
Participant
Posts: 214
Joined: Wed Jul 25, 2007 2:29 am
Location: Chennai

Post by bkumar103 »

An Aggregator and a Transformer can be used to find the duplicate and non-duplicate records.

Suppose file1 is the original sequential file.
Use the Aggregator stage to find the number of occurrences of the records based on the key column. If there is no key, or there are multiple keys, concatenate all the columns using a Transformer before getting the count from the Aggregator.
Now filter the records based on the occurrences: if the count is > 1, route the record to one hashed file, and the rest to the other hashed file.
In another job, do a lookup from the original sequential file against these two hashed files and write to two different files or tables.
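For example, the routing constraints on the two output links after the Aggregator could look like this (AggOut and RecCount are assumed names for the output link and the count column):

    duplicates hashed file:     AggOut.RecCount > 1
    non-duplicates hashed file: AggOut.RecCount = 1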

Hope this will help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I still like my solution.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

I think you have not understood my solution properly.
If you look at my previous reply, I mentioned that the hashed file would contain:

Id Flag
1 N
2 N
3 Y
4 Y

If you are getting these values in your hashed file, it means that Ids 1 and 2 have duplicates, and 3 and 4 are unique records.

So, as per 3), you have to compare this hashed file with the input sequential file using Id as the key.

Then pass the records with Flag "Y" into non-duplicates and "N" into duplicates.

So your output would be

Non-Duplicates
==========

3 C 10
4 D 10

Duplicates
=======

1 A 10
1 A 20
2 B 10
2 B 30
2 B 10

This last job (3 in my previous post) will have two outputs from the transformer, one for duplicates and the other for non-duplicates. You can use the above conditions in the constraints there.
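For example, the two constraints could be written like this (LkpH1 is an assumed name for the reference link from the hashed file):

    non-duplicates link: LkpH1.Flag = "Y"
    duplicates link:     LkpH1.Flag = "N"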

Are you able to understand my solution?
----------------
Abu
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

I forgot to mention this in my previous post.
Since the hashed file has Id as its key, the earlier "Y" for Ids 1 and 2 will be overwritten by "N" when the duplicates arrive. So the hashed file would have only the content that I mentioned.
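For example, tracing Id 1 through step 2:

    1 A 10 arrives, Id 1 not found in H1 -> write Id = 1, Flag = "Y"
    1 A 20 arrives, Id 1 found in H1     -> write Id = 1, Flag = "N" (overwrites the "Y")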

Have I cleared your doubt?
----------------
Abu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The hashed file (or UV table) requires two key columns in my solution.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Post by shaonli »

Thanks, Abu. I am able to do it by following your steps.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

You are welcome.
----------------
Abu