Route duplicate and non-duplicate records from a flat file.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Route duplicate and non-duplicate records from a flat file.

Post by shaonli »

I have a flat file that contains duplicate and non-duplicate records.
I want to route all the duplicate records to one table and all the non-duplicate records to another table in a single job.
E.g.:
SOURCE FILE:
ID Name Marks
1 A 10
1 A 20
2 B 10
2 B 30
2 B 10
3 C 10
4 D 10

NON-DUPLICATE TABLE:
ID Name Marks
3 C 10
4 D 10

DUPLICATE TABLE:
ID Name Marks
1 A 10
1 A 20
2 B 10
2 B 30
2 B 10

Please suggest.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Load them into a UV table. Take two links out of the UV table each grouped by the first two columns; one constrained on HAVING COUNT(*) = 1 and the other constrained on HAVING COUNT(*) > 1. Constraints here are in the SELECT statements that extract rows from the UV table, not in DataStage.
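For illustration, the two extraction queries might look like this in generic SQL (the table name STG_RECS and the column names are my assumptions; use whatever you defined for the UV table):

    -- groups occurring exactly once feed the non-duplicates link
    SELECT ID, NAME FROM STG_RECS
    GROUP BY ID, NAME
    HAVING COUNT(*) = 1;

    -- groups occurring more than once feed the duplicates link
    SELECT ID, NAME FROM STG_RECS
    GROUP BY ID, NAME
    HAVING COUNT(*) > 1;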
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Re: Route duplicate and non-duplicate records from a flat file

Post by Abu@0403 »

Set up a job like this:

1) Dummy File ----------> Hashed File (just create the empty hashed file H1)
                            ^
                            |
                            | lookup
                            |
2) I/P File ------------> Trns ----> H1 (same hashed file)
                                     ^
                                     |
                                     | lookup
                                     |
3) I/P File ---------------------> Trns --------> Duplicates
                                    |
                                    +-----------> Non-Duplicates


In 1), just create an empty hashed file using a dummy input; the input needs only a single dummy field. Give the hashed file two fields (Id, Flag), with Id as the key.

In 2), pass your input sequential file's records into the hashed file (H1). But before writing each record, look up whether its Id is already present in the hashed file: if not, set Flag to "Y", else "N".
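As a sketch, the column derivations in that transformer could look like this (SeqIn and LkpH1 are assumed names for the stream input link and the reference link from H1; NOTFOUND is the link variable Server jobs provide on reference links):

    Id column derivation:   SeqIn.Id
    Flag column derivation: If LkpH1.NOTFOUND Then "Y" Else "N"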

So at the end of 2) your hashed file will contain:

Id Flag
1 N
2 N
3 Y
4 Y

In 3), pass the records which have "Y" to non-duplicates and "N" to duplicates.


Please check whether this will work. If you are not able to understand any part of my explanation, mention it.

The diagram is only rough; it is just meant to give a picture of the job.
Last edited by Abu@0403 on Mon Aug 27, 2007 1:46 am, edited 1 time in total.
----------------
Abu
shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Post by shaonli »

Hi Abu,

Thanks for your suggestion.
But with this option I am not able to track the first record of each duplicate group.
If I go with your option, then
DUPLICATE TABLE WILL HAVE

1 A 20
2 B 30
2 B 10

NON-DUPLICATE TABLE WILL HAVE
1 A 10 ------------------It should not be populated in this table.
2 B 10 ------------------It should not be populated in this table.
3 C 10
4 D 10

The first record of a duplicate group will not be present in the hashed file, so it will be flagged as N (non-duplicate).
The second record will be flagged as Y (duplicate), as that combination (the first record of the duplicate group) will already be present in the hashed file.

So this option is not working properly.
Please suggest.

Thanks
Shaonli
bkumar103
Participant
Posts: 214
Joined: Wed Jul 25, 2007 2:29 am
Location: Chennai

Post by bkumar103 »

An Aggregator and a Transformer can be used to find the duplicate and non-duplicate records.

Suppose file1 is the original sequential file.
Use the Aggregator stage to find the number of occurrences of the records based on the key column. If there is no key, or there are multiple keys, concatenate all the columns using a Transformer before getting the count from the Aggregator.
Now filter the records based on the occurrences: if the count is > 1, route the record to one hashed file, and the rest to the other hashed file.
In another job, do a lookup from the original sequential file against these two hashed files and write to two different files or tables.
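For example, the routing constraints on the two output links after the Aggregator could look like this (AggOut and RecCount are assumed names for the output link and the count column):

    duplicates hashed file:     AggOut.RecCount > 1
    non-duplicates hashed file: AggOut.RecCount = 1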

Hope this will help.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I still like my solution.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

I think you have not understood my solution properly.
If you look at my previous reply, I mentioned that the hashed file would contain:

Id Flag
1 N
2 N
3 Y
4 Y

If you are getting these values in your hashed file, it means that Ids 1 and 2 have duplicates, and 3 and 4 are unique records.

So, as per 3), you have to compare this hashed file with the input sequential file using Id as the key.

Then pass the records with Flag "Y" into non-duplicates and "N" into duplicates.

So your output would be

Non-Duplicates
==========

3 C 10
4 D 10

Duplicates
=======

1 A 10
1 A 20
2 B 10
2 B 30
2 B 10

This last job (3 in my previous post) will have two outputs from the transformer, one for duplicates and the other for non-duplicates. You can use the above conditions in the constraints there.
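For example, the two constraints could be written like this (LkpH1 is an assumed name for the reference link from the hashed file):

    non-duplicates link: LkpH1.Flag = "Y"
    duplicates link:     LkpH1.Flag = "N"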

Are you able to understand my solution?
----------------
Abu
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

I forgot to mention this in my previous post.
Since the hashed file has Id as its key, the earlier "Y" for Ids 1 and 2 will be overwritten by "N" when the duplicates arrive. So the hashed file would have only the content that I mentioned.
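For example, tracing Id 1 through step 2:

    1 A 10 arrives, Id 1 not found in H1 -> write Id = 1, Flag = "Y"
    1 A 20 arrives, Id 1 found in H1     -> write Id = 1, Flag = "N" (overwrites the "Y")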

Have I cleared your doubt?
----------------
Abu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The hashed file (or UV table) requires two key columns in my solution.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
shaonli
Participant
Posts: 80
Joined: Tue Nov 28, 2006 6:52 am

Post by shaonli »

Thanks, Abu. I am able to do it by following your steps.
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

You are welcome.
----------------
Abu