
How to remove duplicate records

Posted: Thu Jun 02, 2005 8:10 am
by Prithivi
Hi,

How can I remove duplicate records using a Sequential File stage component?
My flow is like this:

Sequential stage--->transformer--->sequential stage/Oracle OCI


Suppose I am using a flat file as a source and it has some duplicate records. I need to remove those duplicate records in the Transformer stage and insert the clean records into the target file or target table.

I need your help. Please give me some idea of how to solve this problem. It is very urgent for me.


Regards
Prithivi

Posted: Thu Jun 02, 2005 8:16 am
by ArndW
Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove the duplicate records for you.

If your data is sorted, you can then use a stage variable in a Transformer stage to compare the current record with the previously read one and not pass duplicates on to the subsequent stage.
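
For example, if the whole line is the record, something like this at the UNIX level would sort and dedupe in one pass (the file names here are only placeholders):

Code:
# Sort the extract and drop exact duplicate lines in the same pass.
sort -u input.txt > deduped.txt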

Posted: Thu Jun 02, 2005 8:17 am
by Sainath.Srinivasan
You can do a sort with a stage variable check, or use an Aggregator stage if the data volume is low.

There are many other ways, and everything depends on a detailed analysis of what you are doing and what you wish to achieve.

Posted: Thu Jun 02, 2005 8:43 am
by Prithivi
ArndW wrote: Use a UNIX-level sort (or, if you really want to, a Sort stage) to sort your input data - optionally, the sort program can remove the duplicate records for you.

If your data is sorted, you can then use a stage variable in a Transformer stage to compare the current record with the previously read one and not pass duplicates on to the subsequent stage.

Can you tell me briefly? I have used the Sort stage and I am getting the data in sorted order. After that, how can I check for the duplicate records through the stage variables?

I need more information about it.

Prithivi

Posted: Thu Jun 02, 2005 9:47 am
by amsh76
If your volume is not that high, you can always write the records to a hashed file, but make sure you sort them before writing.

The hashed file will remove the duplicates for you.
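
The effect is roughly what this awk sketch shows (the choice of field 1 as the key is only an assumption): later records with the same key overwrite earlier ones, so only one row per key survives.

Code:
# Keep the last record seen for each value of field 1, similar to the
# destructive overwrite a hashed file does on its key.
# Note: output order is not preserved.
awk '{ seen[$1] = $0 } END { for (k in seen) print seen[k] }' input.txt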

Posted: Thu Jun 02, 2005 9:51 am
by kris
You already have the needed information above. Have you tried what was suggested, or do you want someone to do it for you?

Here is one solution:

Use a filter command (a sort command) in the Sequential File stage.

IN SEQFileStage-------->Xfm-------->OUT SEQFileStage

Open the IN SEQFileStage, click on the Stage tab, and check the 'Stage uses filter command' option. Now click on the Output tab and write your sort command in the Filter command box.

Your sort command should be:
sort -u <positions of sort keys>

You don't have to redirect it to a new file; it will read from stdin.

This filter command will dedupe your input file, and you will write the resulting records to another file.
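
As a concrete sketch: if your file were comma-delimited rather than fixed-width, the modern key syntax would look something like this (the delimiter and key positions are assumptions about your layout):

Code:
# Filter command for the Sequential File stage: reads stdin, writes stdout,
# keeping one line per combination of fields 1 and 2.
sort -t',' -u -k1,1 -k2,2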

Kris~

Posted: Thu Jun 02, 2005 9:52 am
by martin
Hi amsh76,

You can use the RowProcCompareWithPreviousValue routine in a stage variable or as a constraint to remove duplicates.

Good luck :)

Posted: Thu Jun 02, 2005 9:56 am
by ArndW
I would use three stage variables, in the order given:

(a) CurrentValue = {current column or columns concatenated}
(b) SameAsLast = IF (LastValue = CurrentValue) THEN 1 ELSE 0
(c) LastValue = CurrentValue

And in your constraint put NOT(SameAsLast)
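
On sorted input, the same compare-with-previous logic can be sketched as a one-liner (assuming the whole line is the key; the file names are placeholders):

Code:
# Print a line only if it is the first record or differs from the previous one;
# always remember the current line for the next comparison.
awk 'NR == 1 || $0 != prev { print } { prev = $0 }' sorted_input.txt > deduped.txt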

Posted: Thu Jun 02, 2005 10:17 am
by Sainath.Srinivasan
Note: sort -u as such performs a full-row comparison.
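
For example (illustrative only, assuming a blank-delimited file where field 1 is the business key):

Code:
# Full-row comparison: a line is dropped only if it is identical to another.
sort -u data.txt
# Key-only comparison: one line is kept per value of field 1.
sort -u -k1,1 data.txt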

Posted: Thu Jun 02, 2005 1:10 pm
by Sunshine2323
Hi,

Please refer to the post below for more answers:

viewtopic.php?t=92746&highlight=duplicate

Hope this helps :)

Posted: Thu Jun 02, 2005 1:45 pm
by kris
Sainath.Srinivasan wrote: Note: sort -u as such performs a full-row comparison.
We can specify positions as well and dedupe accordingly.

Example on a fixed-width file: sort on two keys in priority order, one from position 45 to 57 and the other from position 1 to 2.

Code:
sort -u +0.44 -0.57 +0.0 -0.3
Kris~