Is there any in built function compare the values in a col

narsingrp · Post by **narsingrp** » Sat Apr 03, 2004 8:19 pm

i need to compare the values in a single column of seq. file and get all the records out which are duplicates.Is there any in built function available to compare the values in a column.

Can any body help me in this .

chulett · Post by **chulett** » Sun Apr 04, 2004 12:04 am

Define 'duplicates'. They share a common key, certain data fields are equal, all fields are equal? The answer could change the approach.

Couple of thoughts...

a) Why not leverage your operating system? UNIX can sort a sequential file and remove duplicates using the 'sort -u' command. You may need to help it out and tell it what parts to consider when figuring uniqueness, otherwise it will only consider the entire row. This could be used in the Filter option of the Sequential Stage to keep everything in your job.

b) You could use a Hash file to remove duplicates based on keys. Hash files are 'destructive overwrites' so last one in wins. Write all rows to it and then when you read from it, all duplicates will have been removed.

c) If your input file is (or can be) sorted, then you can use Stage Variables to check for repeating groups, only allowing the first occurance through the Transformer.

All of these have been discussed here quite a bit, try searching the forums for whichever one seems to fit your needs. And to get back to your original question, no there really isn't a built in function to compare values... unless you count some of the operands like = or <>

Seriously, simply compare values using standard If-Then-Else logic with the appropriate operands. If you need to compare current value with previous value, use a Stage Variable to hold the old value so you've got something to compare to. This is part of the technique mentioned as (c) above.

Good luck. Try to make some of this work for you and if you still need help, come back with specific questions.

ray.wurlod · Post by **ray.wurlod** » Sun Apr 04, 2004 1:41 am

Welcome aboard! :D

Given that it's a single column in a sequential file, preprocess using sort with the -k option to sort by that column.
It's then very easy to use a pair of stage variables to detect when the value in that column changes. This technique has been presented before on this forum - do a search on "duplicate" and "stage variable". Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:

Code: Select all

SV1   InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2   InLink.TheColumn

This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.

chulett · Post by **chulett** » Sun Apr 04, 2004 8:30 am

Sorry, for some silly reason I missed the 'single column' statement.

If you do decide to go the UNIX route and pre-sort the file, take the extra step and add the -u to the sort command. Sorted and "de-duped" all in one swell foop. Keep the Stage Variables in the wings for the next problem.

narsingrp · Post by **narsingrp** » Sun Apr 04, 2004 10:45 am

Thanks Craig and Ray.I appreciate your help.I will try this out.

NEO · Post by **NEO** » Wed Apr 07, 2004 3:57 pm

I have around 10 columns in a Sorted table, in which two columns are unique indexes and what I want to do is I want to compare the data of both the columns with previos rows to eliminate duplicates. Like both columns data shouldn't be equal in two rows.

I tried using the stage variables looking at Ray's solution but couldn't figure out what should I write in the constraint to reject the dups in two columns.

Can any one help me how to write the stage variable and constraint to eliminate duplicates in those two columns.

thanks

Christina Lim · Post by **Christina Lim** » Thu Jul 08, 2004 7:54 pm

Hallo Ray,

Can you please elaborate further on the usage of the stage variables below?

Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:
Code:
SV1 InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2 InLink.TheColumn

This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.

Thanz

ray.wurlod · Post by **ray.wurlod** » Thu Jul 08, 2004 10:41 pm

SV1 detects whether a change has occurred.

SV2 remembers the value from the previous row processed (while SV1 is being evaluated).

dls · Post by **dls** » Fri Jul 09, 2004 6:59 am

NEO, Craig's recommendation to use sort -u should apply in your case, as well. Your sort involves two(2) key fields instead of one(1).

-Dave

sonia jacob · Post by **sonia jacob** » Fri Jul 09, 2004 9:06 am

Using aggregator stage to find the rows with duplicate columns values is one other method that I used. I used the count(coulmn value) > 1 condition to identify the key columns that are being duplicated.

Hope it helps

DSXchange

Is there any in built function compare the values in a col

Is there any in built function compare the values in a col

Constraint to eliminate Duplicates