Is there any in built function compare the values in a col

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
narsingrp
Premium Member
Premium Member
Posts: 37
Joined: Wed Jan 21, 2004 10:38 pm

Is there any in built function compare the values in a col

Post by narsingrp »

i need to compare the values in a single column of seq. file and get all the records out which are duplicates.Is there any in built function available to compare the values in a column.

Can any body help me in this .
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Define 'duplicates'. They share a common key, certain data fields are equal, all fields are equal? The answer could change the approach.

Couple of thoughts...

a) Why not leverage your operating system? UNIX can sort a sequential file and remove duplicates using the 'sort -u' command. You may need to help it out and tell it what parts to consider when figuring uniqueness, otherwise it will only consider the entire row. This could be used in the Filter option of the Sequential Stage to keep everything in your job.

b) You could use a Hash file to remove duplicates based on keys. Hash files are 'destructive overwrites' so last one in wins. Write all rows to it and then when you read from it, all duplicates will have been removed.

c) If your input file is (or can be) sorted, then you can use Stage Variables to check for repeating groups, only allowing the first occurance through the Transformer.

All of these have been discussed here quite a bit, try searching the forums for whichever one seems to fit your needs. And to get back to your original question, no there really isn't a built in function to compare values... unless you count some of the operands like = or <> :wink: Seriously, simply compare values using standard If-Then-Else logic with the appropriate operands. If you need to compare current value with previous value, use a Stage Variable to hold the old value so you've got something to compare to. This is part of the technique mentioned as (c) above.

Good luck. Try to make some of this work for you and if you still need help, come back with specific questions.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D

Given that it's a single column in a sequential file, preprocess using sort with the -k option to sort by that column.
It's then very easy to use a pair of stage variables to detect when the value in that column changes. This technique has been presented before on this forum - do a search on "duplicate" and "stage variable". Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:

Code: Select all

SV1   InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2   InLink.TheColumn
This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sorry, for some silly reason I missed the 'single column' statement. :oops:

If you do decide to go the UNIX route and pre-sort the file, take the extra step and add the -u to the sort command. Sorted and "de-duped" all in one swell foop. Keep the Stage Variables in the wings for the next problem. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
narsingrp
Premium Member
Premium Member
Posts: 37
Joined: Wed Jan 21, 2004 10:38 pm

Post by narsingrp »

Thanks Craig and Ray.I appreciate your help.I will try this out.
NEO
Premium Member
Premium Member
Posts: 163
Joined: Mon Mar 22, 2004 5:49 pm

Constraint to eliminate Duplicates

Post by NEO »

I have around 10 columns in a Sorted table, in which two columns are unique indexes and what I want to do is I want to compare the data of both the columns with previos rows to eliminate duplicates. Like both columns data shouldn't be equal in two rows.

I tried using the stage variables looking at Ray's solution but couldn't figure out what should I write in the constraint to reject the dups in two columns.

Can any one help me how to write the stage variable and constraint to eliminate duplicates in those two columns.

thanks
Christina Lim
Participant
Posts: 74
Joined: Tue Sep 30, 2003 4:25 am
Location: Malaysia

Post by Christina Lim »

Hallo Ray,

Can you please elaborate further on the usage of the stage variables below?
Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:
Code:
SV1 InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2 InLink.TheColumn

This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.
Thanz
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

SV1 detects whether a change has occurred.

SV2 remembers the value from the previous row processed (while SV1 is being evaluated).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dls
Premium Member
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

NEO, Craig's recommendation to use sort -u should apply in your case, as well. Your sort involves two(2) key fields instead of one(1).

-Dave
sonia jacob
Participant
Posts: 122
Joined: Mon Jul 05, 2004 1:33 pm
Location: MA

Post by sonia jacob »

Using aggregator stage to find the rows with duplicate columns values is one other method that I used. I used the count(coulmn value) > 1 condition to identify the key columns that are being duplicated.

Hope it helps
Post Reply