Is there any in built function compare the values in a col
Moderators: chulett, rschirm, roy
Is there any in built function compare the values in a col
i need to compare the values in a single column of seq. file and get all the records out which are duplicates.Is there any in built function available to compare the values in a column.
Can any body help me in this .
Can any body help me in this .
Define 'duplicates'. They share a common key, certain data fields are equal, all fields are equal? The answer could change the approach.
Couple of thoughts...
a) Why not leverage your operating system? UNIX can sort a sequential file and remove duplicates using the 'sort -u' command. You may need to help it out and tell it what parts to consider when figuring uniqueness, otherwise it will only consider the entire row. This could be used in the Filter option of the Sequential Stage to keep everything in your job.
b) You could use a Hash file to remove duplicates based on keys. Hash files are 'destructive overwrites' so last one in wins. Write all rows to it and then when you read from it, all duplicates will have been removed.
c) If your input file is (or can be) sorted, then you can use Stage Variables to check for repeating groups, only allowing the first occurance through the Transformer.
All of these have been discussed here quite a bit, try searching the forums for whichever one seems to fit your needs. And to get back to your original question, no there really isn't a built in function to compare values... unless you count some of the operands like = or <> Seriously, simply compare values using standard If-Then-Else logic with the appropriate operands. If you need to compare current value with previous value, use a Stage Variable to hold the old value so you've got something to compare to. This is part of the technique mentioned as (c) above.
Good luck. Try to make some of this work for you and if you still need help, come back with specific questions.
Couple of thoughts...
a) Why not leverage your operating system? UNIX can sort a sequential file and remove duplicates using the 'sort -u' command. You may need to help it out and tell it what parts to consider when figuring uniqueness, otherwise it will only consider the entire row. This could be used in the Filter option of the Sequential Stage to keep everything in your job.
b) You could use a Hash file to remove duplicates based on keys. Hash files are 'destructive overwrites' so last one in wins. Write all rows to it and then when you read from it, all duplicates will have been removed.
c) If your input file is (or can be) sorted, then you can use Stage Variables to check for repeating groups, only allowing the first occurance through the Transformer.
All of these have been discussed here quite a bit, try searching the forums for whichever one seems to fit your needs. And to get back to your original question, no there really isn't a built in function to compare values... unless you count some of the operands like = or <> Seriously, simply compare values using standard If-Then-Else logic with the appropriate operands. If you need to compare current value with previous value, use a Stage Variable to hold the old value so you've got something to compare to. This is part of the technique mentioned as (c) above.
Good luck. Try to make some of this work for you and if you still need help, come back with specific questions.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Welcome aboard! :D
Given that it's a single column in a sequential file, preprocess using sort with the -k option to sort by that column.
It's then very easy to use a pair of stage variables to detect when the value in that column changes. This technique has been presented before on this forum - do a search on "duplicate" and "stage variable". Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:
This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.
Given that it's a single column in a sequential file, preprocess using sort with the -k option to sort by that column.
It's then very easy to use a pair of stage variables to detect when the value in that column changes. This technique has been presented before on this forum - do a search on "duplicate" and "stage variable". Basically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:
Code: Select all
SV1 InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2 InLink.TheColumn
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sorry, for some silly reason I missed the 'single column' statement.
If you do decide to go the UNIX route and pre-sort the file, take the extra step and add the -u to the sort command. Sorted and "de-duped" all in one swell foop. Keep the Stage Variables in the wings for the next problem.
If you do decide to go the UNIX route and pre-sort the file, take the extra step and add the -u to the sort command. Sorted and "de-duped" all in one swell foop. Keep the Stage Variables in the wings for the next problem.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Constraint to eliminate Duplicates
I have around 10 columns in a Sorted table, in which two columns are unique indexes and what I want to do is I want to compare the data of both the columns with previos rows to eliminate duplicates. Like both columns data shouldn't be equal in two rows.
I tried using the stage variables looking at Ray's solution but couldn't figure out what should I write in the constraint to reject the dups in two columns.
Can any one help me how to write the stage variable and constraint to eliminate duplicates in those two columns.
thanks
I tried using the stage variables looking at Ray's solution but couldn't figure out what should I write in the constraint to reject the dups in two columns.
Can any one help me how to write the stage variable and constraint to eliminate duplicates in those two columns.
thanks
-
- Participant
- Posts: 74
- Joined: Tue Sep 30, 2003 4:25 am
- Location: Malaysia
Hallo Ray,
Can you please elaborate further on the usage of the stage variables below?
Can you please elaborate further on the usage of the stage variables below?
ThanzBasically, define two stage variables (let's call them SV1 and SV2). SV1 determines whether the value has changed, SV2 stores the value from that column from the previous row. Derivations:
Code:
SV1 InLink.TheColumn <> SV2 And @INROWNUM <> 1
SV2 InLink.TheColumn
This works because SV1 is evaluated before SV2. You may need to append an extra, dummy line to your input file (echo IGNORE >> file) to force the final data line to be output.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Participant
- Posts: 122
- Joined: Mon Jul 05, 2004 1:33 pm
- Location: MA