Removing Duplicates In Datastage Px
Hi all,
I have a scenario to remove duplicates.
There are 257573 records coming in from the source table, and when I try to remove duplicates
a) either in the Sort stage, or
b) using the Remove Duplicates stage,
only 41 records are exported.
When I write a select distinct query in SQL, I get all 257573 as distinct.
The key column used for checking duplicates is a direct move from the table to the final data set.
Also, when I send 30 records by restricting the input records, I get 30 records as output.
I mean that the output never exceeds 41 in any case.
Did you make sure your data going into the remove dups stage was hash partitioned and sorted on the right keys?
If you use the Sort stage with the Unique option, then the keys will match those of the hash partitioning.
If you use the Remove Duplicates stage, then you should a) hash-partition on the keys, b) sort on the keys and then (optionally) any additional fields that may impact deduping, and finally c) call remove duplicates on the original keys.
What do I mean in b? Say you have a key of account. For some reason there are multiple start dates for some accounts, and that is what is causing the duplicates. If you know that the earliest date is the one you want to keep, then sort not only on account, but also on the start date. If you sort the date ascending, then the earliest date is first, so in the Remove Duplicates stage keep the first occurrence of the account. Conversely, if you sort descending, then keep the last occurrence in the Remove Duplicates stage.
If you don't care which occurrence you keep, then use the Sort stage to do it all.
That's my 2 cents... Hope it helps.
Brad.
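Brad's account/start-date example boils down to "sort on (key, tiebreaker) ascending, keep the first occurrence per key." A minimal sketch of that logic outside DataStage, in Python — the account values and dates here are made up for illustration:

```python
# Dedup by keeping the earliest start_date per account:
# sort ascending on (account, start_date), then keep the first row seen per key.
rows = [
    {"account": "A1", "start_date": "2006-03-01"},
    {"account": "A1", "start_date": "2005-01-15"},  # earliest for A1
    {"account": "A2", "start_date": "2006-07-09"},
]

# Ascending sort puts the earliest date first within each account.
rows.sort(key=lambda r: (r["account"], r["start_date"]))

deduped = {}
for r in rows:
    # setdefault keeps only the *first* occurrence per account.
    deduped.setdefault(r["account"], r)

result = list(deduped.values())
```

Sorting descending and keeping the last occurrence instead would retain the latest date, mirroring the "Duplicate To Retain: Last" option.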
DSguru2B wrote: I say remove duplicates in the select itself.
Hmm, that would certainly be faster and easier. Unless he has to do something with the duplicates that are identified - good data goes one direction, dup data goes the other. Of course, it would be faster to run 2 SQLs too, one that selects good data and one that selects dups.
You know, DataStage should create a SQL interface to datasets and data streams. If you are pulling data from a database, you can do it easy enough. But what if you are doing just file processing? It would be nice to be able to run ANSI SQL against the data stream.
Maybe Hawk+5?
Brad.
There is a way: load it into a temp database table and go wild with the SQL.
Jokes aside, that is a very good enhancement piece. ODBC can connect to flat files, but that's limited to Windows. For Unix, they want $$$. Someone should come up with a cool utility that can run SQL queries on a flat file.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
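The wished-for "run SQL against a flat file" utility can be approximated today with an embedded database. A rough sketch using Python's standard-library sqlite3 — the file contents and column names here are invented for the demo; a real script would open an actual path instead of the in-memory stand-in:

```python
import csv
import io
import sqlite3

# Stand-in for a flat file; in practice: open("somefile.csv").
flat_file = io.StringIO("key,color\n1,red\n1,blue\n2,red\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (key TEXT, color TEXT)")

reader = csv.reader(flat_file)
next(reader)  # skip the header row
con.executemany("INSERT INTO foo VALUES (?, ?)", reader)

# Now ordinary SQL runs against what started life as a flat file.
distinct_keys = [row[0] for row in
                 con.execute("SELECT DISTINCT key FROM foo ORDER BY key")]
```

Not the ANSI-SQL-on-a-data-stream interface the post asks for, but it covers the flat-file case without $$$ ODBC drivers.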
more info, please Re: Removing Duplicates In Datastage Px
(edited: as per kumar_s, put the * in "select distinct" examples, below.)
You said "key column", as in just one key.
Cool - that makes it easier.
You said when you "select distinct" you're
getting all 257573 as distinct.... Hmm.
Can you post that SQL query?
Ok... Color me skeptical but... do these
queries both return the same value?
Code:
select count(*) from sometable --> 257573
select count(distinct key) from sometable --> 257573
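Those two counts only agree when the key really is unique. A quick sqlite3 sketch of the check (table and column names are illustrative, and the data is deliberately non-unique so the counts diverge):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sometable (key INTEGER, note TEXT)")
con.executemany("INSERT INTO sometable VALUES (?, ?)",
                [(1, "a"), (1, "b"), (2, "c")])  # key 1 appears twice

total = con.execute("SELECT count(*) FROM sometable").fetchone()[0]
distinct = con.execute(
    "SELECT count(DISTINCT key) FROM sometable").fetchone()[0]
# total is 3 but distinct is 2, so the key is NOT unique here.
```

If the poster's table shows 257573 for both queries, the key truly is unique and the 41-row output points at the job, not the data.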
If you were doing just "select distinct * from sometable" then it easily could bring back every row, especially if
you have some kind of seq# or insert-timestamp
in the records - e.g. "distinctness" is based on unique combinations
of all columns for each row in your results set.
Consider these examples
Code:
--- foo ---
SEQ  COLOR
1    red
1    blue
1    green
2    red
2    blue

SELECT DISTINCT * FROM FOO yields this result set:
SEQ  COLOR
1    red
1    blue
1    green
2    red
2    blue

SELECT DISTINCT SEQ FROM FOO yields this result set:
SEQ
1
2

SELECT DISTINCT COLOR FROM FOO would yield this result set:
COLOR
red
blue
green
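The foo example above is easy to verify with an embedded database. A sketch using Python's stdlib sqlite3 (the table mirrors the example exactly; nothing here is from the poster's real data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (seq INTEGER, color TEXT)")
con.executemany("INSERT INTO foo VALUES (?, ?)",
                [(1, "red"), (1, "blue"), (1, "green"),
                 (2, "red"), (2, "blue")])

# Every (seq, color) pair is already unique, so DISTINCT * keeps all 5 rows,
# while DISTINCT on a single column collapses to its unique values.
all_rows = con.execute("SELECT DISTINCT * FROM foo").fetchall()
seqs = con.execute("SELECT DISTINCT seq FROM foo ORDER BY seq").fetchall()
colors = con.execute("SELECT DISTINCT color FROM foo").fetchall()
```

This is exactly why "select distinct *" can return all 257573 rows even when the key column alone has duplicates.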
If you'd like to see your frequency distribution of unique
values, try something like this:
Code:
select count(*) as cnt, key
from sometable
group by key
order by cnt desc, key
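The same frequency-distribution query, run through sqlite3 so you can see the shape of the output — the key values are made up, skewed so the duplicates float to the top:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sometable (key TEXT)")
con.executemany("INSERT INTO sometable VALUES (?)",
                [("a",), ("a",), ("a",), ("b",), ("c",), ("c",)])

# count(*) per key, most-duplicated keys first.
freq = con.execute(
    "SELECT count(*) AS cnt, key FROM sometable "
    "GROUP BY key ORDER BY cnt DESC, key").fetchall()
# freq is [(3, 'a'), (2, 'c'), (1, 'b')]
```

Any key showing a cnt greater than 1 is a genuine duplicate; if every cnt is 1, the dedup stage's 41-row output is a partitioning/settings problem, not a data problem.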
What is your record definition? (I'm reminded of
the mantra, "know your data"^3 ).
What is your job structure -
something like this for testing?
SQL DataSource --> RemoveDups --> SeqFile
What is the SQL that is generated (or specified/built) for
the data source?
In the remove-duplicates stage, what are your settings
for...
Stage.Properties ?
Keys That Define Duplicates:
Options
Duplicate To Retain: ?
Input.Partitioning ?
In particular, what are your:
Partition type:
Sorting: Perform? Stable? Unique?
John G.
bhargava129189 wrote:...when i am trying to remove duplicates
a) either in sort stage.
b) using remove duplicate stage
only 41 records are exported.
While i write a select distinct query in sql i get all 257573 as distinct.
Last edited by jgreve on Thu Jan 11, 2007 4:23 pm, edited 2 times in total.
oops r.e. SELECT DISTINCT FROM TABLE
kumar_s wrote: I still doubt, whether SELECT DISTINCT FROM TABLE will give out records without any syntax error.
That would work better with some column specifiers,
like a * or something (and by "better", I mean, "as opposed
to not at all").
Thanks.
I've tweaked the original example.