Removing Duplicates In Datastage Px
Hi all,
I have a scenario to remove duplicates.
There are 257573 records coming in from the source table, and when I try to remove duplicates
a) either in the Sort stage, or
b) using the Remove Duplicates stage,
only 41 records are exported.
When I write a select distinct query in SQL, I get all 257573 as distinct.
The key column used for checking duplicates is a direct move from the table to the final data set.
Also, when I send 30 records by restricting the input records, I get 30 records as output.
I mean that the output never exceeds 41 in any case.
Did you make sure your data going into the remove dups stage was hash partitioned and sorted on the right keys?
If you use the Sort stage with the Unique option, then the keys will match those of the hash partitioning.
If you use the Remove Duplicates stage, then you should a) hash-partition on the keys, b) sort on the keys and then (optionally) any additional fields that may impact deduping, and finally c) call remove duplicates on the original keys.
What do I mean in b? Say you have a key of account. For some reason there are multiple start dates for some accounts, and that is what is causing the duplicates. If you know that the earliest date is the one you want to keep, then sort not only on account, but also on the start date. If you sort the date ascending, then the earliest date is first, so in the Remove Duplicates stage keep the first occurrence of the account. Conversely, if you sort descending, then keep the last occurrence in the Remove Duplicates stage.
If you don't care which occurrence you keep, then use the Sort stage to do it all.
That's my 2 cents... Hope it helps.
Brad.
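Brad's account/start-date example boils down to "sort on (key, tiebreaker) ascending, keep the first occurrence per key." A minimal sketch of that logic outside DataStage, in Python — the account values and dates here are made up for illustration:

```python
# Dedup by keeping the earliest start_date per account:
# sort ascending on (account, start_date), then keep the first row seen per key.
rows = [
    {"account": "A1", "start_date": "2006-03-01"},
    {"account": "A1", "start_date": "2005-01-15"},  # earliest for A1
    {"account": "A2", "start_date": "2006-07-09"},
]

# Ascending sort puts the earliest date first within each account.
rows.sort(key=lambda r: (r["account"], r["start_date"]))

deduped = {}
for r in rows:
    # setdefault keeps only the *first* occurrence per account.
    deduped.setdefault(r["account"], r)

result = list(deduped.values())
```

Sorting descending and keeping the last occurrence instead would retain the latest date, mirroring the "Duplicate To Retain: Last" option.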
DSguru2B wrote: I say remove duplicates in the select itself.
Hmm, that would certainly be faster and easier. Unless he has to do something with the duplicates that are identified - good data goes one direction, dup data goes the other. Of course, it would be faster to run 2 SQLs too, one that selects good data and one that selects dups.
You know, DataStage should create a SQL interface to datasets and data streams. If you are pulling data from a database, you can do it easy enough. But what if you are doing just file processing? It would be nice to be able to run ANSI SQL against the data stream.
Maybe Hawk+5?
Brad.
There is a way: load it into a temp database table and go wild with the SQL.
Jokes aside, that is a very good enhancement piece. ODBC can connect to flat files, but that's limited to Windows. For Unix, they want $$$. Someone should come up with a cool utility that can run SQL queries on a flat file.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
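The wished-for "run SQL against a flat file" utility can be approximated today with an embedded database. A rough sketch using Python's standard-library sqlite3 — the file contents and column names here are invented for the demo; a real script would open an actual path instead of the in-memory stand-in:

```python
import csv
import io
import sqlite3

# Stand-in for a flat file; in practice: open("somefile.csv").
flat_file = io.StringIO("key,color\n1,red\n1,blue\n2,red\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (key TEXT, color TEXT)")

reader = csv.reader(flat_file)
next(reader)  # skip the header row
con.executemany("INSERT INTO foo VALUES (?, ?)", reader)

# Now ordinary SQL runs against what started life as a flat file.
distinct_keys = [row[0] for row in
                 con.execute("SELECT DISTINCT key FROM foo ORDER BY key")]
```

Not the ANSI-SQL-on-a-data-stream interface the post asks for, but it covers the flat-file case without $$$ ODBC drivers.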
more info, please Re: Removing Duplicates In Datastage Px
(edited: as per kumar_s, put the * in "select distinct" examples, below.)
You said "key column", as in just one key.
Cool - that makes it easier.
You said when you "select distinct" you're
getting all 257573 as distinct.... Hmm.
Can you post that SQL query?
Ok... Color me skeptical but... do these
queries both return the same value?
Code:
select count(*) from sometable --> 257573
select count(distinct key) from sometable --> 257573
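Those two counts only agree when the key really is unique. A quick sqlite3 sketch of the check (table and column names are illustrative, and the data is deliberately non-unique so the counts diverge):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sometable (key INTEGER, note TEXT)")
con.executemany("INSERT INTO sometable VALUES (?, ?)",
                [(1, "a"), (1, "b"), (2, "c")])  # key 1 appears twice

total = con.execute("SELECT count(*) FROM sometable").fetchone()[0]
distinct = con.execute(
    "SELECT count(DISTINCT key) FROM sometable").fetchone()[0]
# total is 3 but distinct is 2, so the key is NOT unique here.
```

If the poster's table shows 257573 for both queries, the key truly is unique and the 41-row output points at the job, not the data.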
If you were doing just "select distinct * from sometable" then it easily could bring back every row, especially if
you have some kind of seq# or insert-timestamp
in the records - e.g. "distinctness" is based on unique combinations
of all columns for each row in your results set.
Consider these examples
Code:
--- foo ---
SEQ  COLOR
1    red
1    blue
1    green
2    red
2    blue

SELECT DISTINCT * FROM FOO yields this result set:
SEQ  COLOR
1    red
1    blue
1    green
2    red
2    blue

SELECT DISTINCT SEQ FROM FOO yields this result set:
SEQ
1
2

SELECT DISTINCT COLOR FROM FOO would yield this result set:
COLOR
red
blue
green
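The foo example above is easy to verify with an embedded database. A sketch using Python's stdlib sqlite3 (the table mirrors the example exactly; nothing here is from the poster's real data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (seq INTEGER, color TEXT)")
con.executemany("INSERT INTO foo VALUES (?, ?)",
                [(1, "red"), (1, "blue"), (1, "green"),
                 (2, "red"), (2, "blue")])

# Every (seq, color) pair is already unique, so DISTINCT * keeps all 5 rows,
# while DISTINCT on a single column collapses to its unique values.
all_rows = con.execute("SELECT DISTINCT * FROM foo").fetchall()
seqs = con.execute("SELECT DISTINCT seq FROM foo ORDER BY seq").fetchall()
colors = con.execute("SELECT DISTINCT color FROM foo").fetchall()
```

This is exactly why "select distinct *" can return all 257573 rows even when the key column alone has duplicates.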
If you'd like to see your frequency distribution of unique
values, try something like this:
Code:
select count(*) as cnt, key
from sometable
group by key
order by cnt desc, key
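The same frequency-distribution query, run through sqlite3 so you can see the shape of the output — the key values are made up, skewed so the duplicates float to the top:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sometable (key TEXT)")
con.executemany("INSERT INTO sometable VALUES (?)",
                [("a",), ("a",), ("a",), ("b",), ("c",), ("c",)])

# count(*) per key, most-duplicated keys first.
freq = con.execute(
    "SELECT count(*) AS cnt, key FROM sometable "
    "GROUP BY key ORDER BY cnt DESC, key").fetchall()
# freq is [(3, 'a'), (2, 'c'), (1, 'b')]
```

Any key showing a cnt greater than 1 is a genuine duplicate; if every cnt is 1, the dedup stage's 41-row output is a partitioning/settings problem, not a data problem.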
What is your record definition? (I'm reminded of
the mantra, "know your data"^3 ).
What is your job structure -
something like this for testing?
SQL DataSource --> RemoveDups --> SeqFile
What is the SQL that is generated (or specified/built) for
the data source?
In the remove-duplicates stage, what are your settings
for...
Stage.Properties ?
Keys That Define Duplicates:
Options
Duplicate To Retain: ?
Input.Partitioning ?
In particular, what are your:
Partition type:
Sorting: Perform? Stable? Unique?
John G.
bhargava129189 wrote:...when i am trying to remove duplicates
a) either in sort stage.
b) using remove duplicate stage
only 41 records are exported.
While i write a select distinct query in sql i get all 257573 as distinct.
Last edited by jgreve on Thu Jan 11, 2007 4:23 pm, edited 2 times in total.
oops r.e. SELECT DISTINCT FROM TABLE
kumar_s wrote: I still doubt, whether SELECT DISTINCT FROM TABLE will give out records without any syntax error.
That would work better with some column specifiers,
like a * or something (and by "better", I mean, "as opposed
to not at all").
Thanks.
I've tweaked the original example.