Removing Duplicates In Datastage Px

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
bhargava129189
Participant
Posts: 18
Joined: Tue Nov 15, 2005 9:36 am

Removing Duplicates In Datastage Px

Post by bhargava129189 »

Hi all,

I have a scenario to remove duplicates.

there are 257573 records coming in from source table and when i am trying to remove duplicates
a) either in sort stage.
b) using remove duplicate stage

only 41 records are exported.
While i write a select distinct query in sql i get all 257573 as distinct.

And the key column used for checking duplicates is a direct move from table to the final data set.

Also when i send 30 records by restricting the input records i get 30 records as output.
I mean that the output is not crossing 41 in any case.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Did you make sure your data going into the remove dups stage was hash partitioned and sorted on the right keys?

If you use the sort stage with teh Unique option, then the keys will match that of the hash partitioning.

If you use the remove dups stage, then you will a) hash on the keys, b) sort on the keys and then (optionally) any additional fields that may impact deduping, and finally c) call the remove dups on the original keys.

What do I mean in b? Say you have a key of account. For some reason there are multiple start dates for some accounts and that is what is causing the duplicates. If you know that the earliest date is the one you want to keep, then sort not only on account, but also on the start date. If you sort the date ascending, then the earliest date is first so in the remove dups stage keep the first occurence of teh account. Conversely, if you sort descending, then keep the last occurence in the remove dups stage.

If you don't care which occurence you keep, then use the sort stage to do it all.

That's my 2 cents... Hope it helps.

Brad.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I say remove duplicates in the select itself.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

DSguru2B wrote:I say remove duplicates in the select itself.
Hmm, that would certainly be faster and easier. Unless he has to do something with the duplicates that are identifies - good data goes one direction, dup data goes the other. Of course, it would be faster to run 2 SQLs too, one that select good data and one that selects dups.

You know, DataStage should create a SQL interface to datasets and data streams. If you are pulling data from a database, you can do it easy enough. But what if you are doing just file processing? It would be nice to be able to run ANSI SQL against the data stream.

Maybe Hawk+5? :)

Brad.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

There is a way, load it into a temp database table and go wild with the sql :wink:
Jokes apart, that is a very good enhancement piece. ODBC can connect to flat files but thats limited to windows. For unix, they want $$$. Someone should come up with a cool utility that can run sql queries on a flat file.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You mean like Oracle external tables? :P
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Exaclty like Oracle External Tables.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
jgreve
Premium Member
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

more info, please Re: Removing Duplicates In Datastage Px

Post by jgreve »

(edited: as per kumar_s, put the * in "select distinct" examples, below.)

You said "key column", as in just one key.
Cool - that makes it easier.

You said when you "select distinct" you're
getting all 257573 as distinct.... Hmm.
Can you post that SQL query?
Ok... Color me skeptical :) but... do these
queries both return the same value?

Code: Select all

   select count(*) from sometable  --> 257573 
   select count(distinct key) from sometable --> 257573 
If you were doing just "select distinct * from sometable"
then it easilly could bring back every row, especially if
you have some kind of seq# or insert-timestamp
in the records - e.g. "distinctness" is based on unique combinations
of all columns for each row in your results set.
Consider these examples

Code: Select all

--- foo ---   . SELECT DISTINCT * FROM FOO . SELECT DISTINCT SEQ FROM FOO
SEQ   COLOR   . yields this result set:    . yields this result set:
 1    red     .   SEQ   COLOR              .    SEQ
 1    blue    .    1    red                .     1
 1    green   .    1    blue               .     2
 2    red     .    1    green              .
 2    blue    .    2    red                . SELECT DISTINCT COLOR FROM FOO
              .    2    blue               . would yield this result set:
              .                            .    COLOR
              .                            .    red
              .                            .    blue
              .                            .    green
....................................................................


If you'd like to see your frequency distribution of unique
values, try something like this:

Code: Select all

select count(*) as cnt, key
from sometable
group by key
order by cnt desc, key

What is your record definition? (I'm reminded of
the mantra, "know your data"^3 ).



What is your job structure,
smth like this for testing?

SQL DataSource --> RemoveDups --> SeqFile

What is the SQL that is generated (or specified/built) for
the data source?

In the remove-duplicates stage, what are you settings
for...
Stage.Properties ?
Keys That Define Duplicates:
Options
Duplicate To Retain: ?

Input.Partitioning ?
In particular, what are your:
Partition type:
Sorting: Perform? Stable? Unique?

John G.
bhargava129189 wrote:...when i am trying to remove duplicates
a) either in sort stage.
b) using remove duplicate stage

only 41 records are exported.
While i write a select distinct query in sql i get all 257573 as distinct.
Last edited by jgreve on Thu Jan 11, 2007 4:23 pm, edited 2 times in total.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

I still doubt, whether SELECT DISTINCT FROM TABLE will give out records without any syntax error.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
jgreve
Premium Member
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

oops r.e. SELECT DISTINCT FROM TABLE

Post by jgreve »

kumar_s wrote:I still doubt, whether SELECT DISTINCT FROM TABLE will give out records without any syntax error.
oops :oops:
That would work better with some column specifiers,
like a * or smth (and by "better", I mean, "as opposed
to not at all").

Thanks.
I've tweaked the original example.
Post Reply