
Ignore duplicated rows

Posted: Wed Jun 16, 2004 5:37 am
by xcasals
Hi all,

I'm trying to ignore the duplicated rows from an input. I mean, this is the input:

AAAA,BBBB,CCCC
AAAA,BBBB,CCCC
AAAA,BBBB,CCCC
XXX,YYY,ZZZ
XXX,YYY,ZZZ
XXX,YYY,ZZZ

and this is the output I want to get:

AAAA,BBBB,CCCC
XXX,YYY,ZZZ


It seems easy, but I am a newbie.

Thanks.

Posted: Wed Jun 16, 2004 6:06 am
by denzilsyb
You could write the input into a Hashed File stage and make all the columns key columns. A write with a key that already exists overwrites the earlier record, so that will get rid of the duplicate records.

dnzl
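
The keyed-write idea above can be illustrated outside DataStage. This is only a minimal Python sketch, not DataStage code: a dictionary stands in for the hashed file, and the function name `dedupe_by_key` is made up for the illustration. Every column is treated as part of the key, so writing the same row again simply overwrites the earlier copy.

```python
import csv

def dedupe_by_key(lines):
    """Keep one row per distinct key, imitating writes to a keyed store."""
    store = {}
    for row in csv.reader(lines):
        key = tuple(row)   # all columns together form the key
        store[key] = row   # a second write with the same key overwrites the first
    return [",".join(row) for row in store.values()]

sample = [
    "AAAA,BBBB,CCCC",
    "AAAA,BBBB,CCCC",
    "AAAA,BBBB,CCCC",
    "XXX,YYY,ZZZ",
    "XXX,YYY,ZZZ",
    "XXX,YYY,ZZZ",
]

for line in dedupe_by_key(sample):
    print(line)
```

A Python dictionary keeps rows in first-seen order. Do not assume the same of a hashed file; sort afterwards if the output order matters.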

Re: Ignore duplicated rows

Posted: Wed Jun 16, 2004 6:08 am
by vigneshra
xcasals wrote: I'm trying to ignore the duplicated rows from an input.

You need to sort the data on some key, which can be done with a Sort stage. The sorted data is then fed to a Remove Duplicates stage, which gives you the required output. Since I am new to DataStage, could any experts please correct me if I am wrong? :roll:
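
The sort-then-drop-adjacent-duplicates idea can be sketched in plain Python. This is an illustration of the technique only, not of any DataStage stage; the function name `sort_and_dedupe` is hypothetical, and the whole row is assumed to be the sort key.

```python
from itertools import groupby

def sort_and_dedupe(rows):
    """Sort the rows, then keep one row from each run of equal rows."""
    ordered = sorted(rows)  # sorting makes duplicates adjacent
    # groupby yields one (key, group) pair per run of equal neighbours
    return [key for key, _group in groupby(ordered)]

rows = [
    "XXX,YYY,ZZZ",
    "AAAA,BBBB,CCCC",
    "XXX,YYY,ZZZ",
    "AAAA,BBBB,CCCC",
]

for row in sort_and_dedupe(rows):
    print(row)
```

The sort is what makes the second step cheap: once equal rows sit next to each other, each row only has to be compared with its predecessor.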

Re: Ignore duplicated rows

Posted: Wed Jun 16, 2004 8:46 am
by degraciavg
xcasals wrote: I'm trying to ignore the duplicated rows from an input.
If your input data comes from a relational database, you can add the DISTINCT keyword to your SELECT query. That way the database removes the duplicates and you avoid extra steps in the job.
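
As a rough sketch of the DISTINCT approach, the following Python snippet uses an in-memory SQLite database. The table name `src` and the column names `c1`, `c2`, `c3` are invented for the example; the real query depends on your source table.

```python
import sqlite3

def distinct_rows(rows):
    """Load rows into a scratch table and read them back with SELECT DISTINCT."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE src (c1 TEXT, c2 TEXT, c3 TEXT)")
        conn.executemany("INSERT INTO src VALUES (?, ?, ?)", rows)
        query = "SELECT DISTINCT c1, c2, c3 FROM src ORDER BY c1, c2, c3"
        return conn.execute(query).fetchall()
    finally:
        conn.close()

data = [("AAAA", "BBBB", "CCCC")] * 3 + [("XXX", "YYY", "ZZZ")] * 3

for row in distinct_rows(data):
    print(",".join(row))
```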

Posted: Wed Jun 16, 2004 9:48 am
by KeithM
If your input is an ODBC stage, rather than switching to user-defined SQL just to specify the DISTINCT keyword, you could go to the Columns tab and group by all of your columns. This has the same effect as DISTINCT and gives you the results that you want.
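
To see why grouping by every column matches DISTINCT, here is a small Python check against an in-memory SQLite database. It is only a sketch: the table `src`, the columns `c1`, `c2`, `c3`, and the helper `run_query` are assumptions made for the example, and the SQL that the ODBC stage generates may look different.

```python
import sqlite3

def run_query(rows, query):
    """Load rows into a scratch table and return the result of the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE src (c1 TEXT, c2 TEXT, c3 TEXT)")
        conn.executemany("INSERT INTO src VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()

data = [("AAAA", "BBBB", "CCCC")] * 3 + [("XXX", "YYY", "ZZZ")] * 3

# Grouping by all selected columns collapses identical rows into one group each.
grouped = run_query(
    data, "SELECT c1, c2, c3 FROM src GROUP BY c1, c2, c3 ORDER BY c1, c2, c3"
)
distinct = run_query(
    data, "SELECT DISTINCT c1, c2, c3 FROM src ORDER BY c1, c2, c3"
)

print(grouped == distinct)
```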

Re: Ignore duplicated rows

Posted: Wed Jun 16, 2004 11:54 am
by jseclen
Hi,

If your input is a sequential file, you can use stage variables. This assumes the rows arrive sorted, so that duplicates are adjacent.

Define these in the Transformer stage, in this order:

Dupli = If (LastField = ActualField) Then 1 Else 0
LastField = If (LastField <> ActualField) Then ActualField Else LastField

In the output link constraint, define the following condition:

Dupli = 0

The output file will then contain the desired records.
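
The stage-variable logic above can be traced with a short Python emulation. This is a sketch of the technique rather than DataStage code: the function name `filter_adjacent_duplicates` is hypothetical, each whole row plays the role of ActualField, and LastField is assumed to start out empty.

```python
def filter_adjacent_duplicates(rows):
    """Emulate the Dupli / LastField stage variables row by row."""
    last_field = None  # assumed initial value of LastField
    output = []
    for actual_field in rows:
        # Dupli is evaluated first, while last_field still holds the previous row
        dupli = 1 if last_field == actual_field else 0
        # then LastField is updated for the next row
        last_field = actual_field
        if dupli == 0:  # the constraint Dupli = 0
            output.append(actual_field)
    return output

sorted_rows = [
    "AAAA,BBBB,CCCC",
    "AAAA,BBBB,CCCC",
    "AAAA,BBBB,CCCC",
    "XXX,YYY,ZZZ",
    "XXX,YYY,ZZZ",
    "XXX,YYY,ZZZ",
]

for row in filter_adjacent_duplicates(sorted_rows):
    print(row)
```

Two points show up in the emulation. The order of evaluation matters, because Dupli must be computed before LastField is overwritten. The comparison also only looks at the previous row, so duplicates that are not adjacent pass through; this is why the input needs to be sorted first.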

Re: Ignore duplicated rows

Posted: Wed Jun 16, 2004 4:36 pm
by ray.wurlod
vigneshra wrote: Based on some key you need to sort the data which can be done through a sorter stage. Then the sorted data is fed to a duplicates remover stage which gives you the required output. Since I am new to DataStage, any experts please correct me if I am wrong :roll:
The Remove Duplicates stage type is not available for server jobs; it is only available for parallel jobs.