To capture duplicate rows

Post questions here related to DataStage Server Edition, covering areas such as Server job design, DS BASIC, Routines, Job Sequences, etc.

Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

To capture duplicate rows

Post by Abhinav »

Hi all,

I have a source file with the following values.

Each row has a column that serves as its row number:

1
2
3
3
3
4
5

I am using stage variables to capture these duplicate rows.

It works in that I can capture the rows with value 3 as duplicates, but the issue is that it captures the last two rows with value 3.

What I need is to capture the first two rows where the value is 3 and skip the last one: the rows are sorted in ascending order, so the last record is the latest one, and that is the record I have to keep.

Any solution would be appreciated.

Thanks

Abhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Show us the initialization and derivation expressions for your stage variables.

You should be able to do this fairly easily, given that your data are sorted.

Code:

svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue       InLink.columnname
The stage variables must be evaluated in that order.
Your output constraint is simply the (Boolean) value of svSeenAlready.
If you don't want multiples, run that through an Aggregator stage grouping by that column.
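
To illustrate the same logic outside DataStage, here is a minimal Python sketch of the seen-already check for sorted input (the sample values are taken from the question; the variable names are illustrative only):

Code:

# Sketch of the svSeenAlready / svPrevValue logic for sorted input.
rows = [1, 2, 3, 3, 3, 4, 5]

prev_value = None          # plays the role of svPrevValue
duplicates = []            # rows that satisfy the svSeenAlready constraint

for row_num, value in enumerate(rows, start=1):
    seen_already = row_num != 1 and value <= prev_value   # svSeenAlready
    if seen_already:
        duplicates.append(value)
    prev_value = value                                     # svPrevValue

print(duplicates)   # [3, 3] -- the second and third occurrences of 3
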
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pramod_au
Participant
Posts: 30
Joined: Thu Feb 06, 2003 8:30 am
Location: London,UK

Re: To capture duplicate rows

Post by pramod_au »

Hi
Try this out

Sequential File ---> Transformer1 ---> Aggregator ---> Transformer2 (count > 1) ---> Sequential File with duplicate records
Transformer1
=========
Define a column called count and assign the value 1 to every row read from the sequential file.

In the Aggregator, group by the key column and sum the count column.

In Transformer2, apply the constraint count > 1 and write the result to a sequential file.

Data Transformation

Input data
1
2
3
3
4

After Transformer1
1,1
2,1
3,1
3,1
4,1

After Aggregator
1,1
2,1
3,2
4,1

After Transformer2 (constraint: count > 1)
3,2
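
For reference, here is a minimal Python sketch of the same count-and-filter idea (the data come from the walkthrough above; the names are illustrative, not actual DataStage objects):

Code:

# Sketch of the Transformer1 -> Aggregator -> Transformer2 flow in plain Python.
from collections import Counter

rows = [1, 2, 3, 3, 4]

# Transformer1 + Aggregator: attach a count of 1 per row and sum per key value.
counts = Counter(rows)                 # counts are {1: 1, 2: 1, 3: 2, 4: 1}

# Transformer2: keep only key values whose count exceeds 1.
duplicate_keys = {key: n for key, n in counts.items() if n > 1}

print(duplicate_keys)                  # {3: 2}
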

Thanks
Pramod
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

... which is functionally equivalent to SQL

Code:

select columnname, count(*)
from filename
group by columnname
having count(*) > 1 ;
which you could execute, if the source is a file, via the ODBC driver for text files.
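
If you just want to sanity-check that query, here is a small Python sketch that runs the equivalent statement against an in-memory SQLite table (the table name t and column name columnname are assumptions for the example; the original suggestion is the ODBC driver for text files):

Code:

# Run the equivalent GROUP BY / HAVING query against an in-memory SQLite table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (columnname INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(v,) for v in (1, 2, 3, 3, 3, 4, 5)])

query = """
    SELECT columnname, COUNT(*)
    FROM t
    GROUP BY columnname
    HAVING COUNT(*) > 1
"""
print(conn.execute(query).fetchall())   # [(3, 3)]
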

There are many other solutions for this particular problem.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
alexysflores
Participant
Posts: 18
Joined: Mon Jan 12, 2004 7:20 am
Location: USA

Re: To capture duplicate rows

Post by alexysflores »

There is more than one way to skin a cat; my advice is to use DataStage's built-in approach of comparing the current row with the previous row's value.

Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

Post by Abhinav »

ray.wurlod wrote:
Code:

svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue       InLink.columnname

The stage variables must be evaluated in that order. Your output constraint is simply the (Boolean) value of svSeenAlready.

Hi Ray

I tried this; it still captures the last two records with value 3.

But I need to capture the first two records with value 3.

Is there any other solution I can try?

Here is what I had:

Code:

  sv1          InLink.columnname = sv2
  sv2          InLink.columnname 
My constraint is sv1.

This works fine to capture the last two records, leaving the first, but I need the first two records, leaving the last one.

Thanks

Abhi
Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

Re: To capture duplicate rows

Post by Abhinav »

alexysflores wrote: There is more than one way to skin a cat; my advice is to use DataStage's built-in approach of comparing the current row with the previous row's value.
This helps to capture the last two duplicate rows, since it compares each row with the previous value, but how do I capture the previous record instead of the current record that triggers the compare-with-previous check?

This wouldn't capture the first record in my example.

Is there any solution for this?

Thanks

Abhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

None of these approaches can output the previous row's value directly, because all of the stage variables have already been evaluated (and svPrevValue updated to the current row's value) before any output link is processed. You could introduce an additional stage variable.

Code:

svPriorValue      svPrevValue
svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1 
svPrevValue       InLink.columnname
Use svPriorValue as an output column derivation expression.
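
As a minimal Python sketch of this three-variable arrangement (sample data from the question; names mirror the stage variables and are illustrative only), emitting the prior value when a duplicate is detected reports the earlier occurrences while the latest record passes through:

Code:

# Sketch of svPriorValue / svSeenAlready / svPrevValue evaluated in that order.
rows = [1, 2, 3, 3, 3, 4, 5]

prior_value = None      # svPriorValue
prev_value = None       # svPrevValue
captured = []           # values derived from svPriorValue on duplicate rows

for row_num, value in enumerate(rows, start=1):
    prior_value = prev_value                               # svPriorValue
    seen_already = row_num != 1 and value <= prev_value    # svSeenAlready
    prev_value = value                                     # svPrevValue
    if seen_already:
        captured.append(prior_value)   # emit the previous row's value

print(captured)   # [3, 3] -- corresponds to the first two occurrences of 3
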
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.