To capture duplicate rows

Post questions here related to DataStage Server Edition, covering areas such as Server job design, DS BASIC, Routines, Job Sequences, etc.

Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

To capture duplicate rows

Post by Abhinav »

Hi all,

I have a source file with the following values.

Each row has a column that serves as its row number:

1
2
3
3
3
4
5

I am using stage variables to capture these duplicate rows.

It works in that I can capture the rows with value 3 as duplicates, but the issue is that it captures the last two rows with value 3.

What I need is to capture the first two rows where the value is 3 and skip the last one: the rows are sorted in ascending order, so the last record is the latest one, and that is the record I have to keep.

Any solution would be appreciated.

Thanks

Abhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Show us the initialization and derivation expressions for your stage variables.

You should be able to do this fairly easily, given that your data are sorted.

Code:

svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue       InLink.columnname
The stage variables must be evaluated in that order.
Your output constraint is simply the (Boolean) value of svSeenAlready.
If you don't want multiples, run that through an Aggregator stage grouping by that column.
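
To illustrate the same logic outside DataStage, here is a minimal Python sketch of the seen-already check for sorted input (the sample values are taken from the question; the variable names are illustrative only):

Code:

# Sketch of the svSeenAlready / svPrevValue logic for sorted input.
rows = [1, 2, 3, 3, 3, 4, 5]

prev_value = None          # plays the role of svPrevValue
duplicates = []            # rows that satisfy the svSeenAlready constraint

for row_num, value in enumerate(rows, start=1):
    seen_already = row_num != 1 and value <= prev_value   # svSeenAlready
    if seen_already:
        duplicates.append(value)
    prev_value = value                                     # svPrevValue

print(duplicates)   # [3, 3] -- the second and third occurrences of 3
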
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pramod_au
Participant
Posts: 30
Joined: Thu Feb 06, 2003 8:30 am
Location: London,UK

Re: To capture duplicate rows

Post by pramod_au »

Hi
Try this out

Sequential File ---> Transformer1 ---> Aggregator ---> Transformer2 (count > 1) ---> Sequential File with duplicate records
Transformer1
=========
Define a column called count and assign the value 1 to every row read from the sequential file.

In the Aggregator, group by the key column and sum the count column.

In Transformer2, apply the constraint count > 1 and write the result to a sequential file.

Data Transformation

Input data
1
2
3
3
4

After Transformer1
1,1
2,1
3,1
3,1
4,1

After Aggregator
1,1
2,1
3,2
4,1

After Transformer2 (constraint: count > 1)
3,2
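
For reference, here is a minimal Python sketch of the same count-and-filter idea (the data come from the walkthrough above; the names are illustrative, not actual DataStage objects):

Code:

# Sketch of the Transformer1 -> Aggregator -> Transformer2 flow in plain Python.
from collections import Counter

rows = [1, 2, 3, 3, 4]

# Transformer1 + Aggregator: attach a count of 1 per row and sum per key value.
counts = Counter(rows)                 # counts are {1: 1, 2: 1, 3: 2, 4: 1}

# Transformer2: keep only key values whose count exceeds 1.
duplicate_keys = {key: n for key, n in counts.items() if n > 1}

print(duplicate_keys)                  # {3: 2}
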

Thanks
Pramod
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

... which is functionally equivalent to SQL

Code:

select columnname, count(*)
from filename
group by columnname
having count(*) > 1 ;
which you could execute, if the source is a file, via the ODBC driver for text files.
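
If you just want to sanity-check that query, here is a small Python sketch that runs the equivalent statement against an in-memory SQLite table (the table name t and column name columnname are assumptions for the example; the original suggestion is the ODBC driver for text files):

Code:

# Run the equivalent GROUP BY / HAVING query against an in-memory SQLite table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (columnname INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(v,) for v in (1, 2, 3, 3, 3, 4, 5)])

query = """
    SELECT columnname, COUNT(*)
    FROM t
    GROUP BY columnname
    HAVING COUNT(*) > 1
"""
print(conn.execute(query).fetchall())   # [(3, 3)]
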

There are many other solutions for this particular problem.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
alexysflores
Participant
Posts: 18
Joined: Mon Jan 12, 2004 7:20 am
Location: USA

Re: To capture duplicate rows

Post by alexysflores »

There is more than one way to skin a cat; my advice is to use DataStage's built-in approach of comparing the current row with the previous row's value.

Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

Post by Abhinav »

ray.wurlod wrote:
Code:

svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue       InLink.columnname

The stage variables must be evaluated in that order. Your output constraint is simply the (Boolean) value of svSeenAlready.

Hi Ray

I tried this; it still captures the last two records with value 3.

But I need to capture the first two records with value 3.

Is there any other solution I can try?

Here is what I had:

Code:

  sv1          InLink.columnname = sv2
  sv2          InLink.columnname 
My constraint is sv1.

This works fine to capture the last two records, leaving the first, but I need the first two records, leaving the last one.

Thanks

Abhi
Abhinav
Premium Member
Posts: 65
Joined: Tue Jun 29, 2004 10:26 am
Location: California

Re: To capture duplicate rows

Post by Abhinav »

alexysflores wrote: There is more than one way to skin a cat; my advice is to use DataStage's built-in approach of comparing the current row with the previous row's value.
This helps to capture the last two duplicate rows, since it compares each row with the previous value, but how do I capture the previous record instead of the current record that triggers the compare-with-previous check?

This wouldn't capture the first record in my example.

Is there any solution for this?

Thanks

Abhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

None of these approaches can output the previous row's value directly, because all of the stage variables have already been evaluated (and svPrevValue updated to the current row's value) before any output link is processed. You could introduce an additional stage variable.

Code:

svPriorValue      svPrevValue
svSeenAlready     InLink.columnname <= svPrevValue And @INROWNUM <> 1 
svPrevValue       InLink.columnname
Use svPriorValue as an output column derivation expression.
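
As a minimal Python sketch of this three-variable arrangement (sample data from the question; names mirror the stage variables and are illustrative only), emitting the prior value when a duplicate is detected reports the earlier occurrences while the latest record passes through:

Code:

# Sketch of svPriorValue / svSeenAlready / svPrevValue evaluated in that order.
rows = [1, 2, 3, 3, 3, 4, 5]

prior_value = None      # svPriorValue
prev_value = None       # svPrevValue
captured = []           # values derived from svPriorValue on duplicate rows

for row_num, value in enumerate(rows, start=1):
    prior_value = prev_value                               # svPriorValue
    seen_already = row_num != 1 and value <= prev_value    # svSeenAlready
    prev_value = value                                     # svPrevValue
    if seen_already:
        captured.append(prior_value)   # emit the previous row's value

print(captured)   # [3, 3] -- corresponds to the first two occurrences of 3
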
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.