Page 1 of 1
To capture duplicate rows
Posted: Fri Nov 26, 2004 1:45 pm
by Abhinav
Hi all,
I have source file with the values as follows.
each row has one of its column as row no
1
2
3
3
3
4
5
I am using stage variables to capture these duplicate rows.
It works fine as i can capture the rows with No 3 as duplicates, but the only issue is it captures the last two rows with the row no 3.
what i need is to capture the first two rows where row no is 3 skipping the last row. as all the rows are sorted in ascending order. i have to capture the latest record which is the last record.
Any solution would be appreciated.
Thanks
Abhi
Posted: Fri Nov 26, 2004 3:26 pm
by ray.wurlod
Show us the initialization and derivation expressions for your stage variables.
You should be able to do this fairly easily, given that your data are sorted.
Code: Select all
svSeenAlready InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue InLink.columnname
The stage variables must be evaluated in that order.
Your output constraint is simply the (Boolean) value of svSeenAlready.
If you don't want multiples, run that through an Aggregator stage grouping by that column.
Re: To capture duplicate rows
Posted: Sat Nov 27, 2004 10:55 am
by pramod_au
Hi
Try this out
Sequential File ---> Transformer1 ---------------> Aggregator ----------------> Transformer2(count >1) --- > seq with Duplicate records
Transformer1
=========
Define a column called count assign value 1 to all the rows that u read from the sequential file.
In the aggregrator group by the column (1,2,3,3,4,5) and sum the count.
In the Transformer2 have constraint count column > 1 write it into seq file.
Data Transformation
Input data
1
2
3
3
4
After Transformer1
1,1
2,1
3,1
3,1
4,1
After Aggregator
1,1
2,1
3,2
4,1
after Transformer2 (constraint: count =1)
3,2
Thanks
Pramod
Posted: Sat Nov 27, 2004 3:56 pm
by ray.wurlod
... which is functionally equivalent to SQL
Code: Select all
select columnname, count(*)
from filename
group by columnname
having count(*) > 1 ;
which you could execute, if the source is a file, via the ODBC driver for text files.
There are many other solutions for this particular problem.
Re: To capture duplicate rows
Posted: Sun Nov 28, 2004 1:25 pm
by alexysflores
You can skin a cat with different solution, my advise use datastage built in function called compare previous row functions.
[quote="Abhinav"]Hi all,
I have source file with the values as follows.
each row has one of its column as row no
1
2
3
3
3
4
5
I am using stage variables to capture these duplicate rows.
It works fine as i can capture the rows with No 3 as duplicates, but the only issue is it captures the last two rows with the row no 3.
what i need is to capture the first two rows where row no is 3 skipping the last row. as all the rows are sorted in ascending order. i have to capture the latest record which is the last record.
Any solution would be appreciated.
Thanks
Abhi[/quote]
Posted: Mon Nov 29, 2004 10:10 am
by Abhinav
ray.wurlod wrote:Show us the initialization and derivation expressions for your stage variables.
You should be able to do this fairly easily, given that your data are sorted.
Code: Select all
svSeenAlready InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue InLink.columnname
The stage variables must be evaluated in that order.
Your output constraint is simply the (Boolean) value of svSeenAlready.
If you don't want multiples, run that through an Aggregator stage grouping by that column.
Hi Ray
I tried this, it still captures the last two records of No 3.
but i need to capture the first two records of No 3.
Any other solution i can try.
Here is what i had,
Code: Select all
sv1 InLink.columnname = sv2
sv2 InLink.columnname
My constraint is sv1.
This works fine to capture the last two records leaving the first, but i need the first two records leaving the last one.
Thanks
Abhi
Re: To capture duplicate rows
Posted: Mon Nov 29, 2004 10:28 am
by Abhinav
alexysflores wrote:You can skin a cat with different solution, my advise use datastage built in function called compare previous row functions.
This helps to capture the last two duplicate rows as it is comparing with previous value, but how do i capture the previous record instead of capturing the current record which checks ( comparewith previous value)
This wouldnt capture the first record of my example.
Any solun for this.
Thanks
Abhi
Posted: Mon Nov 29, 2004 2:50 pm
by ray.wurlod
None of these approaches using stage variables can output the previous value, because all stage variables have been evaluated before any output link is processed. You could introduce an additional stage variable.
Code: Select all
svPriorValue svPrevValue
svSeenAlready InLink.columnname <= svPrevValue And @INROWNUM <> 1
svPrevValue InLink.columnname
Use svPriorValue as an output column derivation expression.