Ignoring Duplicates while reading from Seq File
Moderators: chulett, rschirm, roy
Hi all,
I have Seq file like this
ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ
While I am reading, if a ColA value has duplicates (123 in the example), it should read only the row where ColB = 'ABC' and ignore the record where ColB = 'CDE'.
Please throw some suggestions on how to solve this scenario.
Thanks
Chowdary
Hi
Thanks for the reply. I got part of it, but I didn't understand the part about "using a stage variable to remember the first line of each ColA element". Can you please explain this in more detail?
thank you very much
WoMaWil wrote: First number your rows, then sort your file by ColA and RowNum, use a stage variable to remember the first line of each ColA element, and put ColA and the memorized ColB in a hashed file. Thereafter read the hashed file and you have it.
Wolfgang
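For illustration, Wolfgang's recipe can be sketched outside DataStage. Below is a minimal Python stand-in, using the sample data from the question; the dict plays the role of the hashed file and the `last_col_a` variable plays the role of the stage variable:

```python
# Number the rows, sort by (ColA, RowNum), remember the last ColA seen,
# and write the first row of each ColA group to a dict standing in for
# the hashed file.
rows = [("123", "ABC"), ("234", "CDE"), ("123", "CDE"), ("456", "HIJ")]

numbered = list(enumerate(rows))              # RowNum preserves input order
numbered.sort(key=lambda r: (r[1][0], r[0])) # sort by ColA, then RowNum

hashed = {}
last_col_a = None                            # stage-variable analogue
for _, (col_a, col_b) in numbered:
    if col_a != last_col_a:                  # first line of this ColA group
        hashed[col_a] = col_b
    last_col_a = col_a

print(hashed)  # {'123': 'ABC', '234': 'CDE', '456': 'HIJ'}
```

The secondary sort on RowNum is what guarantees 'ABC' (the first occurrence in the file) wins over 'CDE' within the 123 group.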
Chowdary,
this subject has been handled in this forum several times before. Also, it might be a good idea at this point in time to revisit your self-assessment in your user profile.
If you create two stage variables in the following order:
Code: Select all
ColAIsTheSameAsLastRow    IF LastColA = In.ColA THEN 1 ELSE 0
LastColA                  In.ColA
then you can use the value of the first to branch your logic.
Or use the Aggregator stage: group by the key columns and use the FIRST derivation for all other attribute columns. This effectively gives you the first row, as long as you handle NULL values properly, e.g. with a substitute value to get non-nullable (numeric) columns through the Aggregator and then substitute back to NULL afterwards.
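The FIRST-per-group idea with the NULL round-trip can be sketched in Python. The numeric third column and the -1 sentinel below are assumptions added purely to illustrate the substitute-and-restore step; they are not part of the original example:

```python
from itertools import groupby

# Hypothetical non-nullable numeric column added to show the NULL round-trip.
rows = [("123", "ABC", None), ("234", "CDE", 7),
        ("123", "CDE", 8), ("456", "HIJ", None)]

SENTINEL = -1  # substitute value so "NULL" survives the aggregation

safe = [(a, b, SENTINEL if n is None else n) for a, b, n in rows]
safe.sort(key=lambda r: r[0])                 # groupby needs sorted input
first = [next(g) for _, g in groupby(safe, key=lambda r: r[0])]

# Substitute the sentinel back to NULL on the way out.
result = [(a, b, None if n == SENTINEL else n) for a, b, n in first]
# result == [('123', 'ABC', None), ('234', 'CDE', 7), ('456', 'HIJ', None)]
```

Python's sort is stable, so within the 123 group the original file order is preserved and the 'ABC' row is taken first.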
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hi Chowdary,
You can also use the transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the value from the previous row.
If they are the same, 1 is returned; otherwise 0 is returned.
In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.ColA)
On data sorted by ColA, the output will be
ColA ColB Return_Val
123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0
To get the unique records, use a constraint on the output link: Return_Val = 0.
This routine should not be used in more than one place in a job.
Go through the documentation for this routine's limitations.
regards
kcs
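The behaviour described above can be sketched in Python. This is not the DataStage routine itself, just a stand-in that compares each value with the one from the previous row, applied to data already sorted by ColA (the routine only catches duplicates that arrive on adjacent rows):

```python
def compare_with_previous(values):
    """Return 1 when a value equals the one on the previous row, else 0
    (a stand-in for RowProcCompareWithPreviousValue)."""
    prev = object()          # sentinel: nothing precedes the first row
    flags = []
    for v in values:
        flags.append(1 if v == prev else 0)
        prev = v
    return flags

# Input already sorted by ColA, as in the final solution in this thread.
rows = [("123", "ABC"), ("123", "CDE"), ("234", "CDE"), ("456", "HIJ")]
flags = compare_with_previous([a for a, _ in rows])   # [0, 1, 0, 0]

# The constraint Return_Val = 0 keeps only the first row per ColA.
unique = [r for r, f in zip(rows, flags) if f == 0]
# unique == [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```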
Premium Member | Posts: 892 | Joined: Thu Oct 16, 2003 5:18 am
Hi,
You can load the data into a table and then select the distinct records.
This is a costly process in terms of database resources, but it is beneficial if millions of records are present.
The solution of performing a sort and comparing with the previous value could be very slow for millions of records.
Regards
Sreeni
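For illustration, the load-and-select approach can be sketched with an in-memory SQLite table (an assumption; any database would do). Note that a plain SELECT DISTINCT would not help with this data, because ColB differs within the duplicate key; picking the first-loaded row per ColA needs something like MIN(rowid):

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # stand-in for the real database
conn.execute("CREATE TABLE staging (col_a TEXT, col_b TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("123", "ABC"), ("234", "CDE"),
                  ("123", "CDE"), ("456", "HIJ")])

# One row per col_a: MIN(rowid) selects the first-loaded row of each key.
unique = conn.execute("""
    SELECT col_a, col_b FROM staging
    WHERE rowid IN (SELECT MIN(rowid) FROM staging GROUP BY col_a)
    ORDER BY col_a
""").fetchall()
# unique == [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```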
Sreeni,
but in this case a sort on a sequential file is going to be more efficient than loading the sequential file into a database table, doing a select distinct on that table, and bringing the data back out of the database. There is no possibility of the database solution being faster, whether for 10 or 10 million records.
There are certainly going to be cases where loading data into an interim database table might be more efficient, but in the example posted in this thread using a temporary DB table to remove duplicates isn't a viable alternative.
Hi all,
Thanks, guys, for all your inputs. I used the routine RowProcCompareWithPreviousValue on sorted data and it seems to be working fine; the users are validating the data. Hopefully it will work.
Once again, thanks for all your suggestions.
Thanks
Chowdary
kcshankar wrote:Hi Chowdary,
You can also use the transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the value from the previous row.
If they are the same, 1 is returned; otherwise 0 is returned.
In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.ColA)
On data sorted by ColA, the output will be
ColA ColB Return_Val
123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0
To get the unique records, use a constraint on the output link: Return_Val = 0.
This routine should not be used in more than one place in a job.
Code: Select all
I have Seq file like this
ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ
While I am reading, if a ColA value has duplicates (123 in the example), it should read only the row where ColB = 'ABC' and ignore the record where ColB = 'CDE'.
Please throw some suggestions on how to solve this scenario.
Go thru the documents for this routine's limitations.
regards
kcs
I agree with this, and there is no need to go to a database table here. Obviously the RowProcCompareWithPreviousValue method worked for you, and I also think the other idea, sorting and then writing to a hashed file, is a good one. The ability to have a primary key on the hashed file is all you need; nothing else that comes with a traditional database is required.
ArndW wrote:Sreeni,
but in this case a sort on a sequential file is going to be more efficient that loading the sequential file into a database table and then doing a select distinct on that table and bringing the data back out of the database. There is no possibility of the database solution being faster, whether or 10 or 10 million records.
There are certainly going to be cases where loading data into an interim database table might be more efficient, but in the example posted in this thread using a temporary DB table to remove duplicates isn't a viable alternative.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.
"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
Hi ArndW,
Thanks for your inputs.
I am somehow not convinced by your reasoning that sorting data in a sequential file is faster than the database solution.
I believe that sorting using a database is faster in the case of millions of records, rather than doing the sort on a sequential file.
Regards
Sreeni
Sreeni,
it is very easy for you to reproduce both scenarios; please do so to remove your doubts. That way we aren't dealing with theories or suppositions but with cold, hard facts.
I look forward to hearing your results.
p.s. This reminds me of my high-school chemistry teacher. I was doing lab work and told him as part of some experiment that I would melt some sodium chloride (table salt). Instead of telling me I was wrong he asked me to show him - and laughed as I melted and destroyed the test tube while the sodium chloride remained happily in crystalline form. I wouldn't have learned that lesson as well had he just explained to me about melting points.