Ignoring Duplicates while reading from Seq File
Moderators: chulett, rschirm, roy
Hi all,
I have Seq file like this
ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ
While I am reading, if a ColA value has duplicates (123 in the example), it should read only the row where ColB = 'ABC' and ignore the record where ColB = 'CDE'.
Please throw some suggestions on how to solve this scenario.
Thanks
Chowdary
Hi
Thanks for the reply. I got part of it, but I didn't understand the part about "using a stage variable to remember the first line of each ColA element". Can you please explain this in more detail?
thank you very much
WoMaWil wrote: First number your rows, then sort your file by ColA and RowNum, use a stage variable to remember the first line of each ColA element, and put ColA and the memorized ColB in a hashed file. Thereafter read the hashed file and you have it.
Wolfgang
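For illustration, Wolfgang's recipe can be sketched outside DataStage. Below is a minimal Python stand-in, using the sample data from the question; the dict plays the role of the hashed file and the `last_col_a` variable plays the role of the stage variable:

```python
# Number the rows, sort by (ColA, RowNum), remember the last ColA seen,
# and write the first row of each ColA group to a dict standing in for
# the hashed file.
rows = [("123", "ABC"), ("234", "CDE"), ("123", "CDE"), ("456", "HIJ")]

numbered = list(enumerate(rows))              # RowNum preserves input order
numbered.sort(key=lambda r: (r[1][0], r[0])) # sort by ColA, then RowNum

hashed = {}
last_col_a = None                            # stage-variable analogue
for _, (col_a, col_b) in numbered:
    if col_a != last_col_a:                  # first line of this ColA group
        hashed[col_a] = col_b
    last_col_a = col_a

print(hashed)  # {'123': 'ABC', '234': 'CDE', '456': 'HIJ'}
```

The secondary sort on RowNum is what guarantees 'ABC' (the first occurrence in the file) wins over 'CDE' within the 123 group.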
Chowdary,
this subject has been handled in this forum several times before. Also, it might be a good idea at this point in time to revisit your self-assessment in your user profile.
If you create two stage variables in the following order:
Code: Select all
ColAIsTheSameAsLastRow    IF LastColA = In.ColA THEN 1 ELSE 0
LastColA                  In.ColA
then you can use the value of the first to branch your logic.
Or use the Aggregator stage: group by the key columns and use the FIRST derivation for all other attribute columns. This effectively gives you the first row, as long as you handle NULL values properly, e.g. with a substitute value to get non-nullable (numeric) columns through the Aggregator and then substitute back to NULL afterwards.
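The FIRST-per-group idea with the NULL round-trip can be sketched in Python. The numeric third column and the -1 sentinel below are assumptions added purely to illustrate the substitute-and-restore step; they are not part of the original example:

```python
from itertools import groupby

# Hypothetical non-nullable numeric column added to show the NULL round-trip.
rows = [("123", "ABC", None), ("234", "CDE", 7),
        ("123", "CDE", 8), ("456", "HIJ", None)]

SENTINEL = -1  # substitute value so "NULL" survives the aggregation

safe = [(a, b, SENTINEL if n is None else n) for a, b, n in rows]
safe.sort(key=lambda r: r[0])                 # groupby needs sorted input
first = [next(g) for _, g in groupby(safe, key=lambda r: r[0])]

# Substitute the sentinel back to NULL on the way out.
result = [(a, b, None if n == SENTINEL else n) for a, b, n in first]
# result == [('123', 'ABC', None), ('234', 'CDE', 7), ('456', 'HIJ', None)]
```

Python's sort is stable, so within the 123 group the original file order is preserved and the 'ABC' row is taken first.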
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hi Chowdary,
You can also use the transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the value from the previous row.
If they are the same, 1 is returned; otherwise 0 is returned.
In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.ColA)
On data sorted by ColA, the output will be
ColA ColB Return_Val
123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0
To get the unique records, use a constraint on the output link: Return_Val = 0.
This routine should not be used in more than one place in a job.
Go through the documentation for this routine's limitations.
regards
kcs
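The behaviour described above can be sketched in Python. This is not the DataStage routine itself, just a stand-in that compares each value with the one from the previous row, applied to data already sorted by ColA (the routine only catches duplicates that arrive on adjacent rows):

```python
def compare_with_previous(values):
    """Return 1 when a value equals the one on the previous row, else 0
    (a stand-in for RowProcCompareWithPreviousValue)."""
    prev = object()          # sentinel: nothing precedes the first row
    flags = []
    for v in values:
        flags.append(1 if v == prev else 0)
        prev = v
    return flags

# Input already sorted by ColA, as in the final solution in this thread.
rows = [("123", "ABC"), ("123", "CDE"), ("234", "CDE"), ("456", "HIJ")]
flags = compare_with_previous([a for a, _ in rows])   # [0, 1, 0, 0]

# The constraint Return_Val = 0 keeps only the first row per ColA.
unique = [r for r, f in zip(rows, flags) if f == 0]
# unique == [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```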
Premium Member | Posts: 892 | Joined: Thu Oct 16, 2003 5:18 am
Hi,
You can load the data into a table and then select the distinct records.
This is a costly process in terms of database resources, but it is beneficial if millions of records are present.
The solution of performing a sort and comparing with the previous value could be very slow for millions of records.
Regards
Sreeni
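For illustration, the load-and-select approach can be sketched with an in-memory SQLite table (an assumption; any database would do). Note that a plain SELECT DISTINCT would not help with this data, because ColB differs within the duplicate key; picking the first-loaded row per ColA needs something like MIN(rowid):

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # stand-in for the real database
conn.execute("CREATE TABLE staging (col_a TEXT, col_b TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("123", "ABC"), ("234", "CDE"),
                  ("123", "CDE"), ("456", "HIJ")])

# One row per col_a: MIN(rowid) selects the first-loaded row of each key.
unique = conn.execute("""
    SELECT col_a, col_b FROM staging
    WHERE rowid IN (SELECT MIN(rowid) FROM staging GROUP BY col_a)
    ORDER BY col_a
""").fetchall()
# unique == [('123', 'ABC'), ('234', 'CDE'), ('456', 'HIJ')]
```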
Sreeni,
but in this case a sort on a sequential file is going to be more efficient than loading the sequential file into a database table, doing a select distinct on that table, and bringing the data back out of the database. There is no possibility of the database solution being faster, whether for 10 or 10 million records.
There are certainly going to be cases where loading data into an interim database table might be more efficient, but in the example posted in this thread using a temporary DB table to remove duplicates isn't a viable alternative.
Hi all,
Thanks, guys, for all your inputs. I used the routine RowProcCompareWithPreviousValue on sorted data and it seems to be working fine; the users are validating the data. Hopefully it will work.
Once again, thanks for all your suggestions.
Thanks
Chowdary
kcshankar wrote:Hi Chowdary,
You can also use the transform function RowProcCompareWithPreviousValue to ignore duplicate values.
The routine compares the supplied value with the value from the previous row.
If they are the same, 1 is returned; otherwise 0 is returned.
In the Transformer stage, declare a stage variable StageVar with the derivation
RowProcCompareWithPreviousValue(link.ColA)
On data sorted by ColA, the output will be
ColA ColB Return_Val
123 ABC 0
123 CDE 1
234 CDE 0
456 HIJ 0
To get the unique records, use a constraint on the output link: Return_Val = 0.
This routine should not be used in more than one place in a job.
Code: Select all
I have Seq file like this
ColA ColB
123 ABC
234 CDE
123 CDE
456 HIJ
While I am reading, if a ColA value has duplicates (123 in the example), it should read only the row where ColB = 'ABC' and ignore the record where ColB = 'CDE'.
Please throw some suggestions on how to solve this scenario.
Go thru the documents for this routine's limitations.
regards
kcs
I agree with this, and there is no need to go to a database table here. Obviously the RowProcCompareWithPreviousValue method worked for you, and I also think the other idea, sorting and then writing to a hashed file, is a good one. The ability to have a primary key on the hashed file is all you need; nothing else that comes with a traditional database is required.
ArndW wrote:Sreeni,
but in this case a sort on a sequential file is going to be more efficient that loading the sequential file into a database table and then doing a select distinct on that table and bringing the data back out of the database. There is no possibility of the database solution being faster, whether or 10 or 10 million records.
There are certainly going to be cases where loading data into an interim database table might be more efficient, but in the example posted in this thread using a temporary DB table to remove duplicates isn't a viable alternative.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.
"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
Hi ArndW,
Thanks for your inputs.
I am somehow not convinced by your reasoning that sorting data in a sequential file is faster than the database solution.
I believe that sorting using a database is faster in the case of millions of records, rather than doing the sort on a sequential file.
Regards
Sreeni
Sreeni,
it is very easy for you to reproduce both scenarios; please do so to remove your doubts. That way we aren't dealing with theories or suppositions but with cold, hard facts.
I look forward to hearing your results.
p.s. This reminds me of my high-school chemistry teacher. I was doing lab work and told him as part of some experiment that I would melt some sodium chloride (table salt). Instead of telling me I was wrong he asked me to show him - and laughed as I melted and destroyed the test tube while the sodium chloride remained happily in crystalline form. I wouldn't have learned that lesson as well had he just explained to me about melting points.