Hi,
Can anyone let me know the most efficient way to remove duplicates and store them?
I can achieve this by using a Transformer with a stage variable, or by using a Sort stage with a key change column.
The Transformer becomes sequential and sorting takes a long time.
Is there any other way?
There are only two basic methods of detecting duplicates: either sort the data by the column in question and compare adjacent rows, or keep the whole contents of the file in memory as it is being read and do a comparison (i.e. an array or table scan). Either method may be quicker and more efficient depending upon how well sorted the data already is, the amount of duplication, and the number of records.
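To make the second method concrete, here is a minimal Python sketch (illustrative only, not DataStage code; rows as a list of dicts and key as a column name are placeholder names) of the in-memory table scan:

```python
def split_by_scan(rows, key):
    # In-memory "table scan": remember every key value seen so far.
    seen, uniques, duplicates = set(), [], []
    for row in rows:
        if row[key] in seen:
            duplicates.append(row)   # key already encountered: a duplicate
        else:
            seen.add(row[key])
            uniques.append(row)      # first occurrence of this key
    return uniques, duplicates
```

This needs no sort, but it holds every key in memory, which is the trade-off described above.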
Sorting algorithms are generally efficient, and sorted data has other benefits, so I tend to prefer sorting and using a Transformer stage variable.
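A rough Python equivalent of that preference (again a sketch, not actual Transformer code): sort first, then a previous-key variable plays the role of the stage variable:

```python
def split_sorted(rows, key):
    # Sort, then compare each row to its predecessor; prev stands in
    # for the Transformer stage variable holding the last key seen.
    uniques, duplicates = [], []
    prev = object()                  # sentinel that equals no real key
    for row in sorted(rows, key=lambda r: r[key]):
        if row[key] == prev:
            duplicates.append(row)   # same key as the previous row
        else:
            uniques.append(row)      # key changed: first of its group
        prev = row[key]
    return uniques, duplicates
```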
I prefer a fork join design, where one leg of the fork counts the number of rows associated with each key value; this count is joined back to the original row and used to filter based on count = 1 or otherwise.
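In Python terms, the fork-join idea amounts to something like this sketch (fork_join_split is a made-up name; in the job the counting leg would be an Aggregator and the filter a Transformer or Filter stage):

```python
from collections import Counter

def fork_join_split(rows, key):
    # One leg of the fork counts rows per key value (the Aggregator leg).
    counts = Counter(row[key] for row in rows)
    # The join brings the count back to each original row; filter on it.
    uniques    = [row for row in rows if counts[row[key]] == 1]
    duplicates = [row for row in rows if counts[row[key]] > 1]
    return uniques, duplicates
```

Note this definition of "duplicate" keeps only keys that occur exactly once, which differs slightly from the key-change approach below (which always keeps the first row of each group).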
Possibly easier is to generate a Key Change column in the Sort stage; duplicates have this set to 0.
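A sketch of what that Key Change flag does (assuming input already sorted on the key; keyChange mirrors the column the Sort stage generates):

```python
def add_key_change(sorted_rows, key):
    # Mimic the Sort stage's Key Change column: 1 on the first row of
    # each key group, 0 on every following (duplicate) row.
    prev = object()                  # sentinel that equals no real key
    for row in sorted_rows:
        row["keyChange"] = 1 if row[key] != prev else 0
        prev = row[key]
    return sorted_rows
```

Downstream, rows with keyChange = 1 go to the de-duplicated output and rows with keyChange = 0 go to the stored-duplicates output.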
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hey Ray,
Thanks for the response. Sometimes when I use a fork-join, the job just runs for a long time; I mean, it never finishes. But I think I can implement it as a two-step process with temporary storage in a Dataset.
But I would still require an Aggregator and a Filter/Transformer to implement this. When the volume is huge, won't the Aggregator be time-consuming?
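For what it's worth, the two-step idea reduces to something like this sketch (a dict stands in for the temporary Dataset; rows is placeholder data):

```python
rows = [{"key": "A"}, {"key": "B"}, {"key": "A"}]

# Step 1 (Aggregator): count rows per key and land the counts in
# temporary storage -- a dict here, a Dataset in the actual job.
counts = {}
for row in rows:
    counts[row["key"]] = counts.get(row["key"], 0) + 1

# Step 2 (Filter/Transformer): re-read the source and route each row
# on its stored count.
uniques    = [row for row in rows if counts[row["key"]] == 1]
duplicates = [row for row in rows if counts[row["key"]] > 1]
```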
I would avoid using a fork join; it is possible that your job will wait forever. See the Advanced Developer's Guide, Chapter 4, Link Buffering:
Deadlock situations can occur where you have a fork-join in your job. This is where a stage has two output links whose data paths are joined together later in the job. The situation can arise where all the stages in the flow are waiting for each other to read or write, so none of them can proceed.
I have seen smaller jobs finish ok with a fork join, as DataStage inserts buffers when a fork join is used. If you want to go with the fork join design you may need to alter the buffer handling for the input links on the join stage.
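The stall is easy to reproduce outside DataStage. A minimal Python simulation (bounded queues stand in for link buffers; timeouts are added only so the demo terminates instead of hanging):

```python
import queue
import threading

link1 = queue.Queue(maxsize=2)   # tiny buffers stand in for link buffers
link2 = queue.Queue(maxsize=2)

def fork_stage():
    # The forking stage writes every row to BOTH output links.
    try:
        for row in range(100):
            link1.put(row, timeout=1)
            link2.put(row, timeout=1)   # stalls once link2's buffer fills
    except queue.Full:
        print("fork stage stalled: downstream is not reading this link")

def join_stage():
    # This join drains link1 completely before touching link2, so the
    # fork blocks writing link2 while the join waits on link1.
    try:
        for _ in range(100):
            link1.get(timeout=1)
        for _ in range(100):
            link2.get(timeout=1)
    except queue.Empty:
        print("join stage stalled: upstream is not writing this link")

producer = threading.Thread(target=fork_stage)
producer.start()
join_stage()
producer.join()
```

Larger buffers, or a different buffering policy on the join's input links, delay or avoid the stall, which is what the buffer-handling change described above amounts to.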