I am using the Remove Duplicates stage to remove duplicates and an internal sort (Perform Sort) for sorting the data. When I use the internal sort (Perform Sort), it gives a warning message as shown below.
Rem_Dups: When checking operator: User inserted sort "Rem_Dups.To_RemDup_Sort" does not fulfill the sort requirements of the downstream operator "Rem_Dups"
Number of records = 1 million
Partition type: Hash
How can I resolve this warning?
Thanks in advance.
Thanks for your advice. I found the solution for it. But I have one more doubt.
Here are the four cases:
1) I used SeqFile ---> RemDup stage ---> Peek.
In this case I processed the data without sorting it, yet it gave the expected results.
2) I used the internal sort (Perform Sort) and processed the same data. It gave the same results as the case above.
3) I used an external Sort stage before the RemDup stage and processed the data; again I got the same result.
4) In this case I used SeqFile ---> Sort stage ---> Peek. I did not use a RemDup stage at all, yet I got the same results as in the cases above. I set "Allow Duplicates = False".
My doubts:
1) I am able to remove the duplicates using a single Sort stage. Why do we need the RemDup stage at all?
2) In case 1, I was able to remove the duplicates without sorting the data. Is it not necessary to sort the data before removing duplicates?
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?
My doubt may be a bit long, but please read patiently and give an answer.
2) In case 1, I was able to remove the duplicates without sorting the data. Is it not necessary to sort the data before removing duplicates?
If your data happens to be already in order, then it works this way. Check whether the identical key values are adjacent to each other. Try rearranging the input so that identical key values are not adjacent to each other, and verify the result.
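To see why adjacency matters, here is a minimal sketch in plain Python (an analogue, not DataStage itself; the function name and data are made up for illustration). De-duplication on a stream works like Unix `uniq`: a record is dropped only when its key matches the immediately preceding record's key, so unsorted data only looks fully de-duplicated when the duplicates happen to sit next to each other.

```python
def remove_adjacent_dups(records, key):
    """Drop a record only when its key equals the previous record's key."""
    out = []
    prev = object()  # sentinel that matches no real key
    for rec in records:
        if key(rec) != prev:
            out.append(rec)
        prev = key(rec)
    return out

adjacent = [("A", 1), ("A", 2), ("B", 3)]    # duplicate keys adjacent
scattered = [("A", 1), ("B", 3), ("A", 2)]   # duplicate keys separated

print(remove_adjacent_dups(adjacent, key=lambda r: r[0]))
# -> [('A', 1), ('B', 3)]            duplicate removed
print(remove_adjacent_dups(scattered, key=lambda r: r[0]))
# -> [('A', 1), ('B', 3), ('A', 2)]  duplicate survives: data not sorted
```

This is why the warning appears when the upstream sort does not satisfy the downstream operator's sort requirement: without sorted input, duplicates can silently slip through.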
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?
An external Sort stage makes your job much easier for others to understand.
It also gives you the option of sorting only on secondary keys when the data is already sorted on the primary key. This can improve performance and save temporary disk space.
Hi,
If I remember correctly, Arnd gave a wonderful example for this issue.
Say you want to vacate your house and move to another place with all your household goods.
Say you have a big container lorry to carry all your goods, plus a racing car and your BMW.
Since you have to carry your goods in the container lorry, would you choose the same vehicle for your own travel?
Or would you choose the racing car for the normal roads?
Every vehicle is designed for its own purpose.
A racing car for normal travel, or a lorry for racing, won't be efficient.
Prefer the explicit Sort stage, unless you have constraints with the stages.
Again, it all depends on the circumstances.
Sorting is done on the primary keys and, within each primary key, on the secondary keys. When the data is already sorted on the primary keys, only the secondary keys within each primary-key group need to be sorted.
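The subsort idea above can be sketched in plain Python (an analogue of the Sort stage's "previously sorted" key option, not DataStage itself; the variable names are illustrative). Because the input is already ordered on the primary key, we only sort the secondary key inside each primary-key group, which means many small sorts instead of one big one.

```python
from itertools import groupby
from operator import itemgetter

# Already sorted on the primary key (column 0), but not on column 1.
rows = [("A", 3), ("A", 1), ("B", 2), ("B", 0)]

subsorted = []
for _, group in groupby(rows, key=itemgetter(0)):       # group by primary key
    subsorted.extend(sorted(group, key=itemgetter(1)))  # sort secondary key only

print(subsorted)
# -> [('A', 1), ('A', 3), ('B', 0), ('B', 2)]
```

Each group fits easily in memory, which is why a subsort can be faster and need less scratch space than re-sorting the full dataset on both keys.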
Did you try working on your doubt no. 2? If so, did you find out the reason how it allowed you to remove duplicates even though the dataset was not sorted?
Did you try working on your doubt no. 2? If so, did you find out the reason how it allowed you to remove duplicates even though the dataset was not sorted?
I checked the source file; it is not in sorted form. But the job is producing the correct output.
I don't know what the reason is.
Any assistance would be appreciated.
Hi Rajiv,
If possible, post a sample of the input data you used, the result you got, and the expected result.
It will be easier to analyse.
Provided the set of data you use is small in number.