I am using the Remove Duplicates stage to remove duplicates and an internal sort (Perform Sort) for sorting the data. When I use the internal sort (Perform Sort), it gives a warning message as shown below.
Rem_Dups: When checking operator: User inserted sort "Rem_Dups.To_RemDup_Sort" does not fulfill the sort requirements of the downstream operator "Rem_Dups"
Number of records = 1 million
Partition type: Hash
How can I resolve this warning?
Thanks in advance.
Thanks for your advice. I found the solution for it. But I have one more doubt.
Here are the four cases:
1) I used SeqFile ---> RemDup stage ---> Peek.
In this case I processed the data without sorting it, yet it gave the expected results.
2) I used the internal sort (Perform Sort) and processed the same data. It gave the same results as the case above.
3) I used an external Sort stage before the RemDup stage and processed the data; again I got the same result.
4) In this case I used SeqFile ---> Sort stage ---> Peek. I did not use a RemDup stage at all, yet I got the same results as in the cases above. I set "Allow Duplicates = False".
My doubts:
1) I am able to remove the duplicates using a single Sort stage. Why do we need the RemDup stage at all?
2) In case 1, I was able to remove the duplicates without sorting the data. Is it not necessary to sort the data before removing duplicates?
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?
My doubt may be a bit long, but please read patiently and give an answer.
2) In case 1, I was able to remove the duplicates without sorting the data. Is it not necessary to sort the data before removing duplicates?
If your data happens to be already in order, then it works this way. Check whether the identical key values are adjacent to each other. Try rearranging the input so that identical key values are not adjacent to each other, and verify the result.
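To see why adjacency matters, here is a minimal sketch in plain Python (an analogue, not DataStage itself; the function name and data are made up for illustration). De-duplication on a stream works like Unix `uniq`: a record is dropped only when its key matches the immediately preceding record's key, so unsorted data only looks fully de-duplicated when the duplicates happen to sit next to each other.

```python
def remove_adjacent_dups(records, key):
    """Drop a record only when its key equals the previous record's key."""
    out = []
    prev = object()  # sentinel that matches no real key
    for rec in records:
        if key(rec) != prev:
            out.append(rec)
        prev = key(rec)
    return out

adjacent = [("A", 1), ("A", 2), ("B", 3)]    # duplicate keys adjacent
scattered = [("A", 1), ("B", 3), ("A", 2)]   # duplicate keys separated

print(remove_adjacent_dups(adjacent, key=lambda r: r[0]))
# -> [('A', 1), ('B', 3)]            duplicate removed
print(remove_adjacent_dups(scattered, key=lambda r: r[0]))
# -> [('A', 1), ('B', 3), ('A', 2)]  duplicate survives: data not sorted
```

This is why the warning appears when the upstream sort does not satisfy the downstream operator's sort requirement: without sorted input, duplicates can silently slip through.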
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?
An external Sort stage makes your job much easier for others to understand.
It also gives you the option of sorting only on secondary keys when the data is already sorted on the primary key. This can improve performance and save temporary disk space.
Hi,
If I remember correctly, Arnd gave a wonderful example for this issue.
Say you want to vacate your house and move to another place with all your household goods.
Say you have a big container lorry to carry all your goods, plus a racing car and your BMW.
Since you have to carry your goods in the container lorry, would you choose the same vehicle for your own travel?
Or would you choose the racing car for the normal roads?
Every vehicle is designed for its own purpose.
A racing car for normal travel, or a lorry for racing, won't be efficient.
Prefer the explicit Sort stage, unless you have constraints with the stages.
Again, it all depends on the circumstances.
Sorting is done on the primary keys and, within each primary key, on the secondary keys. When the data is already sorted on the primary keys, only the secondary keys within each primary-key group need to be sorted.
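The subsort idea above can be sketched in plain Python (an analogue of the Sort stage's "previously sorted" key option, not DataStage itself; the variable names are illustrative). Because the input is already ordered on the primary key, we only sort the secondary key inside each primary-key group, which means many small sorts instead of one big one.

```python
from itertools import groupby
from operator import itemgetter

# Already sorted on the primary key (column 0), but not on column 1.
rows = [("A", 3), ("A", 1), ("B", 2), ("B", 0)]

subsorted = []
for _, group in groupby(rows, key=itemgetter(0)):       # group by primary key
    subsorted.extend(sorted(group, key=itemgetter(1)))  # sort secondary key only

print(subsorted)
# -> [('A', 1), ('A', 3), ('B', 0), ('B', 2)]
```

Each group fits easily in memory, which is why a subsort can be faster and need less scratch space than re-sorting the full dataset on both keys.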
Did you try working on your doubt no. 2? If so, did you find out the reason how it allowed you to remove duplicates even though the dataset was not sorted?
Did you try working on your doubt no. 2? If so, did you find out the reason how it allowed you to remove duplicates even though the dataset was not sorted?
I checked the source file; it is not in sorted form. But the job is producing the correct output.
I don't know what the reason is.
Any assistance would be appreciated.
Hi Rajiv,
If possible, post a sample of the input data you used, the result you got, and the expected result.
It will be easier to analyse.
Provided the set of data you use is small in number.