Remove Duplicates using Internal Sort

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Remove Duplicates using Internal Sort

Post by ravij »

Hi,

I am using the Remove Duplicates stage to remove duplicates, with the internal sort (Perform Sort) to sort the data. When I use the internal sort it gives the warning message below:
Rem_Dups: When checking operator: User inserted sort "Rem_Dups.To_RemDup_Sort" does not fulfill the sort requirements of the downstream operator "Rem_Dups"
No. of records: 1 million
Partition type: Hash
How can I achieve this?
Thanks in advance.
Ravi
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,

The keys you chose for the sort and for the Remove Duplicates stage are probably not identical.
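The condition behind that warning can be sketched in Python (illustrative only, this is not how DataStage states it): the Remove Duplicates keys need to be a leading prefix of the upstream sort keys, in the same order, or duplicates will not arrive adjacent. The column names below are made up for the example.

```python
def sort_satisfies_dedup(sort_keys, dedup_keys):
    # An inserted sort fulfils the downstream Remove Duplicates
    # requirement only if the duplicate keys form a leading prefix
    # of the sort keys, in the same order.
    return sort_keys[:len(dedup_keys)] == dedup_keys

# Sorted on (cust_id, order_date), deduplicating on cust_id: fine.
print(sort_satisfies_dedup(["cust_id", "order_date"], ["cust_id"]))   # True
# Sorted on order_date first: duplicates of cust_id are not adjacent.
print(sort_satisfies_dedup(["order_date", "cust_id"], ["cust_id"]))   # False
```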

-Kumar
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

Are the key columns on which you are sorting and the key columns used to remove duplicates the same?

-Balaji S.R
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
For one question the answers can be identical, but how come even the time of posting matched? :wink:

-Kumar
ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Remove Duplicates using Internal Sort

Post by ravij »

Hi,

Thanks for your advice. I got the solution for it. But I have one more doubt.

Here are four cases:
1) SeqFile ---> RemDup stage ---> Peek
In this case I processed the data without sorting it, yet it gave the expected results.

2) I used the internal sort (Perform Sort) and processed the same data. It gave the same results as the case above.

3) I used an external Sort stage before the RemDup stage and processed the data; again I got the same result.

4) SeqFile ---> Sort stage ---> Peek. I did not use a RemDup stage; instead I set "Allow Duplicates = False" in the Sort stage, and I still got the same results as in the cases above.

My doubts:
1) I am able to remove the duplicates using a single Sort stage. Why do we need the Remove Duplicates stage at all?
2) In case 1 I was able to remove the duplicates without sorting the data. Is it not necessary to sort the data before removing duplicates?
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?

My doubt may be a bit long, but please read it patiently and answer.

Thanks in advance.
Ravi
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

ravij wrote:
2) In case 1, without sorting the data, I was still able to remove the duplicates. Is it not necessary to sort the data before removing duplicates?
If your data happens to arrive already sorted, it will work this way: the stage only compares adjacent records. Check whether the identical key values are adjacent to each other in the file. To verify, rearrange the input so that identical key values are no longer adjacent and run the job again.
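The adjacency behaviour is easy to demonstrate outside DataStage. A small Python sketch (illustrative, not DataStage code) of a remove-duplicates pass that only compares neighbouring records:

```python
from itertools import groupby

def remove_adjacent_duplicates(rows, key):
    # Keep the first record of every run of equal keys -- this is all
    # an adjacent-compare duplicate remover can do.
    return [next(group) for _, group in groupby(rows, key=key)]

unsorted_rows = [("A", 1), ("B", 2), ("A", 3)]   # equal keys NOT adjacent
sorted_rows   = [("A", 1), ("A", 3), ("B", 2)]   # equal keys adjacent

print(remove_adjacent_duplicates(unsorted_rows, key=lambda r: r[0]))
# [('A', 1), ('B', 2), ('A', 3)] -- the second 'A' survives
print(remove_adjacent_duplicates(sorted_rows, key=lambda r: r[0]))
# [('A', 1), ('B', 2)] -- all duplicates removed
```

So an unsorted file can still deduplicate "correctly" by luck, if its equal keys happen to sit together.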

ravij wrote:
3) What is the difference between the internal sort (defined in the RemDup stage itself) and an external Sort stage? Both perform the same operation, so which is the better one to use, and which is the performance bottleneck?
An external Sort stage will make your job much easier for others to understand.

It also gives you the option of sorting only on the secondary keys when the data is already sorted on the primary key. This may improve performance and also save temporary space.
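What that option buys you can be sketched in Python (illustrative, not DataStage code): when the data is already ordered on the primary key, only one primary-key group has to be held and re-sorted at a time, instead of sorting the whole dataset again.

```python
from itertools import groupby

def sort_secondary(rows, primary, secondary):
    # rows are assumed already sorted on the primary key; re-sort each
    # primary-key group on the secondary key, one group at a time.
    out = []
    for _, group in groupby(rows, key=primary):
        out.extend(sorted(group, key=secondary))
    return out

rows = [("A", 2), ("A", 1), ("B", 3), ("B", 1)]  # sorted on column 0 only
print(sort_secondary(rows, primary=lambda r: r[0], secondary=lambda r: r[1]))
# [('A', 1), ('A', 2), ('B', 1), ('B', 3)]
```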

ravij wrote:
1) I am able to remove the duplicates using a single Sort stage. Why do we need the Remove Duplicates stage?
I am not sure of the answer to this one. The experts will help you.

-Balaji S.R
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
If I remember properly, Arnd gave a wonderful example for this issue.
Say you want to vacate your house and move to another place with all your household goods, and say you have a big container lorry to carry them, a racing car, and your BMW.
Since you have to carry your goods in the container lorry, would you choose the same vehicle for your own travel? Or would you choose the racing car for normal roads?
Every vehicle is designed for its own purpose: a racing car for normal travel, or a lorry for racing, won't be efficient.

Prefer the explicit Sort stage, unless you have constraints on the stages you can use.
Again, it all depends on the circumstances.

-Kumar
ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Post by ravij »

Hi Balaji,

Thanks for your advice.
balajisr wrote:
This may improve performance and also save temporary space.
How will it save the temporary space? Could you explain in detail, please?

Thanks in advance
Ravi
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

Sorting is done on the primary keys and, within each primary key, on the secondary keys. When the data is already sorted on the primary keys, only the secondary keys within each primary-key group need to be sorted.

Did you try working on your doubt no. 2? If so, did you find out why it allowed you to remove duplicates even though the dataset is not sorted?

- Balaji S.R
ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Post by ravij »

Hi Balaji,
balajisr wrote:
Did you try working on your doubt no. 2? If so, did you find out why it allowed you to remove duplicates even though the dataset is not sorted?
I looked at the source file; it is not in sorted form, but the job produces the proper output.

I don't know what the reason is.
Any assistance will be appreciated.
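One way to check the file is a sketch like this (plain Python, with the key values assumed to be extracted from the file already): full sorting is not required, only that every repeated key value forms a single contiguous run.

```python
from itertools import groupby

def equal_keys_adjacent(keys):
    # Remove Duplicates works correctly as long as every distinct key
    # value appears in exactly one contiguous run, sorted or not.
    heads = [k for k, _ in groupby(keys)]
    return len(heads) == len(set(heads))

print(equal_keys_adjacent(["C", "A", "A", "B"]))  # True: unsorted, but runs are contiguous
print(equal_keys_adjacent(["A", "B", "A"]))       # False: 'A' appears in two separate runs
```

If this returns True for your key column, the file looks unsorted but still satisfies the only property the stage needs.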

Thanks in advance
Ravi
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Ravi,
If possible, post the pattern of the input data you used, the result you got, and the expected result.
That will make it easier to analyse.
Provided the set of data is small in number. :wink:

-Kumar