Use of funnel and sort

clarcombe · Post by **clarcombe** » Tue Mar 15, 2011 7:35 am

I have two datasets which, when combined, may give me a duplicate key. I am using a funnel to join them, followed by a remove duplicates in order to take the row from the first dataset if possible or the second if not in the first.

I am hash partitioning them on the possible duplicate key with a further field SORTORDER which contains 1 or 2 depending on the dataset.

Using the inbuilt sort in remove duplicates, I am sorting them by the key and SORTORDER ascending and taking the first duplicate.

Question
Given that I am hash partitioning them, do I really need the SORTORDER column to ensure I get the row from the first set or is this unnecessary

jwiles · Post by **jwiles** » Tue Mar 15, 2011 9:02 am

You should not include SORTORDER as part of the partitioning key...including it can potentially keep your duplicates from coming together in the same partition when running in parallel. However, DO keep SORTORDER as part of the sort key so that your desired dataset row is kept in the event of duplicates.

In your situation, you should not drop your duplicates within the sort. Place a Remove Duplicates stage after the sort, using just the key that identifies duplicates and keeping the first duplicate record.

Sort identifies and drops duplicates based on the entire sort key, including your SORTORDER column. As SORTORDER equals either 1 or 2 depending on the source, a Sort stage alone will not properly identify your duplicates between the sources.

Regards,

clarcombe · Post by **clarcombe** » Tue Mar 15, 2011 9:26 am

I apologise for my bad description. What you have said is actually what I am doing.

So I am right in assuming the data has to be sorted before removing the duplicates.

I was wondering if I just did a hash partition before the funnel whether the funnel would keep the order on merging the data and would then keep the first row from the first dataset before the second.

jwiles · Post by **jwiles** » Tue Mar 15, 2011 9:58 am

No, you do not have that guarantee with the funnel stage, hence the need to determine it in your logic. This would be true even with the Sort funnel option (functionally equivalent to what has already been described).

Regards,

clarcombe · Post by **clarcombe** » Tue Mar 15, 2011 10:17 am

Thats what I thought. Thanks for the confirmation

ajay.vaidyanathan · Post by **ajay.vaidyanathan** » Wed Mar 16, 2011 12:54 pm

Hi,
You can still retain your first record from first dataset into a funnel by changing the funnel type (which is by default Parallel to Sequencial).

Inside funnel properties you have an option to set sequence funnel. Try that out.

jwiles · Post by **jwiles** » Wed Mar 16, 2011 1:31 pm

Making those changes will still not provide the output which is desired, is unnecessary and will add to the job runtime. Changing the funnel type to Sequence will merely read all of the data from one input link first, then all of the next (an so on until all input links have been read). The data must still be sorted together in order to identify duplicates between the two datasets.

The most appropriate method is what has already been proposed.

Regards,