Use of funnel and sort

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Use of funnel and sort

Post by clarcombe »

I have two datasets which, when combined, may give me a duplicate key. I am using a funnel to join them, followed by a remove duplicates in order to take the row from the first dataset if possible or the second if not in the first.

I am hash partitioning them on the possible duplicate key with a further field SORTORDER which contains 1 or 2 depending on the dataset.

Using the inbuilt sort in remove duplicates, I am sorting them by the key and SORTORDER ascending and taking the first duplicate.

Question
Given that I am hash partitioning them, do I really need the SORTORDER column to ensure I get the row from the first set or is this unnecessary
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

You should not include SORTORDER as part of the partitioning key...including it can potentially keep your duplicates from coming together in the same partition when running in parallel. However, DO keep SORTORDER as part of the sort key so that your desired dataset row is kept in the event of duplicates.

In your situation, you should not drop your duplicates within the sort. Place a Remove Duplicates stage after the sort, using just the key that identifies duplicates and keeping the first duplicate record.

Sort identifies and drops duplicates based on the entire sort key, including your SORTORDER column. As SORTORDER equals either 1 or 2 depending on the source, a Sort stage alone will not properly identify your duplicates between the sources.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

I apologise for my bad description. What you have said is actually what I am doing.

So I am right in assuming the data has to be sorted before removing the duplicates.

I was wondering if I just did a hash partition before the funnel whether the funnel would keep the order on merging the data and would then keep the first row from the first dataset before the second.
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

No, you do not have that guarantee with the funnel stage, hence the need to determine it in your logic. This would be true even with the Sort funnel option (functionally equivalent to what has already been described).

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

Thats what I thought. Thanks for the confirmation
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
ajay.vaidyanathan
Participant
Posts: 53
Joined: Fri Apr 18, 2008 8:13 am
Location: United States

Use of funnel and sort

Post by ajay.vaidyanathan »

Hi,
You can still retain your first record from first dataset into a funnel by changing the funnel type (which is by default Parallel to Sequencial).

Inside funnel properties you have an option to set sequence funnel. Try that out.
Regards
Ajay
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Making those changes will still not provide the output which is desired, is unnecessary and will add to the job runtime. Changing the funnel type to Sequence will merely read all of the data from one input link first, then all of the next (an so on until all input links have been read). The data must still be sorted together in order to identify duplicates between the two datasets.

The most appropriate method is what has already been proposed.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
Post Reply