Merge Stage yielding 'holes' in the output data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Merge Stage yielding 'holes' in the output data

Post by Bill_G »

We had an unusual problem with the output from a Merge stage: records would arbitrarily have blank values for columns, as if Round Robin partitioning had been selected. Partitioning is set to Auto on all of the input links when the behavior is observed.

To give some perspective: we use a Transformer to split a data flow into 10 or more links, a lookup is executed on each flow in parallel (some sparse, some normal), and the data is then merged back together through a Merge stage. When we noticed the holes in the data, we strategically placed Sequential File stages just before the Merge to confirm the lookups were indeed finding matches, and compared that data to the output table on the other side of the Merge (where the 'holes', or blank columns, were found).
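For illustration, here is a minimal plain-Python sketch (not DataStage; keys, columns, and node count are hypothetical) of the suspected failure mode: Round Robin routes rows by arrival order rather than by key, so a master row and its matching update row can land on different nodes, and the per-node merge then emits blanks for the update columns.

```python
from itertools import cycle

NODES = 4  # hypothetical node count from the configuration file

def round_robin(rows):
    # Deal rows to nodes in arrival order, ignoring the key entirely.
    parts = [[] for _ in range(NODES)]
    for row, node in zip(rows, cycle(range(NODES))):
        parts[node].append(row)
    return parts

master  = [{"id": k} for k in [3, 1, 4, 1, 5]]                   # one arrival order
updates = [{"id": k, "phone": "555"} for k in [1, 1, 3, 4, 5]]   # a different order

m_parts, u_parts = round_robin(master), round_robin(updates)

for node in range(NODES):
    matched = {r["id"] for r in u_parts[node]}
    for row in m_parts[node]:
        # No matching update row on this node -> blank column ("hole") in the output.
        print(node, {**row, "phone": "555" if row["id"] in matched else None})
```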

We experimented and found that setting the partitioning on the input links to the Merge stage to 'Entire' yielded the correct result. According to the PX Guide:

The following partitioning methods are available:
• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default collection method for the Merge stage.

Can anyone recommend best practices for partitioning with keyed input to a merge? Has anyone run into this same issue?

Thanks!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Merge is a "master and updates" model. Would a Funnel stage fulfil your requirement more exactly?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

Ray,

First, thanks for the idea.

I'll give it a try; at first glance, I don't see why it wouldn't work. To reiterate the process: we are breaking a record into multiple parts, executing lookups in parallel on each link, then bringing the data back together, along with the new columns gained in the lookups, into a single link.

The Merge seemed to work with the input partitioning set to Entire, but I wonder whether the sorted Funnel will use fewer resources.


Bill
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Entire guarantees that all records will exist on each partition, so all valid lookups will work. If you use any other partitioning algorithm with the Lookup, Merge or Join stages you must make sure that the Data Sets are identically (sorted and) partitioned on the lookup key columns.
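As a rough illustration of that guarantee (a plain-Python sketch with a hypothetical node count and data, not DataStage itself): Entire gives every node a full copy of the reference rows, so a lookup on any node always finds its match, whatever partitioning the stream side used.

```python
NODES = 4  # hypothetical node count

def entire(rows):
    """Replicate the whole dataset onto every node."""
    return [list(rows) for _ in range(NODES)]

reference = [{"id": k, "phone": f"555-{k:04d}"} for k in range(8)]
ref_parts = entire(reference)

# All keys are present on every node, so no lookup can miss.
for node in range(NODES):
    assert {r["id"] for r in ref_parts[node]} == {r["id"] for r in reference}
```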
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

So if I had 20 input links to the Merge (one for each lookup and one for the master flow) and set them all to Auto, it is possible that DS would calculate the partitioning strategy for one of the links differently from the other 19, causing records to have missing values for some columns. Correct?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Correct, if you had partitioned the data feeding the update links differently when creating them. Do you check what partitioning has been used?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

We set the partitioning on all of the links to 'Auto'. Can I see which method DS actually chooses when this setting is used?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Yes, you can check it in the score dump. But if you select Hash partitioning on the merge key for all of the links, it should work. Selecting Auto does not guarantee that the right partitioning is chosen; you can find complaints about this in some earlier versions. In most cases Round Robin won't land the same key on the same node for the Master and Updates, and hence it is not an appropriate partitioning method for this stage.
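A sketch of that alternative (again plain Python with hypothetical data and node count, not DataStage): hash-partitioning every input link on the merge key sends all rows for a given key to the same node, so the per-node merge can always pair master and update rows, however many links there are.

```python
NODES = 4  # hypothetical node count

def hash_partition(rows, key):
    # Route each row by hashing its key column, so equal keys co-locate.
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

master = [{"id": k} for k in range(8)]
# Nineteen update links plus the master, as in the 20-link job described above.
links = [[{"id": k, f"col{n}": n} for k in range(8)] for n in range(19)]

m_parts = hash_partition(master, "id")
l_parts = [hash_partition(link, "id") for link in links]

# Every update link places exactly the same keys on each node as the master.
for node in range(NODES):
    keys = {r["id"] for r in m_parts[node]}
    assert all({r["id"] for r in lp[node]} == keys for lp in l_parts)
```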
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You have to enable the APT_DUMP_SCORE environment variable.
Hint: make it a job parameter, then you can choose whether or not to use it on any particular job run.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply