Merge Stage yielding 'holes' in the output data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Merge Stage yielding 'holes' in the output data

Post by Bill_G »

We had an unusual problem with the output from a Merge stage: records would arbitrarily have blank values for columns, as if Round Robin partitioning had been selected. Partitioning is set to Auto on all of the input links when the behavior is observed.

To give some perspective: we use a Transformer to split a data flow into 10 or more links, a lookup is executed on each flow in parallel (some sparse, some normal), and the data is then merged back together through a Merge stage. When we noticed the holes in the data, we strategically placed Sequential File stages just before the Merge to confirm the lookups were indeed finding matches, and compared that data to the output table on the other side of the Merge (where the 'holes', or blank columns, were found).
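For illustration, here is a minimal plain-Python sketch (not DataStage; keys, columns, and node count are hypothetical) of the suspected failure mode: Round Robin routes rows by arrival order rather than by key, so a master row and its matching update row can land on different nodes, and the per-node merge then emits blanks for the update columns.

```python
from itertools import cycle

NODES = 4  # hypothetical node count from the configuration file

def round_robin(rows):
    # Deal rows to nodes in arrival order, ignoring the key entirely.
    parts = [[] for _ in range(NODES)]
    for row, node in zip(rows, cycle(range(NODES))):
        parts[node].append(row)
    return parts

master  = [{"id": k} for k in [3, 1, 4, 1, 5]]                   # one arrival order
updates = [{"id": k, "phone": "555"} for k in [1, 1, 3, 4, 5]]   # a different order

m_parts, u_parts = round_robin(master), round_robin(updates)

for node in range(NODES):
    matched = {r["id"] for r in u_parts[node]}
    for row in m_parts[node]:
        # No matching update row on this node -> blank column ("hole") in the output.
        print(node, {**row, "phone": "555" if row["id"] in matched else None})
```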

We experimented and found that setting the partitioning on the input links to the Merge stage to 'Entire' yielded the correct result. According to the PX Guide:

The following partitioning methods are available:
• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default collection method for the Merge stage.

Can anyone recommend best practices for partitioning with keyed input to a merge? Has anyone run into this same issue?

Thanks!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Merge is a "master and updates" model. Would a Funnel stage fulfil your requirement more exactly?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

Ray,

First, thanks for the idea.

I'll give it a try; at first glance, I don't see why it wouldn't work. To reiterate the process: we are breaking a record into multiple parts, executing lookups in parallel on each link, then bringing the data back together, along with the new columns gained in the lookups, into a single link.

The Merge seemed to work with the input partitioning set to Entire, but I wonder whether the sorted Funnel will use fewer resources.


Bill
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Entire guarantees that all records will exist on each partition, so all valid lookups will work. If you use any other partitioning algorithm with the Lookup, Merge or Join stages you must make sure that the Data Sets are identically (sorted and) partitioned on the lookup key columns.
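As a rough illustration of that guarantee (a plain-Python sketch with a hypothetical node count and data, not DataStage itself): Entire gives every node a full copy of the reference rows, so a lookup on any node always finds its match, whatever partitioning the stream side used.

```python
NODES = 4  # hypothetical node count

def entire(rows):
    """Replicate the whole dataset onto every node."""
    return [list(rows) for _ in range(NODES)]

reference = [{"id": k, "phone": f"555-{k:04d}"} for k in range(8)]
ref_parts = entire(reference)

# All keys are present on every node, so no lookup can miss.
for node in range(NODES):
    assert {r["id"] for r in ref_parts[node]} == {r["id"] for r in reference}
```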
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

So if I had 20 input links to the Merge (one for each lookup and one for the master flow) and set them all to Auto, it is possible that DS would calculate the partitioning strategy for one of the links differently from the other 19, causing records to have missing values for some columns. Correct?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Correct, if you had partitioned the data feeding the update links differently when creating them. Do you check what partitioning has been used?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

We set the partitioning on all of the links to 'Auto'. Can I see which method DS actually chooses when this setting is used?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Yes, you can check it in the score dump. But if you select Hash partitioning on the merge key for all of the links, it should work. Selecting Auto does not guarantee that the right partitioning is chosen; you can find complaints about this in some earlier versions. In most cases Round Robin won't land the same key on the same node for the Master and Updates, and hence it is not an appropriate partitioning method for this stage.
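A sketch of that alternative (again plain Python with hypothetical data and node count, not DataStage): hash-partitioning every input link on the merge key sends all rows for a given key to the same node, so the per-node merge can always pair master and update rows, however many links there are.

```python
NODES = 4  # hypothetical node count

def hash_partition(rows, key):
    # Route each row by hashing its key column, so equal keys co-locate.
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

master = [{"id": k} for k in range(8)]
# Nineteen update links plus the master, as in the 20-link job described above.
links = [[{"id": k, f"col{n}": n} for k in range(8)] for n in range(19)]

m_parts = hash_partition(master, "id")
l_parts = [hash_partition(link, "id") for link in links]

# Every update link places exactly the same keys on each node as the master.
for node in range(NODES):
    keys = {r["id"] for r in m_parts[node]}
    assert all({r["id"] for r in lp[node]} == keys for lp in l_parts)
```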
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You have to enable the APT_DUMP_SCORE environment variable.
Hint: make it a job parameter, then you can choose whether or not to use it on any particular job run.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply