Lookup Files created with Partitioning = Entire

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Lookup Files created with Partitioning = Entire

Post by rwierdsm »

All,

I've got two jobs.

The first job, called BuildLists, reads a couple of sequential files and populates two lookup lists, one hosted in a FileSet and the other in a Dataset. Previously the BuildLists job pointed to the default config file, which defined just a single node. Now it points to a new config file with two nodes defined. Both lists were previously created with partitioning set to 'Entire'.
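For readers unfamiliar with the parallel engine, a two-node configuration file of the kind mentioned above generally looks like the fragment below (the hostname and paths here are placeholders, not taken from the original post):

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
```

Jobs run against this file execute with two partitions instead of one, which is what changes the behaviour of persisted datasets built under the old single-node file.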

The second job reads both lookups. Partitioning is set to 'Auto' throughout and it uses the 2-node config file.

I re-ran the BuildLists job today to recreate the lookups, as some new rows had been added. This is probably the first run since the new config file was set. Now my second job finds double entries in the lookups. My understanding was that setting the partitioning to 'Entire' would make all the rows in the lookup file available to all nodes in the second job, but it seems to make them all available twice. When I set the partitioning for one of the lookups (the fileset) to 'Auto', the second job no longer sees duplicate entries for that lookup.

Please educate me. This is not the behaviour I was expecting.

Rob W.
Rob Wierdsma
Toronto, Canada
bartonbishop.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Does the BuildLists job append or overwrite? Is "Allow duplicates" set?

Entire in an SMP environment should share the records in the (virtual) Data Set across all nodes. There's nothing documented that suggests the behaviour you describe.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
johnthomas
Participant
Posts: 56
Joined: Mon Oct 16, 2006 7:32 am

Post by johnthomas »

I would suggest you use HASH partitioning, since it will partition the lookup data based on the keys you specify. If you set it to Auto, there is no guarantee that the input value and the corresponding lookup data will be on the same node. If you specify Entire, the same lookup data will be duplicated, one copy per node.
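The co-location argument above can be sketched in a few lines of Python (a simulation only, not DataStage code; all names are illustrative):

```python
def hash_partition(rows, key_index, num_nodes):
    """Hash partitioning: rows with the same key always land on the same node."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % num_nodes].append(row)
    return parts

# Stream input and lookup reference data, both keyed on the first column.
stream = [("A", 10), ("B", 20), ("A", 30)]
reference = [("A", "desc-a"), ("B", "desc-b")]

stream_parts = hash_partition(stream, 0, 2)
ref_parts = hash_partition(reference, 0, 2)

# Every stream row finds its reference row on its own node, exactly once.
for node in range(2):
    ref_keys = {r[0] for r in ref_parts[node]}
    assert all(row[0] in ref_keys for row in stream_parts[node])
print("all stream rows matched on their own node")
```

Because both links are hashed on the same key, no copy of the reference data needs to exist on more than one node, so duplicates cannot arise.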
JT
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The issue might be that when you ran the first job with Entire partitioning, all of the data was written to every node. In the second job, setting the lookup partitioning to Entire or Auto then makes all of that (already duplicated) data available on every node again. Hence, as suggested, if your first job writes with Entire partitioning, the lookup job can be HASH partitioned. Or, in the case of SMP, both jobs can be hash partitioned (provided you are sure about the partitioning key). As Ray mentioned, take care of the data that was already available in the lookup set.
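The scenario above can be simulated with a short Python sketch (purely illustrative, not DataStage code; the functions and data are invented for the example):

```python
def entire_partition(rows, num_nodes):
    """Entire partitioning: every node receives a full copy of the data."""
    return [list(rows) for _ in range(num_nodes)]

# Job 1: build the lookup fileset on a 2-node config with Entire partitioning.
# The persisted fileset keeps one copy of the data per partition.
source = [("A", 1), ("B", 2)]
stored_partitions = entire_partition(source, 2)
persisted = [row for part in stored_partitions for row in part]  # 4 rows on disk

# Job 2: read the fileset as lookup reference data, again with Entire.
# Every node sees everything already on disk -- including the extra copies.
lookup_on_each_node = entire_partition(persisted, 2)

matches = [r for r in lookup_on_each_node[0] if r[0] == "A"]
print(len(matches))  # -> 2: each key now matches twice on every node
```

The doubling comes from applying Entire twice: once when the file is written (one copy per partition lands on disk) and once when it is read (every node sees all stored rows, duplicates included).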
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

All,

After monkeying around with this for a while, ignoring it for a while and then returning to it (and swiping someone else's ideas), I've got a partial solution to this problem.

Along with the duplicate rows, we were getting a message:

Code: Select all

Rob1_Lookups: When checking operator: Operator of type "APT_LUTCreateOp": Will partition despite preserve-partitioning flag on dataset on input port 0.
To fix this, we did the following:

1. The DS job that creates the lookup dataset does so with partitioning set to 'Auto'
2. The DS job that reads the lookup dataset sets the partitioning to 'Entire' in the lookup stage (see note 1 below for more details)
3. In the dataset stage (the one connected to the lookup stage), set the Preserve Partitioning value to 'Clear' (see note 2 below)

Note 1 -
- go to properties in the lookup stage
- go to stage properties (right click in the grey part)
- click on the Inputs tab
- choose the lookup link(s) in the drop-down box (DO NOT PICK THE INPUT STREAM LINK)
- click on the Partitioning tab
- choose Partitioning Type 'Entire'

Note 2 -
- go to properties in the dataset stage
- click on the Stage tab
- click on the Advanced tab
- set Preserve Partitioning drop down to 'Clear'

What a pain!

Rob W.
Rob Wierdsma
Toronto, Canada
bartonbishop.com
Post Reply