Lookup Files created with Partitioning = Entire

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Lookup Files created with Partitioning = Entire

Post by rwierdsm »

All,

I've got two jobs.

The first job, called BuildLists, reads a couple of sequential files and populates two lookup lists, one hosted in a FileSet and the other in a Dataset. Previously the BuildLists job pointed to the default config file, which defined just a single node. Now it points to a new config file with two nodes defined. Both lists were previously created with partitioning set to 'Entire'.
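For readers unfamiliar with the parallel engine, a two-node configuration file of the kind mentioned above generally looks like the fragment below (the hostname and paths here are placeholders, not taken from the original post):

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
```

Jobs run against this file execute with two partitions instead of one, which is what changes the behaviour of persisted datasets built under the old single-node file.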

The second job reads both lookups. Partitioning is set to 'Auto' throughout and it uses the 2-node config file.

I re-ran the BuildLists job today to recreate the lookups, as some new rows had been added. This is probably the first run since the new config file was set. Now my second job finds double entries in the lookups. My understanding was that setting the partitioning to 'Entire' would make all the rows in the lookup file available to all nodes in the second job, but it seems to make them all available twice. When I set the partitioning for one of the lookups (the fileset) to 'Auto', the second job no longer sees duplicate entries for that lookup.

Please educate me. This is not the behaviour I was expecting.

Rob W.
Rob Wierdsma
Toronto, Canada
bartonbishop.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Does the BuildLists job append or overwrite? Is "Allow duplicates" set?

Entire in an SMP environment should share the records in the (virtual) Data Set across all nodes. There's nothing documented that suggests the behaviour you describe.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
johnthomas
Participant
Posts: 56
Joined: Mon Oct 16, 2006 7:32 am

Post by johnthomas »

I would suggest you use HASH partitioning, since it will partition the lookup data based on the keys you specify. If you set it to Auto, there is no guarantee that the input value and the corresponding lookup data will be on the same node. If you specify Entire, the same lookup data will be duplicated, one copy per node.
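The co-location argument above can be sketched in a few lines of Python (a simulation only, not DataStage code; all names are illustrative):

```python
def hash_partition(rows, key_index, num_nodes):
    """Hash partitioning: rows with the same key always land on the same node."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % num_nodes].append(row)
    return parts

# Stream input and lookup reference data, both keyed on the first column.
stream = [("A", 10), ("B", 20), ("A", 30)]
reference = [("A", "desc-a"), ("B", "desc-b")]

stream_parts = hash_partition(stream, 0, 2)
ref_parts = hash_partition(reference, 0, 2)

# Every stream row finds its reference row on its own node, exactly once.
for node in range(2):
    ref_keys = {r[0] for r in ref_parts[node]}
    assert all(row[0] in ref_keys for row in stream_parts[node])
print("all stream rows matched on their own node")
```

Because both links are hashed on the same key, no copy of the reference data needs to exist on more than one node, so duplicates cannot arise.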
JT
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The issue might be that when you ran the first job with Entire partitioning, all of the data was written to every node. In the second job, setting the lookup partitioning to Entire or Auto then makes all of that (already duplicated) data available on every node again. Hence, as suggested, if your first job writes with Entire partitioning, the lookup job can be HASH partitioned. Or, in the case of SMP, both jobs can be hash partitioned (provided you are sure about the partitioning key). As Ray mentioned, take care of the data that was already available in the lookup set.
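The scenario above can be simulated with a short Python sketch (purely illustrative, not DataStage code; the functions and data are invented for the example):

```python
def entire_partition(rows, num_nodes):
    """Entire partitioning: every node receives a full copy of the data."""
    return [list(rows) for _ in range(num_nodes)]

# Job 1: build the lookup fileset on a 2-node config with Entire partitioning.
# The persisted fileset keeps one copy of the data per partition.
source = [("A", 1), ("B", 2)]
stored_partitions = entire_partition(source, 2)
persisted = [row for part in stored_partitions for row in part]  # 4 rows on disk

# Job 2: read the fileset as lookup reference data, again with Entire.
# Every node sees everything already on disk -- including the extra copies.
lookup_on_each_node = entire_partition(persisted, 2)

matches = [r for r in lookup_on_each_node[0] if r[0] == "A"]
print(len(matches))  # -> 2: each key now matches twice on every node
```

The doubling comes from applying Entire twice: once when the file is written (one copy per partition lands on disk) and once when it is read (every node sees all stored rows, duplicates included).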
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

All,

After monkeying around with this for a while, ignoring it for a while and then returning to it (and swiping someone else's ideas), I've got a partial solution to this problem.

Along with the duplicate rows, we were getting a message:

Code: Select all

Rob1_Lookups: When checking operator: Operator of type "APT_LUTCreateOp": Will partition despite preserve-partitioning flag on dataset on input port 0.
To fix this, we did the following:

1. The DS job that creates the lookup dataset does so with partitioning set to 'Auto'
2. The DS job that reads the lookup dataset sets the partitioning to 'Entire' in the lookup stage (see note 1 below for more details)
3. In the dataset stage (the one connected to the lookup stage), set the Preserve Partitioning value to 'Clear' (see note 2 below)

Note 1 -
- go to properties in the lookup stage
- go to stage properties (right click in the grey part)
- click on the Inputs tab
- choose the lookup link(s) in the drop-down box (DO NOT PICK THE INPUT STREAM LINK)
- click on the Partitioning tab
- choose Partitioning Type 'Entire'

Note 2 -
- go to properties in the dataset stage
- click on the Stage tab
- click on the Advanced tab
- set Preserve Partitioning drop down to 'Clear'

What a pain!

Rob W.
Rob Wierdsma
Toronto, Canada
bartonbishop.com
Post Reply