Why not hash partitioning for lookup stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
udayanguha
Premium Member
Premium Member
Posts: 37
Joined: Wed Oct 29, 2014 10:48 pm
Location: Ohio

Why not hash partitioning for lookup stage

Post by udayanguha »

Why is it that we can use a hash partition for join but not for lookup.
Also, why can we use entire partition for join stage. I mean hash partition the master data and entire partition the reference data for a join stage.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can use any partitioning anywhere. But only certain partitioning will achieve correct results when key adjacency is required.

It is actually best practice to partition both inputs of the Lookup stage identically. (Auto) uses Entire because there's no requirement that the column names be the same. Entire is guaranteed to work in all cases.

So are Hash or Modulus, provided you've partitioned using the correct columns.

Using Entire for Join and Merge stage will results in far too many rows being output, since every row on the link will be duplicated on every node.

Also, where the configuration file specifies multiple machines, Entire will require rows to be sent across the network to other nodes. That is an overhead that can be avoided by using Hash or Modulus.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
udayanguha
Premium Member
Premium Member
Posts: 37
Joined: Wed Oct 29, 2014 10:48 pm
Location: Ohio

Post by udayanguha »

As far as I know, lookup stage (in auto mode) partitions the master data in any partition (except entire) and reference data in entire partition. Is my understanding correct? Because if lookup uses entire partition on both links (master and reference) then output will have duplicate data.

In the same way if a join uses any partitioning on master (except entire) and entire partition on reference data, would the output be not correct.
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

By using entire partitioning on reference link (expected to be less data than stream link) it eliminates the requirement of repartitioning the input stream link data, and to answer you question, Yes, it insert entire partitioning on reference link by default and entire partitioning ensures that you will always get a match if its present in reference data. You can use hash partitioning but make sure both links are partitioned on the same key and data type of keys on both link are exactly the same.

And for the second one, technically, you may get the correct result given no duplicates are there. But what is the point you are trying to make here. Join is generally used where there is large amount of data present in both links and entire will need contiguous memory block to fit the entire data in one link which should not be done.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
udayanguha
Premium Member
Premium Member
Posts: 37
Joined: Wed Oct 29, 2014 10:48 pm
Location: Ohio

Post by udayanguha »

So, if I understood the points correctly-
If reference data is not very large and if I use either join or lookup stage with entire partitioning, output would be the same? Also, in this case performance would also be same, is it?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you are able to use a Lookup stage in an SMP environment chances are that it will result in a job that finishes faster, because there is no requirement to sort the data. Entire partitioning is managed via shared memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
udayanguha
Premium Member
Premium Member
Posts: 37
Joined: Wed Oct 29, 2014 10:48 pm
Location: Ohio

Post by udayanguha »

Thanks Ray, sorry for being so questionative but I still have a slight doubt.
So lookup stage will work faster because no sorting is required.
Going by the same logic, if I still use a join stage with entire partition and disable the APT_NO_SORT_INSERTION environment variable, will join and lookup give me same performance and same output.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Using Entire partitioning with a Join stage will give you the Cartesian product of the inputs. You don't want to go there.

The remainder of your questions are moot.

Believe the documentation. The Join stage requires that its inputs be identically partitioned using a key-based partitioning algorithm (to ensure key adjacency). The framework will insert a partitioner if you have not specified key-based partitioning. The Join stage also requires that its inputs be identially sorted on the Join keys, so that it only has to deal with a single join key value in memory at a time. By disabling insertion of tsort operators you risk the job aborting with "row out of order" errors, or of incorrect results in the case of an outer join.

If you are 100% confident that your data are correctly sorted prefer the "check only" environment variable; this means that the inserted tsort operator will not actually sort the data; it will only check that they are sorted.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply