Why not hash partitioning for lookup stage

udayanguha · Post by **udayanguha** » Fri Oct 23, 2015 5:33 pm

Why is it that we can use a hash partition for join but not for lookup.
Also, why can we use entire partition for join stage. I mean hash partition the master data and entire partition the reference data for a join stage.

ray.wurlod · Post by **ray.wurlod** » Sat Oct 24, 2015 10:57 pm

You can use any partitioning anywhere. But only certain partitioning will achieve correct results when key adjacency is required.

It is actually best practice to partition both inputs of the Lookup stage identically. (Auto) uses Entire because there's no requirement that the column names be the same. Entire is guaranteed to work in all cases.

So are Hash or Modulus, provided you've partitioned using the correct columns.

Using Entire for Join and Merge stage will results in far too many rows being output, since every row on the link will be duplicated on every node.

Also, where the configuration file specifies multiple machines, Entire will require rows to be sent across the network to other nodes. That is an overhead that can be avoided by using Hash or Modulus.

udayanguha · Post by **udayanguha** » Sun Oct 25, 2015 6:11 pm

As far as I know, lookup stage (in auto mode) partitions the master data in any partition (except entire) and reference data in entire partition. Is my understanding correct? Because if lookup uses entire partition on both links (master and reference) then output will have duplicate data.

In the same way if a join uses any partitioning on master (except entire) and entire partition on reference data, would the output be not correct.

priyadarshikunal · Post by **priyadarshikunal** » Tue Oct 27, 2015 2:31 am

By using entire partitioning on reference link (expected to be less data than stream link) it eliminates the requirement of repartitioning the input stream link data, and to answer you question, Yes, it insert entire partitioning on reference link by default and entire partitioning ensures that you will always get a match if its present in reference data. You can use hash partitioning but make sure both links are partitioned on the same key and data type of keys on both link are exactly the same.

And for the second one, technically, you may get the correct result given no duplicates are there. But what is the point you are trying to make here. Join is generally used where there is large amount of data present in both links and entire will need contiguous memory block to fit the entire data in one link which should not be done.

udayanguha · Post by **udayanguha** » Wed Oct 28, 2015 7:58 pm

So, if I understood the points correctly-
If reference data is not very large and if I use either join or lookup stage with entire partitioning, output would be the same? Also, in this case performance would also be same, is it?

ray.wurlod · Post by **ray.wurlod** » Wed Oct 28, 2015 8:18 pm

If you are able to use a Lookup stage in an SMP environment chances are that it will result in a job that finishes faster, because there is no requirement to sort the data. Entire partitioning is managed via shared memory.

udayanguha · Post by **udayanguha** » Sun Nov 01, 2015 9:01 pm

Thanks Ray, sorry for being so questionative but I still have a slight doubt.
So lookup stage will work faster because no sorting is required.
Going by the same logic, if I still use a join stage with entire partition and disable the APT_NO_SORT_INSERTION environment variable, will join and lookup give me same performance and same output.

ray.wurlod · Post by **ray.wurlod** » Sun Nov 01, 2015 9:45 pm

Using Entire partitioning with a Join stage will give you the Cartesian product of the inputs. You don't want to go there.

The remainder of your questions are moot.

Believe the documentation. The Join stage requires that its inputs be identically partitioned using a key-based partitioning algorithm (to ensure key adjacency). The framework will insert a partitioner if you have not specified key-based partitioning. The Join stage also requires that its inputs be identially sorted on the Join keys, so that it only has to deal with a single join key value in memory at a time. By disabling insertion of tsort operators you risk the job aborting with "row out of order" errors, or of incorrect results in the case of an outer join.

If you are 100% confident that your data are correctly sorted prefer the "check only" environment variable; this means that the inserted tsort operator will not actually sort the data; it will only check that they are sorted.