look up stage interiors

Posted: Thu Mar 07, 2013 8:57 am
by zulfi123786
Hi

Could someone please shed some light on the questions below?

1. The Lookup stage removes duplicates when "Multiple rows returned from link" is disabled. Does that mean the lookup operator has a tsort underneath to remove duplicates, or are the duplicates identified when the lookup table is built and indexed?

2. If there is no implicit sort, will the lookup table created for the reference data differ depending on whether the data on the reference link is sorted?

3. If the lookup reference data is sorted, will searches be faster, or will it make the job slower? (If the sort does not contribute to faster searches, then an unnecessary sort obviously adds to the overall time.)

Thanks

Posted: Thu Mar 07, 2013 3:47 pm
by rameshrr3
1. It ignores duplicates, but issues a warning message about a duplicate key found on the reference link.
2. No. The reference link in a Lookup stage does NOT require a sort, nor does it sort internally, but it builds an index out of the lookup keys, which may be sorted and/or hashed.
3. You may be able to speed up a lookup merely by excluding duplicates if you need only one looked-up value from the reference link. A sort may or may not help (??), but it is not mandated, and as you noted, sorts can be time- and resource-consuming by themselves as the reference data size increases.

Posted: Thu Mar 07, 2013 11:37 pm
by zulfi123786
rameshrr3 wrote:1. Ignores duplicates , but issues a warning message about a duplicate key ref found.
Duplicate removal/identification algorithms usually rely on sorting, so I am just wondering how duplicates are identified without sorting.
rameshrr3 wrote:2. No - The reference link in a lookup stage does NOT require a sort - nor does it sort internally - but it builds an index out of lookup keys which may be sorted+/hashed
Okay, so what I understand is that the data itself is not sorted; only the key values are sorted and indexed, with a mapping to the physical/logical address of the respective row in the lookup table. Did I get you right?

Posted: Fri Mar 08, 2013 1:22 pm
by rameshrr3
I am not very sure how duplicates are identified internally, but duplicate lookup keys will resolve to the same byte sequence, so there is probably an internal structure which has some ordering.

If you have access to the version 8.x documentation, I'd suggest you take a look at the Parallel Job Advanced Developer's Guide; the Orchestrate lookup operator is described in a lot more detail there.

Posted: Fri Mar 08, 2013 7:32 pm
by ray.wurlod
The key values are NOT sorted. A hash table index is built.
If there is more than one value associated with any one row in that hash table, then there are duplicates for that value.
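
To illustrate the point about the hash table, here is a minimal Python sketch of how duplicates can be detected without any sort. It is not the actual Orchestrate lookup operator: the function name build_lookup_table, the (key, payload) row shape, and the choice to keep the first row seen for a duplicate key are all assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical row shape for the reference link: (lookup key, payload).
Row = Tuple[Hashable, str]


def build_lookup_table(
    rows: Iterable[Row], allow_multiple: bool = False
) -> Tuple[Dict[Hashable, List[str]], List[Hashable]]:
    """Build a hash index over the reference rows in a single pass.

    Returns the table plus the list of keys for which a duplicate row
    was dropped (a real job would log a warning for each of these).
    Assumption: when duplicates are not allowed, the first row wins.
    """
    table: Dict[Hashable, List[str]] = {}
    dropped_keys: List[Hashable] = []
    for key, payload in rows:
        bucket = table.setdefault(key, [])  # hash probe, no ordering needed
        if bucket and not allow_multiple:
            # The key is already present, so this row is a duplicate.
            dropped_keys.append(key)
            continue
        bucket.append(payload)
    return table, dropped_keys


# Unsorted reference data with one duplicate key (3).
reference = [(3, "c"), (1, "a"), (3, "c2"), (2, "b")]

table, dropped = build_lookup_table(reference)
table_from_sorted, _ = build_lookup_table(sorted(reference))

print(dropped)                     # keys that had a duplicate row dropped
print(table == table_from_sorted)  # whether pre-sorting changed the table
```

In this sketch a duplicate is simply a key whose bucket is already occupied, so no ordering of the input is needed to find it. Each lookup is a single hash probe either way, which is why a sort on the reference link would only add cost here. The one thing input order can affect is which of several duplicate rows happens to arrive first.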