look up stage interiors

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Hi

Could someone please shed some light on the questions below?

1. The Lookup stage drops duplicates when "multiple rows returned from link" is disabled. Does that mean the lookup operator has a tsort underneath to remove duplicates, or are the duplicates identified when the lookup table is built and indexed?

2. If there is no implicit sort, will the lookup table created for the reference data differ depending on whether the data on the reference link is sorted or not?

3. If the lookup reference data is sorted, will that make the searches faster, or will it make the job slower? (If the sort does not contribute to faster searches, then an unnecessary sort obviously adds to the overall time.)

Thanks
- Zulfi
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

1. It ignores duplicates, but issues a warning message that a duplicate key was found on the reference link.
2. No. The reference link in a Lookup stage does NOT require a sort, nor does the stage sort internally, but it builds an index out of the lookup keys, which may be sorted and/or hashed.
3. You may be able to speed up a lookup merely by excluding duplicates if you need only one looked-up value from the reference link. A sort may or may not help(??), but it is not mandated, and as you noted, sorts can be time- and resource-consuming by themselves as the reference data size increases.
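As a rough illustration of point 3, here is a minimal Python sketch. It assumes the reference index behaves like a plain hash table (a Python `dict` stands in for it); `build_index` and the sample rows are hypothetical and say nothing about how the Orchestrate lookup operator is actually implemented. Under that assumption, the index built from sorted input answers lookups exactly like the one built from unsorted input, so a pre-sort would only add cost.

```python
import random


def build_index(rows):
    """Build a key -> value hash index, keeping the first value seen per key."""
    index = {}
    for key, value in rows:
        index.setdefault(key, value)
    return index


# Hypothetical reference data: 1000 unique keys in shuffled order.
rows = [(k, f"value-{k}") for k in range(1000)]
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

unsorted_index = build_index(shuffled)
sorted_index = build_index(sorted(shuffled))

# Dict equality ignores insertion order, so this compares lookup results only.
same_results = unsorted_index == sorted_index
```

With unique keys, `same_results` is true regardless of the input order. With duplicate keys, the order would decide which row is kept, which is a separate question from lookup speed.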
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

rameshrr3 wrote: 1. Ignores duplicates, but issues a warning message about a duplicate key ref found.
Duplicate removal/identification algorithms usually rely on sorting, so I am just wondering how duplicates are identified without sorting.
rameshrr3 wrote: 2. No - The reference link in a lookup stage does NOT require a sort - nor does it sort internally - but it builds an index out of lookup keys which may be sorted and/or hashed
Okay, so what I am getting is that the data itself is not sorted; only the key values are sorted and indexed, with a mapping to the physical/logical address of the respective row in the lookup table. Did I get you right?
- Zulfi
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

I'm not very sure how duplicates are identified internally, but duplicate lookup keys will resolve to the same byte sequence, so there is probably an internal structure which has some ordering.

If you have access to the version 8.x documentation, I'd suggest you take a look at the Parallel Job Advanced Developer Guide; the Orchestrate lookup operator is described there in a lot more detail.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The key values are NOT sorted. A hash table index is built.
If there is more than one row associated with any one key value in that hash table, then there are duplicates for that value.
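To make the idea concrete, here is a minimal Python sketch of spotting duplicates while building a hash index, with no sort involved. It is only an illustration: `build_lookup_table` is a hypothetical helper, a Python `dict` stands in for the hash table, and keeping the first row per key is an assumption, not a statement about what the lookup operator does internally.

```python
def build_lookup_table(reference_rows, key):
    """Index reference rows by key; report keys that occur more than once."""
    table = {}
    duplicate_keys = []
    for row in reference_rows:
        k = row[key]
        if k in table:
            # The hash probe finds an existing entry, so this key is a duplicate.
            # Assumption: keep the first row and skip the rest.
            duplicate_keys.append(k)
            continue
        table[k] = row
    return table, duplicate_keys


# Hypothetical, unsorted reference data with one duplicate key (id 3).
reference = [
    {"id": 3, "name": "c"},
    {"id": 1, "name": "a"},
    {"id": 3, "name": "c2"},
]

table, duplicate_keys = build_lookup_table(reference, "id")
```

Each row costs one hash probe, so the duplicate check comes for free as the table is built, whatever order the reference rows arrive in.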
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.