Partitioning in Lookup Stage

patonp · Post by **patonp** » Tue Feb 20, 2007 3:01 pm

Despite reading through a large number of posts as well as the DS documentation, I'm still not entirely clear on some aspects of partitioning for lookup stages.

I'd like to perform a lookup against a text file and have chosen to use a lookup stage. By default, the "Auto" option is selected as the partition type for both the primary and reference streams. My questions are as follows:

1. With "Auto" selected for both the primary and reference streams, can I rely on DataStage to manage partitioning such that records will successfully be retrieved based on the Key Expression defined in the "main" lookup screen? (I want to make sure that DS will reference the correct partition for the lookup record.)

2. If the partition type is set to "Hash" in the primary stream, will setting the partition type and keys the same way in the reference stream also ensure that the lookup is successful? How is this better or worse than using the "Auto" setting?

3. If the partition type is set to "Entire" for the reference stream, does the partition type of the primary stream matter?

4. Apart from the Partition Key, there are two other types of keys that appear in the Lookup stage: the "Key" column defined in the metadata for a column and the Key Expression in the lookup. Do either of these have any impact on partitioning?

I realize that this is a long post, but I would appreciate any help you can provide.

Peter

kumar_s · Post by **kumar_s** » Tue Feb 20, 2007 5:19 pm

1. Most probably yes (as per documentation), but If I were you, I wont take risk to leave Datastage to decide upon the partition. I ll do it manually.

2. Yes if both the streams are partitioned on HASH on the lookup key, and if you are sure that both will be partitioned exactly the same, it will work. There may be some cases where, Char type in one stream and Varchar type in other stream, which will lead to mismatch.

3. Thats why, "Entire" is preferred. It can match any partition type from Input stream. Be aware that, "Entire" in the input stream will create dulplicates.

4. Partitioning on the Lookup key will make the right choice.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 20, 2007 8:13 pm

For a Lookup stage, "Auto" selects:

Round Robin for the stream input link

Entire for the reference input link(s)

The latter can have performance implications in a cluster/grid configuration, as all rows need to be moved to all nodes.

patonp · Post by **patonp** » Tue Feb 20, 2007 9:25 pm

Thanks to both of you for your quick (and helpful) responses.

Ray - I do have one follow-up question...

Your response suggests that "Auto" can mean different things in different stages. This certainly makes sense considering the various functions performed. In which DataStage document can I find out the behaviour of the "Auto" setting for each stage type? (I was looking through the documentation earlier today, and don't remember seeing this behaviour explicitly described for the lookup stage, though I was probably looking in the wrong place.)

kumar_s · Post by **kumar_s** » Tue Feb 20, 2007 9:32 pm

There is not document given for each stages, but you can find you self by producing the scores. Set APT_DUMP_SCORE in your job and look for the scores, it will show you the details.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 20, 2007 10:32 pm

It's taught in the IBM Advanced DataStage class. Therefore it's documented in the training manual. I'm not aware that it's documented in any of the PDF manuals that ship with the product.

patonp · Post by **patonp** » Wed Feb 21, 2007 7:12 am

Thanks again!