Aggregator + Partition

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Aggregator + Partition

Post by karthi_gana »

All,

When I use the Aggregator stage, I just use the "Auto" partition type; I leave the choice to DataStage. One of my colleagues asked me the question below yesterday.

My task is to get the result of:

a) select max(salary) from employee

using an Aggregator stage. Which is the best partition method for this query? If you use "Auto", which one will be picked by default, and where can I see that?

b) select emp_id,max(salary) from employee group by emp_id

using an Aggregator stage. Which is the best partition method for this query? If you use "Auto", which one will be picked by default, and where can I see that?

I said "Entire" for the first one and "Hash" for the second. Is that correct?
Karthik
surfsup2
Participant
Posts: 4
Joined: Fri Feb 24, 2012 4:01 am

Re: Aggregator + Partition

Post by surfsup2 »

karthi_gana wrote:I said...for the first one "Entire" and for the second one "Hash". is that correct?
This looks like one of those bets people do on their projects when they get bored :)

Entire partitioning should be used only for reference lookups against "small" sets (where you shouldn't repartition the "big" set).

As I said, I don't see the point of partitioning for a max() over an entire table in PX. You can:
a) generate a dummy column and run a hash aggregator in sequential mode on it, or ...
b) generate a dummy column, run a hash aggregator in parallel mode, and then run another hash aggregator in sequential mode after that.


Cheers,
A
Ravi.K
Participant
Posts: 209
Joined: Sat Nov 20, 2010 11:33 pm
Location: Bangalore

Post by Ravi.K »

For a numeric key column, Modulus is the best partition method; for non-numeric columns, Hash is best.

If you want to see which partition method DataStage selects when you set the partition type to Auto, enable the Dump Score environment variable ($APT_DUMP_SCORE) to trace it.
Cheers
Ravi K
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Remember, Entire partitioning sends a copy of the entire data set to each partition (as documented numerous times in the product documentation). If the Aggregator is running in parallel across 4 partitions, you would receive 4 identical results with Entire partitioning. You normally will not use Entire partitioning on the input to an Aggregator stage.

Hash partitioning is the most commonly used partition type and will work with multiple columns of any data type. Modulus partitioning will work with only 1 column, which must be an integer. For a single integer column, hash and modulus can provide different data distributions across the partitions depending upon the data values. I usually choose the one which provides the most even distribution.
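The difference in distribution between modulus and hash can be sketched in plain Python (this is an illustration only, not DataStage internals; md5 is a stand-in for DataStage's own hash function):

```python
# Illustrative comparison in plain Python (NOT DataStage internals) of how
# modulus and hash partitioning can distribute the same integer keys
# differently. Modulus sends each key to partition (key % N); a hash
# partitioner applies a hash function first, which can smooth out skew.
import hashlib
from collections import Counter

def modulus_partition(key, n):
    return key % n

def hash_partition(key, n):
    # stand-in hash function; DataStage uses its own internal hash
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n

keys = [10, 20, 30, 40, 50, 60, 70, 80]  # all multiples of 10
n = 2
mod_counts = Counter(modulus_partition(k, n) for k in keys)
hash_counts = Counter(hash_partition(k, n) for k in keys)
print(mod_counts)   # all 8 keys land in partition 0 (badly skewed)
print(hash_counts)  # the hash typically spreads keys across both partitions
```

With every key even, modulus over 2 partitions puts all rows in one partition, which is exactly the kind of skew worth checking for before choosing modulus.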

For #1, there are two options:
a) Run the aggregator in sequential mode
b) Run two aggregators (both performing the max() aggregation), the first in parallel and the second in sequential (only useful if the job is running in multiple partitions).
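The two-aggregator pattern in (b) can be sketched in plain Python (a conceptual illustration, not DataStage code; the salary values are made up):

```python
# Plain-Python sketch (not DataStage code) of option (b): each partition
# computes a local max in parallel, then one sequential aggregator reduces
# the partial results to the global max.
salaries = [48000, 72000, 55000, 91000, 63000, 87000]

# Stage 1: "parallel" aggregators -- one local max per partition
num_partitions = 2
partitions = [salaries[i::num_partitions] for i in range(num_partitions)]
partial_maxes = [max(p) for p in partitions]

# Stage 2: sequential aggregator over the (tiny) set of partial results
global_max = max(partial_maxes)
print(global_max)  # 91000, the same answer as max(salaries)
```

The point of the parallel first stage is that the sequential aggregator only ever sees one row per partition, not the whole table.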

For #2, hash is fine but you didn't specify which key column to hash on. Which should it be? (Hint: the select statement gives a big clue!)

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

emp_id will be chosen as the key for hash partitioning for #2, right?
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

Ravi.K wrote:If you want to see what partition Datastage selects when you select Partition as Auto then enable Dump score Environment variable to trace the Partition method.
I have included $APT_DUMP_SCORE in my job, but I didn't see which partition method DataStage used by default.
Karthik
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

If you're allowing DataStage to choose the partitioning for you (that is, partition insertion has not been disabled), then DataStage will use the keys on which you are aggregating as the partitioning keys.

If partition insertion is not disabled, you should see an entry in the score which probably has "Partitioner" as part of the name. You may see this multiple times if the job also has sorts, joins or other stages you would normally partition and/or sort for.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

"if partition insertion is not disabled..."

How should I check this?

$APT_NO_PART_INSERTION = False

Do I need to include this in the job to see which partition method is used by default?
Karthik
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

To see what type of partitioning is being used by the inserted partitioners, examine the job score after setting $APT_DUMP_SCORE=1 and look for the partitioners as I described in an earlier post to this thread. In most cases, DataStage will use hash partitioning when inserting a partitioner.

APT_NO_PART_INSERTION simply controls whether or not partitioners will be added where needed. If set to true or 1, partitioners will not be added. If set to false or 0, partitioners may be added depending upon your job design and the options chosen.

Here's an example job score from a simple RowGen->Sort->Peek with all partitioning set to Auto. Can you spot the partitioner and see what type of partitioning was used?

Code:

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Row_Generator_0)
      eOther(APT_HashPartitioner { key={ value=KEY_1, 
        subArgs={ asc }
      },
  key={ value=KEY_2, 
        subArgs={ asc }
      }
})<>eCollectAny
      op1[2p] (parallel Sort_2)}
ds1: {op1[2p] (parallel Sort_2)
      [pp] eSame=>eCollectAny
      op2[2p] (parallel Peek_1)}
It has 3 operators:
op0[1p] {(sequential Row_Generator_0)
    on nodes (
      node1[op0,p0]
    )}
op1[2p] {(parallel Sort_2)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel Peek_1)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
It runs 5 processes on 2 nodes.
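One way to pick the inserted partitioner out of a score like the one above is a simple text scan. A minimal Python sketch, assuming the score text has been captured from the job log and that inserted partitioners follow the APT_<Type>Partitioner naming seen in the dump:

```python
# Minimal sketch: scan a captured job score for inserted partitioners.
# The score fragment below is condensed from the example in this thread.
import re

score = """
ds0: {op0[1p] (sequential Row_Generator_0)
      eOther(APT_HashPartitioner { key={ value=KEY_1 },
  key={ value=KEY_2 }
})<>eCollectAny
      op1[2p] (parallel Sort_2)}
"""

# Inserted partitioners appear as operators named APT_<Type>Partitioner
partitioners = re.findall(r"APT_(\w+)Partitioner", score)
print(partitioners)  # ['Hash']
```

Here the scan confirms that DataStage inserted a hash partitioner on the input to the sort.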
Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

RowGen = Hash Partitioner
(sequential Row_Generator_0)
eOther(APT_HashPartitioner { key={ value=KEY_1,
subArgs={ asc }
},
key={ value=KEY_2,
subArgs={ asc }
It uses 2 keys.


Sort = Unable to find
})<>eCollectAny
op1[2p] (parallel Sort_2)}
ds1: {op1[2p] (parallel Sort_2)
I don't see any word before <>eCollectAny, so I have no clue about this partitioner.

Peek = Same partitioner
[pp] eSame=>eCollectAny
op2[2p] (parallel Peek_1)}

RowGen = Uses 1 Node
Sort = Uses 2 Nodes
Peek = Uses 2 Nodes
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

The interesting part I noticed was...

I can see three symbols:

->
means sequential to sequential

=>

<No idea>

<>
means sequential to parallel

(Note: I gathered the meanings of those two symbols from my previous post.)

Are the above meanings correct?

Are there any other symbols used in the score?

Where can I find more information on this?
Karthik
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This kind of information is available by taking the Advanced DataStage class.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sri_vin
Premium Member
Posts: 20
Joined: Wed Aug 25, 2010 10:58 pm

Post by sri_vin »

=> means Parallel to Parallel (SAME).

As Ray said, you will study this in the Advanced DataStage course.