Aggregator + Partition

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Aggregator + Partition

Post by karthi_gana »

All,

When I use the Aggregator stage, I just use the "Auto" partition type; I leave the choice to DataStage. One of my colleagues asked me the question below yesterday.

My task is to get the result of:

a) select max(salary) from employee

using an Aggregator stage. Which is the best partition method for this query? If you use "Auto", which one will be picked by default, and where can I see that?

b) select emp_id,max(salary) from employee group by emp_id

using an Aggregator stage. Which is the best partition method for this query? If you use "Auto", which one will be picked by default, and where can I see that?

I said "Entire" for the first one and "Hash" for the second. Is that correct?
Karthik
surfsup2
Participant
Posts: 4
Joined: Fri Feb 24, 2012 4:01 am

Re: Aggregator + Partition

Post by surfsup2 »

karthi_gana wrote:I said...for the first one "Entire" and for the second one "Hash". is that correct?
This looks like one of those bets people do on their projects when they get bored :)

Entire partitioning should be used only for reference lookups against "small" sets (where you shouldn't repartition the "big" set).

As I said, I don't see the point of partitioning for a max() over an entire table in PX. You can:
a) generate a dummy column and run a hash aggregator in sequential mode on it, or ...
b) generate a dummy column, run a hash aggregator in parallel mode, and then run another hash aggregator in sequential mode after that.


Cheers,
A
Ravi.K
Participant
Posts: 209
Joined: Sat Nov 20, 2010 11:33 pm
Location: Bangalore

Post by Ravi.K »

For a numeric key column, Modulus is the best partition method; for non-numeric columns, Hash is best.

If you want to see which partition method DataStage selects when you set the partition type to Auto, enable the Dump Score environment variable ($APT_DUMP_SCORE) to trace it.
Cheers
Ravi K
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Remember, Entire partitioning sends a copy of the entire data set to each partition (as documented numerous times in the product documentation). If the Aggregator is running in parallel across 4 partitions, you would receive 4 identical results with Entire partitioning. You normally will not use Entire partitioning on the input to an Aggregator stage.

Hash partitioning is the most commonly used partition type and will work with multiple columns of any data type. Modulus partitioning will work with only 1 column, which must be an integer. For a single integer column, hash and modulus can provide different data distributions across the partitions depending upon the data values. I usually choose the one which provides the most even distribution.
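The difference in distribution between modulus and hash can be sketched in plain Python (this is an illustration only, not DataStage internals; md5 is a stand-in for DataStage's own hash function):

```python
# Illustrative comparison in plain Python (NOT DataStage internals) of how
# modulus and hash partitioning can distribute the same integer keys
# differently. Modulus sends each key to partition (key % N); a hash
# partitioner applies a hash function first, which can smooth out skew.
import hashlib
from collections import Counter

def modulus_partition(key, n):
    return key % n

def hash_partition(key, n):
    # stand-in hash function; DataStage uses its own internal hash
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n

keys = [10, 20, 30, 40, 50, 60, 70, 80]  # all multiples of 10
n = 2
mod_counts = Counter(modulus_partition(k, n) for k in keys)
hash_counts = Counter(hash_partition(k, n) for k in keys)
print(mod_counts)   # all 8 keys land in partition 0 (badly skewed)
print(hash_counts)  # the hash typically spreads keys across both partitions
```

With every key even, modulus over 2 partitions puts all rows in one partition, which is exactly the kind of skew worth checking for before choosing modulus.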

For #1, there are two options:
a) Run the aggregator in sequential mode
b) Run two aggregators (both performing the max() aggregation), the first in parallel and the second in sequential (only useful if the job is running in multiple partitions).
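The two-aggregator pattern in (b) can be sketched in plain Python (a conceptual illustration, not DataStage code; the salary values are made up):

```python
# Plain-Python sketch (not DataStage code) of option (b): each partition
# computes a local max in parallel, then one sequential aggregator reduces
# the partial results to the global max.
salaries = [48000, 72000, 55000, 91000, 63000, 87000]

# Stage 1: "parallel" aggregators -- one local max per partition
num_partitions = 2
partitions = [salaries[i::num_partitions] for i in range(num_partitions)]
partial_maxes = [max(p) for p in partitions]

# Stage 2: sequential aggregator over the (tiny) set of partial results
global_max = max(partial_maxes)
print(global_max)  # 91000, the same answer as max(salaries)
```

The point of the parallel first stage is that the sequential aggregator only ever sees one row per partition, not the whole table.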

For #2, hash is fine but you didn't specify which key column to hash on. Which should it be? (Hint: the select statement gives a big clue!)

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

emp_id will be chosen as the key for hash partitioning for #2, right?
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

Ravi.K wrote:If you want to see what partition Datastage selects when you select Partition as Auto then enable Dump score Environment variable to trace the Partition method.
I have included $APT_DUMP_SCORE in my job, but I didn't see which partition method DataStage used by default.
Karthik
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

If you're allowing DataStage to choose the partitioning for you (that is, partition insertion has not been disabled), then DataStage will use the keys on which you are aggregating as the partitioning keys.

If partition insertion is not disabled, you should see an entry in the score which probably has "Partitioner" as part of the name. You may see this multiple times if the job also has sorts, joins or other stages you would normally partition and/or sort for.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

"if partition insertion is not disabled..."

How should I check this?

$APT_NO_PART_INSERTION = False

Do I need to include this in the job to see which partition method is used by default?
Karthik
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

To see what type of partitioning is being used by the inserted partitioners, examine the job score after setting $APT_DUMP_SCORE=1 and look for the partitioners as I described in an earlier post to this thread. In most cases, DataStage will use hash partitioning when inserting a partitioner.

APT_NO_PART_INSERTION simply controls whether or not partitioners will be added where needed. If set to true or 1, partitioners will not be added. If set to false or 0, partitioners may be added depending upon your job design and the options chosen.

Here's an example job score from a simple RowGen->Sort->Peek with all partitioning set to Auto. Can you spot the partitioner and see what type of partitioning was used?

Code:

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Row_Generator_0)
      eOther(APT_HashPartitioner { key={ value=KEY_1, 
        subArgs={ asc }
      },
  key={ value=KEY_2, 
        subArgs={ asc }
      }
})<>eCollectAny
      op1[2p] (parallel Sort_2)}
ds1: {op1[2p] (parallel Sort_2)
      [pp] eSame=>eCollectAny
      op2[2p] (parallel Peek_1)}
It has 3 operators:
op0[1p] {(sequential Row_Generator_0)
    on nodes (
      node1[op0,p0]
    )}
op1[2p] {(parallel Sort_2)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel Peek_1)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
It runs 5 processes on 2 nodes.
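One way to pick the inserted partitioner out of a score like the one above is a simple text scan. A minimal Python sketch, assuming the score text has been captured from the job log and that inserted partitioners follow the APT_<Type>Partitioner naming seen in the dump:

```python
# Minimal sketch: scan a captured job score for inserted partitioners.
# The score fragment below is condensed from the example in this thread.
import re

score = """
ds0: {op0[1p] (sequential Row_Generator_0)
      eOther(APT_HashPartitioner { key={ value=KEY_1 },
  key={ value=KEY_2 }
})<>eCollectAny
      op1[2p] (parallel Sort_2)}
"""

# Inserted partitioners appear as operators named APT_<Type>Partitioner
partitioners = re.findall(r"APT_(\w+)Partitioner", score)
print(partitioners)  # ['Hash']
```

Here the scan confirms that DataStage inserted a hash partitioner on the input to the sort.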
Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

RowGen = Hash Partitioner
(sequential Row_Generator_0)
eOther(APT_HashPartitioner { key={ value=KEY_1,
subArgs={ asc }
},
key={ value=KEY_2,
subArgs={ asc }
It uses 2 keys.


Sort = Unable to find
})<>eCollectAny
op1[2p] (parallel Sort_2)}
ds1: {op1[2p] (parallel Sort_2)
I don't see any word before <>eCollectAny, so I have no clue about this partitioner.

Peek = Same partitioner
[pp] eSame=>eCollectAny
op2[2p] (parallel Peek_1)}

RowGen = Uses 1 Node
Sort = Uses 2 Nodes
Peek = Uses 2 Nodes
Karthik
karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Post by karthi_gana »

The interesting part I noticed was...

I can see three symbols:

->
means sequential to sequential

=>

<No idea>

<>
means sequential to parallel

(Note: I gathered the meanings of those two symbols from my previous post.)

Are the above meanings correct?

Are there any other symbols used in the score?

Where can I find more information on this?
Karthik
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This kind of information is available by taking the Advanced DataStage class.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sri_vin
Premium Member
Posts: 20
Joined: Wed Aug 25, 2010 10:58 pm

Post by sri_vin »

=> means Parallel to Parallel (SAME).

As Ray said, you will study this in the Advanced DataStage course.