Job not behaving as per expectation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Job not behaving as per expectation

Post by saraswati »

Hi Friends,

I have a Job which is designed as below:
<Oracle_Connector_Stage> -> <Copy_Stage> -> <Aggregator_Stage> -> <Sequential_File>

(1) The job runs on a configuration file with 2 logical nodes.
(2) The Oracle Connector Stage runs sequentially.
(3) The Copy Stage is configured to run in parallel with the Hash partitioning method, with 'A' as the partitioning key. The Force property is set ON, and Preserve Partitioning is set to Set.
(4) The Aggregator Stage has the partitioning method set to Same, since my only grouping key is also 'A'. Preserve Partitioning is set to Clear.
(5) The Sequential File Stage writes to a single file, so it runs in Sequential mode.
(6) I have also added the environment variable $APT_RECORD_COUNTS and set it to True so that I can see the record input/output for each node in which a stage executes.
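For what it's worth, the design above can be sketched as a rough Python analogue (illustration only; this is not DataStage code, and Python's built-in `hash()` stands in for the engine's hash partitioner): sequential read, hash partition on the key, per-partition aggregation, then sequential collection into one output.

```python
def run_job(rows, num_nodes):
    # Hash partitioning on key 'A': rows with equal key values land on the
    # same node (partition).
    partitions = [[] for _ in range(num_nodes)]
    for a, value in rows:
        partitions[hash(a) % num_nodes].append((a, value))

    # Aggregator with Same partitioning: per-node max of 'value' per key.
    results = []
    for part in partitions:
        maxes = {}
        for a, value in part:
            maxes[a] = max(maxes.get(a, value), value)
        results.extend(maxes.items())

    # Sequential File stage: collect all partitions into one output file.
    return sorted(results)
```

Because equal keys always hash to the same partition, the per-partition maxima are already the final answer, which is why the job's output is correct even when the partitions are unbalanced.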

However, when I run the job, I see in the job log that one node of the Copy and Aggregator Stages processed all the data, while the other node did not receive any data.
The data loaded into the Sequential File was correct, but I am not able to comprehend this job behavior.

Please do help me in this.

regards,
saraswati
le thuong
Premium Member
Posts: 76
Joined: Wed Sep 09, 2009 5:21 am

Re: Job not behaving as per expectation

Post by le thuong »

Which values of the partitioning key are coming out of the Oracle stage? Hash partitioning guarantees that 2 records with the same partitioning key value end up in the same node; it does not guarantee that 2 records with different partitioning key values end up in 2 different nodes.
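That guarantee can be illustrated with a toy partitioner (an assumed hash function for demonstration, not DataStage's actual one):

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Toy hash: sum of character codes modulo the partition count.
    # Equal keys always map to the same partition; distinct keys
    # are NOT guaranteed to map to distinct partitions.
    return sum(ord(c) for c in key) % num_partitions

# Equal keys, same partition -- always guaranteed:
#   partition_for("IT", 2) == partition_for("IT", 2)
# Distinct keys can still share a partition ("AB" and "BA" collide here):
#   partition_for("AB", 2) == partition_for("BA", 2)
```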
best regards,
Thuong
saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Post by saraswati »

(1) The Oracle Connector stage has a user defined query which selects three columns : SELECT EMPNAME, DEPTNAME, SALARY FROM EMP
The Oracle Connector Stage has been set to run in Sequential Mode

(2) The next stage, Copy, copies the records returned by the Oracle Connector Stage and passes them to the Aggregator Stage.
The Copy Stage drops the column EMPNAME from the output link.
The Copy Stage is set to run with Hash partitioning, with DEPTNAME as the partitioning key.

(3) The third stage, Aggregator, is set up as follows:
Grouping Key: DEPTNAME
Aggregation Type: Calculation
Column for Calculation: SALARY
Maximum Value Output Column: MAXSALARY
The Partitioning Type is set to Same,
because the Copy Stage is already hash-partitioning the data by DEPTNAME, which is also the grouping key for the Aggregator Stage.
Hence I intend to maintain the partitions with Same partitioning.

(4) The output of the Aggregator Stage is mapped to the Sequential File Stage, a one-to-one mapping. It runs in Sequential mode.

The problem I am seeing is that the Copy and Aggregator Stages are running on one node only for the whole set of data. I discovered this when I set $APT_RECORD_COUNTS to True.

To reiterate, the data is being written correctly; it is the job behavior that bothers me.

Please do help me in this.

regards,
saraswati
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

To pose the question another way:

What values are present in DEPTNAME? Or, how many unique values are present? Your data can only be distributed as well as the values of the keys you are partitioning on.

Try setting the Preserve Partitioning option in your copy stage to Clear and rerun the job.

I assume there are no node constraints on any of your stages...

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Post by saraswati »

Sorry for omitting the values in the table EMP. Here they are:


EMPNAME DEPTNAME SALARY
----------- ------------- --------
HARRY IT 10000
SALLY HR 30000
DAVID IT 5000
MARK HR 7500
JANUARY IT 20000
JULIAN HR 15000

Since there are 2 logical nodes, a hash partition on DEPTNAME should have made 2 partitions with 3 records each and run them on the 2 nodes.
But it is running all 6 records on node 1 and none on node 0, for both the Copy Stage and the Aggregator Stage. :(

Please do help me in this.

regards,
saraswati
saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Post by saraswati »

Hi James,

No there are no node constraints in use in the Job.

regards,
saraswati
saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Post by saraswati »

Another thing: if I set the Copy Stage to Round Robin (or Auto) and the downstream stages to Same, then the records are evenly partitioned across the two logical nodes.
But then the records are split incorrectly, which of course is of no help. :(

Also, if I set the Copy Stage to Round Robin (or Auto) and the Aggregator Stage to Hash on the key DEPTNAME, it causes repartitioning (as it should), but then again it executes on a single node.

I am not getting where I am erring. :(

regards,
saraswati
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Reread this quote from an earlier post:
Hash partitioning guarantees that 2 records with the same partitioning key value end up in the same node; it does not guarantee that 2 records with different partitioning key values end up in 2 different nodes.
Key-based partitioning methods do NOT guarantee even distribution of data across nodes. They only guarantee that like-keyed records will be placed into the same node (partition).

You only have TWO distinct values for DEPTNAME and only TWO nodes in your configuration file. It's quite likely that the two values receive the same result from the hash algorithm and as such go to the same node. This is not unusual with such a small set of test data and a limited distribution of values.
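To put a rough number on "quite likely": if we model each of the two distinct key values as hashing uniformly at random to one of the two partitions (an assumed model, not the real DataStage hash), both land on the same node about half the time. A quick simulation:

```python
import random

def fraction_colliding(trials: int = 100_000) -> float:
    # Each trial: two distinct key values each hash to one of 2 partitions.
    # Count how often both end up on the same node.
    same = sum(random.randrange(2) == random.randrange(2)
               for _ in range(trials))
    return same / trials
```

So with only two key values and two nodes, seeing all the data on one node is roughly a coin flip, not a bug.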

What is important is that your logic is working correctly (you are receiving the correct output results).

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
saraswati
Premium Member
Posts: 149
Joined: Thu Feb 28, 2008 4:25 pm

Post by saraswati »

Thanks, I understood what you meant! Since DEPTNAME has so few distinct values, both of them landed in one partition.

I have one query.
The Configuration file in my Production System has 8 logical nodes.
If there I receive 50,000 records with DEPTNAME='IT' and 50,000 records with DEPTNAME='HR', then the job may process all 100,000 records on one logical node only, which I think will not be a very pleasant scenario.

Can you please suggest what I should do in this situation to ensure that keyed partitioning spreads data like this across as many partitions, and hence as many nodes, as possible?

regards,
saraswati
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

At most, you will process in two partitions of the eight because you have only two distinct values of your partitioning key. The Hash result depends on the number of partitions, so you may see your data use two nodes instead of one when you change the number of partitions.

Look at the business rules implied by your logic: Determine the max salary for a department. How do you do that? Look at all of the values from that department in a single instance of Aggregator (i.e. in a single partition). Ultimately, Aggregator will need to see everything from Dept IT in one partition in order to provide the correct result for Dept IT.

Could you spread across 8 partitions? Yes, by changing your partitioning strategy (add another partitioning key or use round robin or random). However, you would no longer have all of IT (or HR) in the same partition and would then need to add additional logic (i.e. another aggregator running sequentially) to bring the department data back together to still get the correct result. If you're going to be running millions or billions of records with only a few departments, this might be worth the effort. For 50 thousand records, I wouldn't bother.
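The two-Aggregator design described above can be sketched in plain Python (names and the round-robin split are assumed for illustration): a parallel first stage computes a partial max per department within each partition, and a second, sequential stage merges the partials into the final max per department.

```python
def partial_max(partition):
    # First Aggregator (parallel): partial max salary per dept, per partition.
    maxes = {}
    for dept, salary in partition:
        maxes[dept] = max(maxes.get(dept, salary), salary)
    return maxes

def final_max(partials):
    # Second Aggregator (sequential): merge the partial maxima so each
    # department's rows are "brought back together" for the final result.
    merged = {}
    for part in partials:
        for dept, m in part.items():
            merged[dept] = max(merged.get(dept, m), m)
    return merged

# Round-robin splits each department's rows across both partitions:
p0 = [("IT", 10000), ("IT", 5000), ("HR", 15000)]
p1 = [("HR", 30000), ("HR", 7500), ("IT", 20000)]
```

The trade-off is exactly as stated: the first stage parallelizes across all nodes regardless of key skew, at the cost of an extra sequential merge step.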

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
singhald
Participant
Posts: 180
Joined: Tue Aug 23, 2005 2:50 am
Location: Bangalore

Post by singhald »

What aggregation method are you using in the Aggregator Stage?
Regards,
Deepak Singhal
Everything is okay in the end. If it's not okay, then it's not the end.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

I believe he stated he was using the hash method for the Aggregator. As long as the number of distinct values stays low, this is fine. Changing the method will have no effect on the distribution across nodes, though.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.