APT_ParallelSortMergeOperator : Unbalanced input

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

APT_ParallelSortMergeOperator : Unbalanced input

Post by TonyInFrance »

There are quite a few threads on this, but I decided to open a new one rather than jump onto the end of an existing thread, as I've been advised not to.

My DSEE job uses a Join stage to join an Oracle table (the parent) with a dataset that reaches the join through a Remove Duplicates stage. The design is roughly this:

ORACLE PARENT -> JOIN00 (principal)
DATASET -> REMOVE DUPLICATES -> JOIN00

In the Remove Duplicates stage I hash-partition and sort the data on the key used for removing duplicates. I then send the data on with the partitioning cleared, because on the join stage's reference link I have kept the partitioning method set to Entire: every other partitioning method I tried caused loss of data. The join type is inner.

Thus I get the famous warning:
APT_ParallelSortMergeOperator,1: Unbalanced input from partition 1:

What is surprising is that if I copy the job under a new name, the warnings disappear for the first few runs. Sadly I cannot regularly create copies in my production environment, so I need to get rid of these warnings.

What would you guys advise?

Thanks & regards

S. BASU
Tony
BI Consultant - Datastage
samdsx
Premium Member
Posts: 19
Joined: Wed Aug 18, 2010 8:48 pm

Re: APT_ParallelSortMergeOperator : Unbalanced input

Post by samdsx »

If you have enough memory, use a Lookup stage, or repartition depending on the record count, or add message handlers (MsgHandlers) to demote the warning.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

I had lookups initially, which I replaced with joins since my reference dataset is huge: to the tune of 5 million rows.
Tony
BI Consultant - Datastage
battaliou
Participant
Posts: 155
Joined: Mon Feb 24, 2003 7:28 am
Location: London

Post by battaliou »

Is your remove duplicate key the same as your join key?
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

YES
Tony
BI Consultant - Datastage
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Do you partition and sort the data coming from Oracle?

Using entire partitioning with the join or merge stages is inappropriate in most situations. The fact that it appears to resolve an issue with a join operator indicates that you probably did not properly prepare (partition/sort) your data for the join operator.

What do you mean by the term "loss of data"? Remember, using an Inner join means that you will only receive matches on the output...all unmatched records will be dropped. If you are missing matches you know that you should see, then you most likely have incorrectly partitioned and/or sorted your data.

Properly partition and sort the data (using the same columns and options) for both inputs to the join operator.
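
To make this concrete, here is a small Python sketch (illustrative only, nothing to do with the actual engine code) of a per-partition inner join. When both inputs are hash-partitioned on the join key, matching rows land in the same partition and every match survives; when one side is partitioned without regard to the key, matches are silently dropped; and copying the reference side to every partition (what Entire does) also finds all matches, but at the cost of holding the reference data once per partition:

Code: Select all

NUM_PARTS = 4

def hash_partition(rows, key):
    # Rows with equal keys always land in the same partition.
    parts = [[] for _ in range(NUM_PARTS)]
    for row in rows:
        parts[hash(row[key]) % NUM_PARTS].append(row)
    return parts

def round_robin_partition(rows):
    # Spreads rows evenly but ignores the key (an unlucky Auto).
    parts = [[] for _ in range(NUM_PARTS)]
    for i, row in enumerate(rows):
        parts[i % NUM_PARTS].append(row)
    return parts

def entire_partition(rows):
    # Copies the whole input to every partition (what Entire does).
    return [list(rows) for _ in range(NUM_PARTS)]

def per_partition_inner_join(left_parts, right_parts, key):
    # Each partition is joined independently: 0 with 0, 1 with 1, ...
    out = []
    for lp, rp in zip(left_parts, right_parts):
        right_keys = {r[key] for r in rp}
        out.extend(l for l in lp if l[key] in right_keys)
    return out

left = [{"k": f"K{i:03d}"} for i in range(100)]
right = [{"k": f"K{i:03d}"} for i in range(100)]

# Both sides hash-partitioned on the join key: all 100 matches found.
print(len(per_partition_inner_join(
    hash_partition(left, "k"), hash_partition(right, "k"), "k")))
# Left side not partitioned on the key: matches silently lost (~25 of 100).
print(len(per_partition_inner_join(
    round_robin_partition(left), hash_partition(right, "k"), "k")))
# Entire on the right: all matches found again, but the reference data
# is now held NUM_PARTS times over -- which is why it "fixes" the loss.
print(len(per_partition_inner_join(
    round_robin_partition(left), entire_partition(right), "k")))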

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

I admit I was NOT partitioning or sorting the parent data coming in from Oracle. At the beginning I had left both links (left and right of the join) in Auto mode and I was losing data.

By this I mean that yes, I was not seeing data that, despite the inner join, I should have seen. So I changed the partitioning method on the right link from Auto to Entire, and that seemed to work fine. Eventually I went through a document IBM gave us recommending partitioning methods for different stages; for the join operator it says we should hash-partition and sort all links. That's what I did, but unfortunately my data loss problem came back.

I HAD to deploy a version to production this evening, so in order not to take any chances I reverted to the Entire option I had tested last Friday and delivered on schedule, but the warnings started appearing, whereas on Friday they hadn't. I can vouch that these warnings sometimes disappear by themselves without anything changing in my job or data. It's pretty weird really.

As for my join operator: since the data on the right link (the data I was joining on) was coming from the Remove Duplicates stage (in which I had hash-partitioned and sorted on the same key I am joining on), I propagated that partitioning into the join stage by setting the link to Same. However, I was still losing rows.

Lastly, I hash-partitioned and sorted my parent data as well, obviously on the same key, without any success.

I need you guys to understand that I have chosen to use Entire because it seems to be the only method that works without data loss.
Tony
BI Consultant - Datastage
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Are your key columns in the two streams of the same type when you are performing your partition and sorts upon them? If they are of different types (int vs decimal vs varchar--hash doesn't convert between datatypes for you), or the values are different in any way (such as varchar with leading spaces vs trimmed varchar) then partitioning and sorting results may not be identical. Again, it comes back to data preparation ;)
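
As a toy illustration of the Char-vs-VarChar trap (Python's built-in hash stands in for the engine's partitioner here; the mechanics differ, the moral doesn't): a fixed-length Char is padded with spaces, a VarChar is not, so identical-looking values hash to different partitions and the join never sees the two rows together.

Code: Select all

NUM_PARTS = 64

def partition_of(value):
    # Stand-in for a hash partitioner: partition = hash(key) mod N.
    return hash(value) % NUM_PARTS

char_value = "75001     "   # Char(10): padded with spaces to the full length
varchar_value = "75001"     # VarChar(10): stored without padding

# Almost always False: the padded value hashes to a different partition.
print(partition_of(char_value) == partition_of(varchar_value))
# Trimming (or matching the types) restores agreement: True.
print(partition_of(char_value.rstrip()) == partition_of(varchar_value))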

You can add $OSH_PRINT_SCHEMAS to your job, set it to 1 and examine the logs to see what the data types are for your key columns at the inputs and outputs of each operator. Look especially at the output schema of the Oracle operator and compare it to the schema of your dataset.

I understand your point that Entire partitioning appeared to resolve the issue for you, but please understand my point that it is likely an indication of an underlying data preparation issue. If your data is properly prepared prior to partitioning/sorting, you should not need to resort to Entire partitioning and the data overhead it creates in your sorts and join.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

jwiles wrote:Are your key columns in the two streams of the same type when you are performing your partition and sorts upon them? If they are of different types (int vs decimal vs varchar--hash doesn't convert between datatypes for you), or the values are different in any way (such as varchar with leading spaces vs trimmed varchar) then partitioning and sorting results may not be identical. Again, it comes back to data preparation ;)
This might be it. I realize that I'm joining two columns, one of which is a VarChar and the other a Char. Both are the same length, but if the operator is as particular as you say, I'm sure the problem lies there.
Tony
BI Consultant - Datastage
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Apparently my above remedial step doesn't solve the problem.
I changed all the keys so as to have identical datatypes and field sizes.

The warning still appears.
Tony
BI Consultant - Datastage
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

For the data from the dataset: Can you verify that you are partitioning AND sorting the data prior to the remove duplicates stage?

Is the data in your dataset evenly distributed among its partitions, or is there one partition which appears to hold most of the data?

Also, do you know anything about the frequency distribution of the key values? It's possible that the warning is due to the nature of your data: There may be a key value for which there are an unusually large number of rows. The warning is due to the parallel engine attempting to preserve the sort order of a pre-sorted dataset when repartitioning, and can result from many records having the same key value or a large number of consecutive rows having the same hash results. The warning is issued when the operator has read 10000 rows from the same partition while looking for the next rows in the (previous) sort order and is having to buffer rows.
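
For intuition, here is a rough Python sketch of that mechanism (a toy k-way merge, not the actual APT operator): each partition arrives already sorted, the merge preserves global order, and a long run of consecutive rows from a single partition triggers the complaint.

Code: Select all

import heapq

WARN_THRESHOLD = 10000  # the engine warns after 10000 rows from one partition

def sort_merge(partitions):
    # Merge k pre-sorted partitions back into one globally sorted stream.
    heap = []
    for p, rows in enumerate(partitions):
        it = iter(rows)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, p, it))
    last_part, run = None, 0
    while heap:
        key, p, it = heapq.heappop(heap)
        # Count how many consecutive output rows came from one partition.
        run = run + 1 if p == last_part else 1
        last_part = p
        if run == WARN_THRESHOLD:
            print(f"Unbalanced input from partition {p}")
        yield key
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, p, it))

# One partition holds a huge run of identical keys; the others are tiny.
skewed = [sorted([7] * 15000), [1, 2, 3], [4, 5, 6]]
list(sort_merge(skewed))   # prints the warning once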

You can demote the warning to an informational message, as it is just a warning and not a fatal error. You might also consider opening a ticket with your official support provider to see if they have additional recommendations.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
TonyInFrance
Premium Member
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Cheers James. Below are some clarifications:
jwiles wrote:For the data from the dataset: Can you verify that you are partitioning AND sorting the data prior to the remove duplicates stage?
I am partitioning and sorting the data within the Remove Duplicates stage, in its input tab. Then I'm propagating the partitioned data to the join stage, which in turn receives the data with Entire partitioning.
jwiles wrote:Is the data in your dataset evenly distributed among it's partitions, or is there one partition which appears to have much of the data?

Also, do you know anything about the frequency distribution of the key values? It's possible that the warning is due to the nature of your data: There may be a key value for which there are an unusually large number of rows. The warning is due to the parallel engine attempting to preserve the sort order of a pre-sorted dataset when repartitioning, and can result from many records having the same key value or a large number of consecutive rows having the same hash results. The warning is issued when the operator has read 10000 rows from the same partition while looking for the next rows in the (previous) sort order and is having to buffer rows.
My data is far from evenly partitioned. At any given time only one partition out of 20 is used, since I launch my job once per postcode and my source table is partitioned on the postcode field. There are about 20 distinct postcodes. Each time I launch the job I pass the postcode as a parameter, so it selects only the data for that postcode, and that data arrives on a single partition.
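
As a toy illustration of that skew (again Python, not engine code): when every row in a run carries the same value in the column being hashed, or a key that correlates with it, hash partitioning sends the entire run to one partition.

Code: Select all

NUM_PARTS = 20

# Every row in a given run carries the same postcode parameter.
rows = [{"postcode": "75001", "id": i} for i in range(100_000)]

counts = [0] * NUM_PARTS
for row in rows:
    counts[hash(row["postcode"]) % NUM_PARTS] += 1

print(counts)  # one partition holds all 100000 rows; the other 19 hold none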
Tony
BI Consultant - Datastage
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

I'd like to add a few clarifications to my post just above:

Although it's true that my source Oracle table is partitioned by postcode (so 20 partitions for 20 postcodes), I have not been able to use the 'enable partitioned reads' option in my OCI stage, because some of the other tables I inner/outer join to are not partitioned at all. So in some places I use Oracle partitions (when all the tables are partitioned in Oracle), in some I have enabled partitioned reads with Rowid Hash partitioning, and in others I do not use partitioned reads at all.

Are you sure this is the only reason I am getting these warnings?

An update (something I wrote about earlier as well): these warnings do not appear every time. The same job, unchanged, launched around an hour ago threw NO warnings, whereas yesterday it did.
Tony
BI Consultant - Datastage