Remove Duplicate stage returns more than one row

cooperjv · Post by **cooperjv** » Sun Jun 06, 2010 8:05 am

I am running a job on a grid, reading from a database. In a Transformer I split the input into multiple outputs, and in each output stream I pass the key and a subset of fields. I want to remove the duplicates in each stream based on the key (keep last). However, the Remove Duplicates stage returns more than one row. For one of the streams I placed a Sort stage before the Remove Duplicates stage, but I get the same result.

I feel it is the method of partitioning that is causing this problem. In the streams without the Sort stage I have used hash partitioning with sorting. In the stream where I have used the Sort stage, the Sort stage has hash and sort, and the Remove Duplicates stage has 'same'.

Any help will be most welcome.

Thanks

Joseph

ray.wurlod · Post by **ray.wurlod** » Sun Jun 06, 2010 12:45 pm

The Remove Duplicates stage will return as many rows as there are distinct key values in your data stream.

cooperjv · Post by **cooperjv** » Sun Jun 06, 2010 1:44 pm

ray.wurlod wrote: The Remove Duplicates stage will return as many rows as there are distinct key values in your data stream.

There is only one distinct key value in the group

chulett · Post by **chulett** » Sun Jun 06, 2010 4:21 pm

How many nodes? I would assume some kind of issue with your hash partitioning, in spite of the fact that you seem to have that covered.

ray.wurlod · Post by **ray.wurlod** » Sun Jun 06, 2010 5:27 pm

cooperjv wrote: There is only one distinct key value in the group

If that's really true, and you have hash partitioned on that key, then you should get one row out, because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.
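To illustrate the point, here is a minimal Python sketch (not DataStage code) of why the partitioning method matters. It assumes that Remove Duplicates works independently inside each partition. The function names, the `key` and `seq` columns, and the use of `zlib.crc32` as the hash are all hypothetical stand-ins chosen for the example, not what the engine actually uses.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, n_partitions):
    """Send every row with the same key value to the same partition."""
    parts = defaultdict(list)
    for row in rows:
        # crc32 is only a stand-in for whatever hash the engine really uses.
        idx = zlib.crc32(str(row[key]).encode()) % n_partitions
        parts[idx].append(row)
    return parts


def round_robin_partition(rows, n_partitions):
    """Deal rows out in turn, ignoring the key entirely."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts


def remove_duplicates_keep_last(partition, key):
    """Keep the last row seen for each key value within ONE partition."""
    last = {}
    for row in partition:
        last[row[key]] = row  # later rows overwrite earlier ones
    return list(last.values())


def run(parts, key):
    """Apply remove-duplicates per partition, then collect all outputs."""
    out = []
    for idx in sorted(parts):
        out.extend(remove_duplicates_keep_last(parts[idx], key))
    return out


# Twelve rows that all share a single key value, spread over six partitions.
rows = [{"key": "A", "seq": i} for i in range(12)]

hashed = run(hash_partition(rows, "key", 6), "key")
round_robin = run(round_robin_partition(rows, 6), "key")

print(len(hashed))       # one survivor: every row landed in one partition
print(len(round_robin))  # one survivor per partition that received rows
```

With a key-based hash, the single key value lands in one partition and one row comes out. With any partitioning that ignores the key, each partition keeps its own "last" row, which matches the symptom described in this thread.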

cooperjv · Post by **cooperjv** » Sun Jun 06, 2010 5:56 pm

ray.wurlod wrote:
cooperjv wrote: There is only one distinct key value in the group
If that's really true, and you have hash partitioned on that key, then you should get one row out, because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.

As I mentioned, it is a grid environment. I have set $APT_GRID_ENABLE = YES, $APT_GRID_PARTITIONS = 6 and $APT_GRID_COMPUTENODE = 1.

ray.wurlod · Post by **ray.wurlod** » Sun Jun 06, 2010 6:28 pm

That does not matter. When the job is running, the nodes are allocated by the grid management software but the partitioning logic works exactly the same over those nodes. Therefore I stand by my previous post, or there's a bug that no-one else anywhere has reported in how partitioning works in a grid environment.

I'd scrutinise the actual partitioning you have specified in your job design just a tad more carefully, were I in charge of diagnosis.

asorrell · Post by **asorrell** » Sun Jun 06, 2010 7:11 pm

After re-reading your initial problem statement a few times I realized you said that you hash sorted on the key. I'd also recommend hash partitioning on the key in the Sort stage and then using "same" partitioning on the Remove Duplicates (RD) stage.
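As a rough illustration of why the sort belongs in front of the RD stage, here is a small Python sketch (not DataStage code). It assumes that duplicate removal compares adjacent rows only, so it relies on rows with equal keys arriving next to each other within a partition. The function name and the `key` and `seq` columns are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def remove_adjacent_duplicates_keep_last(rows, key):
    """Collapse each run of ADJACENT rows sharing a key to its last row."""
    return [list(group)[-1] for _, group in groupby(rows, key=itemgetter(key))]


# One partition's worth of rows, arriving with the keys interleaved.
unsorted_rows = [
    {"key": "A", "seq": 1},
    {"key": "B", "seq": 2},
    {"key": "A", "seq": 3},
    {"key": "B", "seq": 4},
]

# Without a sort, no two adjacent rows share a key, so nothing is removed.
unsorted_result = remove_adjacent_duplicates_keep_last(unsorted_rows, "key")

# Sorting on the key (then seq) makes equal keys adjacent before dedup.
sorted_rows = sorted(unsorted_rows, key=itemgetter("key", "seq"))
sorted_result = remove_adjacent_duplicates_keep_last(sorted_rows, "key")

print(len(unsorted_result))  # all four rows survive
print(sorted_result)         # one row per key, the last in sort order
```

Using "same" on the downstream stage keeps the rows in the partitions and order the sort produced, so the adjacency that the sort established is not undone by a repartition.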

cooperjv · Post by **cooperjv** » Sun Jun 06, 2010 9:17 pm

asorrell wrote: After re-reading your initial problem statement a few times I realized you said that you hash sorted on the key. I'd also recommend hash partitioning on the key in the Sort stage and then using "same" partitioning on the Remove Duplicates (RD) stage.

Thanks for all the suggestions. I will try it out and post my findings.