Remove Duplicates stage returns more than one row

cooperjv
Premium Member
Posts: 29
Joined: Thu May 13, 2004 3:18 pm

Post by cooperjv »

I am running a job on a grid. It reads from a database, and a Transformer splits the input into multiple outputs; each output stream carries the key and a subset of the fields. I want to remove duplicates in each stream based on the key (keep last). However, the Remove Duplicates stage returns more than one row. For one of the streams I placed a Sort stage before the Remove Duplicates stage, but the result is the same.

I suspect the partitioning method is causing this problem. I have used hash partitioning with sorting both in the streams without the Sort stage and in the stream with it. Where there is a Sort stage, it has hash and sort, and the Remove Duplicates stage has 'Same'.

Any help will be most welcome

Thanks

Joseph
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Remove Duplicates stage will return as many rows as there are distinct key values in your data stream.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
cooperjv
Premium Member
Posts: 29
Joined: Thu May 13, 2004 3:18 pm

Post by cooperjv »

ray.wurlod wrote: The Remove Duplicates stage will return as many rows as there are distinct key values in your data stream.
There is only one distinct key value in the group.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

How many nodes? I would assume some kind of issue with your hash partitioning, in spite of the fact that you seem to have that covered.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

cooperjv wrote: There is only one distinct key value in the group
If that's really true, and you have hash partitioned on that key, then you should get one row out, because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.
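
To illustrate the point about one key landing on one node, here is a minimal Python sketch, not DataStage code. The function names, the `id` and `seq` columns, and the use of CRC32 as the hash are all assumptions made for the illustration; the real stage works on sorted input and the dictionary here is only a stand-in for "keep last".

```python
from zlib import crc32


def hash_partition(rows, key, n):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[crc32(str(row[key]).encode()) % n].append(row)
    return parts


def round_robin_partition(rows, n):
    """Deal rows out in turn, ignoring the key entirely."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def remove_duplicates_keep_last(partition, key):
    """Stand-in for the stage: one row per key, the last one seen."""
    last = {}
    for row in partition:
        last[row[key]] = row
    return list(last.values())


def run(parts, key):
    """Dedupe each partition independently, then collect the output."""
    out = []
    for partition in parts:
        out.extend(remove_duplicates_keep_last(partition, key))
    return out


# Twelve rows that all share a single key value.
rows = [{"id": 42, "seq": i} for i in range(12)]
hashed = run(hash_partition(rows, "id", 6), "id")
spread = run(round_robin_partition(rows, 6), "id")
print(len(hashed), len(spread))  # 1 6
```

With six partitions, the hash-partitioned run yields one row, while the round-robin run yields one row per partition, which is the symptom described in the first post.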
cooperjv
Premium Member
Posts: 29
Joined: Thu May 13, 2004 3:18 pm

Post by cooperjv »

ray.wurlod wrote:
cooperjv wrote: There is only one distinct key value in the group
If that's really true, and you have hash partitioned on that key, then you should get one row out, because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.
As I mentioned, it is a grid environment. I have set:
$APT_GRID_ENABLE = YES
$APT_GRID_PARTITIONS = 6
$APT_GRID_COMPUTENODE = 1
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That does not matter. When the job is running, the nodes are allocated by the grid management software but the partitioning logic works exactly the same over those nodes. Therefore I stand by my previous post, or there's a bug that no-one else anywhere has reported in how partitioning works in a grid environment.

I'd scrutinise the actual partitioning you have specified in your job design just a tad more carefully, were I in charge of diagnosis.
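
One way to picture that scrutiny is a check for key values that turn up in more than one partition. The following is a rough Python sketch only; `keys_split_across_partitions`, `partition_by`, and the `id` and `batch` columns are hypothetical names, and partitioning on the wrong column is just one assumed example of a design mistake, not a diagnosis of this job.

```python
from collections import defaultdict


def keys_split_across_partitions(parts, key):
    """Return {key value: [partition indexes]} for keys seen in 2+ partitions."""
    seen = defaultdict(set)
    for index, partition in enumerate(parts):
        for row in partition:
            seen[row[key]].add(index)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


def partition_by(rows, column, n):
    """Toy partitioner: integer column value modulo the partition count."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[column] % n].append(row)
    return parts


rows = [
    {"id": 42, "batch": 0},
    {"id": 42, "batch": 1},
    {"id": 42, "batch": 2},
    {"id": 7, "batch": 3},
]
# Partitioned on the dedupe key: no key value is split.
good = keys_split_across_partitions(partition_by(rows, "id", 6), "id")
# Partitioned on a different column: key 42 is spread over three partitions.
bad = keys_split_across_partitions(partition_by(rows, "batch", 6), "id")
print(good)  # {}
print(bad)   # {42: [0, 1, 2]}
```

Any key reported by such a check would produce one output row per partition it appears in, whatever the Remove Duplicates settings are.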
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

After re-reading your initial problem statement a few times I realized you said that you hash sorted on the key. I'd also recommend hash partitioning on the key in the sort stage and then use partitioning "same" on the RD stage.
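
As a rough illustration of why the sort matters inside each partition, here is a small Python sketch. It assumes duplicates are removed by comparing adjacent rows, so rows with the same key must sit next to each other; the function name and the `id` and `seq` columns are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_adjacent_keep_last(rows, key):
    """Keep the last row of each run of adjacent rows sharing a key value."""
    return [list(group)[-1] for _, group in groupby(rows, key=itemgetter(key))]


# One partition whose rows for key 42 are not adjacent.
partition = [
    {"id": 42, "seq": 1},
    {"id": 7, "seq": 2},
    {"id": 42, "seq": 3},
]

unsorted_out = dedupe_adjacent_keep_last(partition, "id")
# Python's sort is stable, so arrival order is preserved within each key.
sorted_out = dedupe_adjacent_keep_last(sorted(partition, key=itemgetter("id")), "id")

print(len(unsorted_out))  # 3
print(sorted_out)  # [{'id': 7, 'seq': 2}, {'id': 42, 'seq': 3}]
```

Without the sort, key 42 forms two separate runs and survives twice; with the sort on the key, each key forms a single run and only its last row is kept.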
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
cooperjv
Premium Member
Posts: 29
Joined: Thu May 13, 2004 3:18 pm

Post by cooperjv »

asorrell wrote: After re-reading your initial problem statement a few times I realized you said that you hash sorted on the key. I'd also recommend hash partitioning on the key in the sort stage and then use partitioning "same" on the RD stage.
Thanks for all the suggestions. I will try it out and post my findings.