Remove Duplicate stage returns more than one row
I am running a job on a grid that reads from a database. In a Transformer I split the input into multiple outputs, and each output stream carries the key and a subset of the fields. I want to remove the duplicates in each stream based on the key (keep last). However, the Remove Duplicates stage returns more than one row. For one of the streams I placed a Sort stage before the Remove Duplicates stage, but I get the same result.
I suspect the partitioning method is causing this problem. I have used hash partitioning with sorting both in the streams without a Sort stage and in the stream with one. In the latter, the Sort stage has hash and sort, and the Remove Duplicates stage has 'Same'.
Any help will be most welcome
Thanks
Joseph
cooperjv wrote: There is only one distinct key value in the group
If that's really true, and you have hash partitioned on that key, then you should get one row out, because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.
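To illustrate the point above, here is a minimal Python sketch (not DataStage code) of why the partitioning method decides how many rows come out. The function names, the CRC32 hash, and the sample rows are all assumptions made for the illustration; the per-partition de-duplication is a simplified stand-in for the Remove Duplicates stage, which only ever sees the rows in its own partition.

```python
import zlib

def hash_partition(rows, key, n):
    """Send each row to a partition chosen from a hash of its key value."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # CRC32 is an arbitrary deterministic hash chosen for this sketch.
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts

def round_robin_partition(rows, n):
    """Deal rows out to partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def remove_duplicates_keep_last(partition, key):
    """Keep the last row seen for each key value within ONE partition."""
    last = {}
    for row in partition:
        last[row[key]] = row
    return list(last.values())

def run(parts, key):
    """De-duplicate every partition independently, then collect the output."""
    out = []
    for part in parts:
        out.extend(remove_duplicates_keep_last(part, key))
    return out

# Twelve rows that all share a single key value, spread over six partitions.
rows = [{"key": "A", "seq": i} for i in range(12)]

hashed = run(hash_partition(rows, "key", 6), "key")
round_robin = run(round_robin_partition(rows, 6), "key")

print(len(hashed))       # 1: every row landed in the same partition
print(len(round_robin))  # 6: each partition kept its own "last" row
```

With hash partitioning on the key, the one key value maps to one partition, so a single row survives. With any partitioning that ignores the key, each partition that received rows emits its own survivor.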
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote: If that's really true, and you have hash partitioned on that key, then you should get one row out. Because all rows will be processed by one node, irrespective of how many nodes there are in your configuration. If Monitor shows more than one node processing rows, then your partitioning algorithm needs looking at.
As I had mentioned, it is a grid environment. I have set:
$APT_GRID_ENABLE = YES
$APT_GRID_PARTITIONS = 6
$APT_GRID_COMPUTENODE = 1
That does not matter. When the job is running, the nodes are allocated by the grid management software but the partitioning logic works exactly the same over those nodes. Therefore I stand by my previous post, or there's a bug that no-one else anywhere has reported in how partitioning works in a grid environment.
I'd scrutinise the actual partitioning you have specified in your job design just a tad more carefully, were I in charge of diagnosis.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asorrell wrote: After re-reading your initial problem statement a few times I realized you said that you hash sorted on the key. I'd also recommend hash partitioning on the key in the sort stage and then use partitioning "same" on the RD stage.
Thanks for all the suggestions. I will try it out and post my findings.
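For readers following the thread, the arrangement asorrell recommends (hash partition and sort on the key, then remove duplicates without repartitioning) can be sketched in Python as below. This is an illustration under stated assumptions, not DataStage code: the helper names, the CRC32 hash, and the sample rows are hypothetical, and the de-duplication step compares adjacent rows only, which is why the sort must come first.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def hash_partition(rows, key, n):
    """Route each row to a partition derived from a hash of its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # CRC32 is an arbitrary deterministic hash chosen for this sketch.
        parts[zlib.crc32(str(row[key]).encode()) % n].append(row)
    return parts

def sort_partition(partition, key):
    """Sort one partition on the key. Python's sorted() is stable, so rows
    with equal keys keep their arrival order in this sketch."""
    return sorted(partition, key=itemgetter(key))

def remove_adjacent_duplicates_keep_last(partition, key):
    """Collapse each run of adjacent equal keys to its last row."""
    return [list(group)[-1] for _, group in groupby(partition, key=itemgetter(key))]

rows = [
    {"key": "A", "val": 1},
    {"key": "B", "val": 2},
    {"key": "A", "val": 3},
    {"key": "C", "val": 4},
    {"key": "B", "val": 5},
]

result = []
for part in hash_partition(rows, "key", 6):
    # "Same" partitioning: the rows stay in the partition where they were sorted.
    sorted_part = sort_partition(part, "key")
    result.extend(remove_adjacent_duplicates_keep_last(sorted_part, "key"))

for row in sorted(result, key=itemgetter("key")):
    print(row)
```

Because the rows are not moved between the sort and the de-duplication, every row for a given key is still adjacent to its siblings when duplicates are removed, so exactly one row per key survives.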