DSXchange

Posted: **Wed May 30, 2012 7:58 am**

hi,
I have to implement below logic in datastage(not with custom query)

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1 .

as of now iam following below approach.

Use a copy stage to split the incoming records into two streams. One stream goes to an Aggregator stage that groups the records by key field(s) and counts the number of records in each group and outputs the results to the COUNT field. The output from Aggregator stage is then joined to the other stream using a Join stage on key field(s) and the results are then passed to Transformer stage. In Transformer, you could put a constraint like COUNT>2.

Is there any best way to do this?

Posted: **Wed May 30, 2012 8:11 am**

sorry ,Small correction

Use a copy stage to split the incoming records into two streams. One stream goes to an Aggregator stage that groups the records by key field(s) and counts the number of records in each group and outputs the results to the COUNT field. The output from Aggregator stage is then joined to the other stream using a Join stage on key field(s) and the results are then passed to Transformer stage. In Transformer, you could put a constraint like COUNT>1

Posted: **Wed May 30, 2012 4:38 pm**

Can you think of any other way to do this? If not, what you have is the best (only) way.

Posted: **Thu May 31, 2012 3:56 am**

Had posted an incorrect thought...Deleting it.

Posted: **Thu May 31, 2012 8:55 am**

Your solution is probably the best way for versions of IS prior to 8.5. Make certain that you partition and sort correctly prior to the copy stage.

For 8.5 and above, a sort stage with keyChange enabled, followed by a transformer is the simplest. Set a stage variable to @FALSE when keyChange=1 and LastRowInGroup()=1 (this means the group has only one row), otherwise set it to @TRUE. Then in the constraint for your output link only allow the rows where the stage variable is @TRUE.

Regards,

Posted: **Sun Jun 03, 2012 12:03 am**

Hi jwiles ,

thank for your mail

Is this approach will work on two nodes or four nodes also?

Posted: **Sun Jun 03, 2012 1:26 am**

It is always the case that a solution that works on two nodes will work on four, or 2000. Such is guaranteed by the way that the parallel execution engine works.

Posted: **Sun Jun 03, 2012 2:04 am**

jwiles wrote:Your solution is probably the best way for versions of IS prior to 8.5. Make certain that you partition and sort correctly prior to the copy stage.

For 8.5 and above, a sort stage with keyChange enabled, followed by a transformer is the simplest. Set a stage variable to @FALSE when keyChange=1 and LastRowInGroup()=1 (this means the group has only one row), otherwise set it to @TRUE. Then in the constraint for your output link only allow the rows where the stage variable is @TRUE.

Regards,

@Jwiles

I have below clarification .Could you please clarify

when i gone through the below link

http://publib.boulder.ibm.com/infocente ... eaks2.html

we need to use LastRowInGroup() function with argument but u used without argument[ LastRowInGroup()=1 ].Please explain how it works without argument and correct me if i am wrong?

Posted: **Sun Jun 03, 2012 11:04 pm**

My example was not a literal example of how to use the function, it was an example with the intent of indicating that LastRowInGroup is the function you should use. Use it as the documentation indicates (i.e. with the required arguments).

"function_name()" is a common method of writing the name of a function in a text conversation. Sorry for the confusion...

Regards,

Posted: **Mon Jun 04, 2012 11:03 am**

@jwiles

Thanks for clarifying the doubt and your help is really appreciated.