Help on aggregation logic

pdntsap · Post by **pdntsap** » Sun Nov 20, 2011 5:12 pm

Hello,

We have a requirement to group the input data on about 20 columns. Let the columns be C1, C2, C3 ... C20. After grouping, some column values within each group need to be compared with the last value of Column 20 in that group. An aggregator stage can be used for the grouping, I believe, but I am really lost as to how I can retain the value of Column 20 from the last record in each group and carry it forward for further processing. Any help will be greatly appreciated.

Thanks.

chulett · Post by **chulett** » Sun Nov 20, 2011 6:19 pm

If you are grouping on all twenty columns, then won't each "group" have a single value for each column, including Column 20? Meaning there really won't be a last of several values in that group. Or by "last" do you mean previous as in the value of Column 20 from the previous group? If so, then it seems like stage variables in a following transformer could be leveraged for that task.
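To illustrate the stage-variable idea outside DataStage, here is a minimal Python sketch rather than actual transformer derivations. It assumes the rows arrive already sorted on the group keys, and the names `with_previous_group_value`, `key_cols` and `value_col` are hypothetical, chosen only for this example. The local variables play the role of stage variables: they survive from one row to the next and are updated when the key changes.

```python
def with_previous_group_value(rows, key_cols, value_col):
    """Pair each row with value_col from the last row of the previous group.

    Assumes rows are already sorted on key_cols. Rows in the first group
    are paired with None because there is no previous group.
    """
    prev_key = None          # key of the group currently being read
    prev_group_last = None   # value_col from the last row of the previous group
    last_seen = None         # value_col from the most recently read row

    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key != prev_key:
            # Key change: the row read just before this one closed its group.
            prev_group_last = last_seen
            prev_key = key
        yield row, prev_group_last
        last_seen = row[value_col]


# Tiny example with one key column standing in for the grouping keys.
sample = [
    {"C1": "a", "C20": 1},
    {"C1": "a", "C20": 2},
    {"C1": "b", "C20": 3},
    {"C1": "c", "C20": 4},
]
carried = [prev for _, prev in with_previous_group_value(sample, ["C1"], "C20")]
```

The order of the updates matters: the carried value is refreshed before the current row is emitted, and the current row's own value is recorded only afterwards.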

pdntsap · Post by **pdntsap** » Mon Nov 21, 2011 7:47 am

Yes, Craig. Grouping would produce just one record for each group. So I was joining the output of the aggregator with the original data (the join keys being the group keys) so that I get the original data (grouped according to the 20 columns) along with the count of records in each group.

Going back and looking at the requirements, I may need to rethink my logic. I need to sort the data based on twenty columns. I need to do some processing on the sorted rows and delete the last record from each group (if some columns in the last record satisfy some requirements). Any suggestions on implementing the above logic?

One method might be sorting and then grouping on the 20 keys to get a count of the number of records in each group. Then join the output of the aggregator with the original data to get all the rows of the original data along with the number of rows in each group. Use a transformer stage with a stage variable that keeps a running count of rows; when this stage variable equals the row count for the group, delete the row, reset the stage variable, and repeat the logic for the next group. I have not yet implemented this logic, but am I heading in the right direction?

Thanks.
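The count-and-join approach described above can be sketched in Python to check the logic before building the job. This is only an illustration under stated assumptions, not DataStage code: the rows are held in a list already sorted on the group keys, and `drop_last_in_group`, `key_cols` and `should_drop` are hypothetical names. The `Counter` stands in for the aggregator plus join, and `seen` stands in for the stage variable.

```python
from collections import Counter


def drop_last_in_group(rows, key_cols, should_drop):
    """Remove the last row of each group when should_drop(row) is true.

    Assumes rows is a list already sorted on key_cols.
    """
    def key_of(row):
        return tuple(row[c] for c in key_cols)

    # Aggregator + join equivalent: number of rows per group, looked up by key.
    counts = Counter(key_of(r) for r in rows)

    kept = []
    current_key = None
    seen = 0  # running row count within the current group
    for row in rows:
        key = key_of(row)
        if key != current_key:
            # New group: reset the running count.
            current_key = key
            seen = 0
        seen += 1
        is_last = seen == counts[key]
        if is_last and should_drop(row):
            continue  # delete the last row of this group
        kept.append(row)
    return kept


# Example: drop a group's last row only when its C20 value is 0.
data = [
    {"C1": "a", "C20": 1},
    {"C1": "a", "C20": 0},
    {"C1": "b", "C20": 5},
    {"C1": "b", "C20": 7},
]
result = drop_last_in_group(data, ["C1"], lambda r: r["C20"] == 0)
```

Here the running count is reset on a key change rather than after the delete, so groups whose last row is kept are handled the same way as groups whose last row is dropped.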