Aggregator - Comparison with & without asserting Sort or

vnspn · Post by **vnspn** » Fri Apr 20, 2007 9:56 am

Hi,

I did a sample test and found this, a bit strange result.

I did a comparison run using Aggregator 'without' and 'with' asserting that the incoming rows are sorted. I tested it for about 1 million incoming rows with data being in a sorted order.

I expected that the Job should perform better if we assert to the Aggregator that the data is sorted. But unlikely for me, running the Job with asserting the data to be sorted took a little more time than running it without asserting it as sorted.

Is it justifiable that why when asserting the sort order. it takes more time?

Thanks.

chulett · Post by **chulett** » Fri Apr 20, 2007 11:22 am

No, it should take significantly less time - if done properly. You need to sort in a manner that supports the aggregation being done. Otherwise, your 'sort' time is wasted as the stage cannot utilize the data in the order you've sent it.

You can tell you got an issue if you sort the data, assert that sorted order and the stage is still a 'bottleneck' - all rows flow in before any flow out. You haven't sorted the incoming data properly. Proper sorting will allow output rows to flow from the stage at every 'sort break'.

vnspn · Post by **vnspn** » Fri Apr 20, 2007 1:08 pm

The incoming data is already sorted on the column that I want it to be aggregated. The source is from a flat file.

The column on which it is sorted contains a combination of alpha-numeric values. So it is not in an ascending or descending order that DataStage expects but it is groups based on a particular order.

So, I specified the option 'Ignore' as the Sort Order in the inputs tab of the Aggregator. It works as expected; output rows flow from the stage whenever there is a 'sort break'.

But still it takes more time than running without asserting that it is sorted.

chulett · Post by **chulett** » Fri Apr 20, 2007 1:28 pm

vnspn wrote:So, I specified the option 'Ignore' as the Sort Order in the inputs tab of the Aggregator.

Ignore is a dangerous option. Are you sure you truly have something more complex than can be declared as ascending or descending? Could it not be properly sorted so that it can be? Sure seems like it could be with a single column but you'd have to post examples of your data to confirm.

And are you sure you get identical output in both cases?

DSguru2B · Post by **DSguru2B** » Fri Apr 20, 2007 1:30 pm

vnspn wrote:The incoming data is already sorted on the column that I want it to be aggregated. The source is from a flat file.

Plus, the sorting is to be done and specified on the grouping key and not on columns that are aggregated.

vnspn · Post by **vnspn** » Fri Apr 20, 2007 2:13 pm

Yes, I do get the same identical output in both cases.

Ok, here is a sample of the column values that is in a particular sorted order,

001ZQ
002ZQ
001ZR
001ZT
E4400
E4400

If I specify this as to be 'Ascending' in sort order, then I get an error as DataStage expects the 3rd row to be in the 2nd position. Hope I'm clear here. The rows are in a particular grouped order.

DSguru2B, sorry for making a mistake while typing. It is sorted on the grouping key only.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 20, 2007 3:52 pm

When you assert that the data are sorted, the Aggreagator stage must include a check for "row out of order" - and abort if one is found.

It is probably this check that accounts for the difference.