Sort Stage Problem

snassimr · Post by **snassimr** » Mon Aug 29, 2005 12:58 am

Hi!

I use a Sort stage followed by an Aggregator stage.

Why doesn't the Sort stage pass rows to the Aggregator, even though I defined the sort order in the Aggregator?

How can I make this work? Sort followed by Aggregator won't work serially.

ray.wurlod · Post by **ray.wurlod** » Mon Aug 29, 2005 1:19 am

Pre-sorting data prior to Aggregator is common practice, and helps the Aggregator stage to use memory efficiently (provided it's the grouping columns that are sorted).

Are you claiming that no rows are output from the Sort stage? If so, please create a job that consists of your job up to the Sort stage with its output being directed to a text file. Do any rows get written to the text file (and are they sorted)?

Then insert an Aggregator stage between the Sort stage and the Sequential File stage and run again. What are the link row counts now?

Where are the source data coming from? Would you consider a different sort strategy (for example a UNIX sort command - you CAN do this on Windows, because you have MKS toolkit) or an ORDER BY clause in SQL?

snassimr · Post by **snassimr** » Mon Aug 29, 2005 1:32 am

I tried the same thing with a sequential file.

After all rows had entered the Sort stage (all 100000), the job ended immediately. It seems that Sort actually transfers rows but doesn't show it, and the row count out of the Aggregator stays at 0.

With ORDER BY I get 2500 rows per second. I want more because the source table is very large.

ArndW · Post by **ArndW** » Mon Aug 29, 2005 1:34 am

snassimr,

with a sequential source file you really should pre-sort it at the DOS/UNIX level before the job runs, to increase the effective speed. 2500 rows per second is not necessarily fast; try changing your job so that it doesn't write anything (put a constraint into a transform so that it passes no rows) and see whether the speed goes up.

ray.wurlod · Post by **ray.wurlod** » Mon Aug 29, 2005 2:12 am

snassimr wrote: I tried the same thing with a sequential file.

After all rows had entered the Sort stage (all 100000), the job ended immediately. It seems that Sort actually transfers rows but doesn't show it, and the row count out of the Aggregator stays at 0.

With ORDER BY I get 2500 rows per second. I want more because the source table is very large.

Was the output link row count reported in the job log?
Was the output link row count reported when the Aggregator stage was not used, which I also asked you to do?

The Aggregator stage may not report rows output until it finishes. This is a function of how it works. It is possible that an Aggregator stage will "soak up" all its input rows, building the grouped and aggregated data in memory, before releasing them all with a rush at the end.
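To illustrate the "soak up" behaviour in general terms, here is a minimal Python sketch. It is not DataStage code and makes no claim about how the Aggregator stage is implemented internally; the names `hash_aggregate` and `traced` and the sample rows are hypothetical. It only shows why an aggregation that cannot rely on input order has to read every row before it can release any group.

```python
from collections import defaultdict

def hash_aggregate(rows):
    """Sum `amount` per `key` without assuming any input order."""
    totals = defaultdict(int)
    for key, amount in rows:
        # Any later row might still belong to any group,
        # so nothing can be released yet.
        totals[key] += amount
    # Only after the input is exhausted are the groups released.
    for key, total in totals.items():
        yield key, total

def traced(rows, log):
    """Pass rows through unchanged, recording when each one is read."""
    for row in rows:
        log.append(("in", row))
        yield row

log = []
rows = [("b", 1), ("a", 2), ("b", 3)]  # unsorted sample input
for result in hash_aggregate(traced(rows, log)):
    log.append(("out", result))

for event in log:
    print(event)
```

In the printed trace all "in" events come before the first "out" event, which is the same pattern as an output link showing 0 rows until the stage finishes.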

The actual counts of rows processed should be reported in "active stage finishing" events in the job log.

chulett · Post by **chulett** » Mon Aug 29, 2005 7:21 am

If you are expecting rows to 'flow through' the Sort stage, then you are mistaken in how you think it works. The combination of a Sort followed by an Aggregator will never work 'serially'.

The Sort stage will take all rows from the source and sort them before producing any output. The Aggregator may do something similar unless you've done things right to set it up:

* Presort your data in a manner that supports the aggregation being done.
* Assert that same sort order in the Aggregator stage.

Then - and only then - will the Aggregator pass rows through as the sort group changes. Otherwise, it too will take and process all rows before any output is produced.

Lie to the Aggregator stage about the sort order and it will crash with a 'row out of sequence' error. Get the sort order right but do it in a manner that does not support the aggregation being done and it will still hang on to all rows before it produces any output. In that case you have (in essence) wasted the time spent sorting.
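As a rough illustration of the points above, here is a minimal Python sketch of aggregation over pre-sorted input. It is not DataStage code and is not meant to describe the Aggregator stage's internals; `sorted_aggregate`, `traced`, the sample rows, and the error text are all hypothetical. It assumes rows arrive sorted ascending on the grouping key, releases each group as soon as the key changes, and raises an error if a row arrives out of order.

```python
def sorted_aggregate(rows):
    """Sum `amount` per `key`, assuming rows arrive sorted ascending by key."""
    started = False
    current_key = None
    total = 0
    for key, amount in rows:
        if started and key < current_key:
            # The asserted sort order was not true for this input.
            raise ValueError(f"row out of sequence: {key!r} after {current_key!r}")
        if started and key != current_key:
            # Key changed, so the previous group is complete: release it now.
            yield current_key, total
            total = 0
        current_key = key
        total += amount
        started = True
    if started:
        yield current_key, total  # release the final group

def traced(rows, log):
    """Pass rows through unchanged, recording when each one is read."""
    for row in rows:
        log.append(("in", row))
        yield row

events = []
presorted = [("a", 2), ("b", 1), ("b", 3), ("c", 5)]
for result in sorted_aggregate(traced(presorted, events)):
    events.append(("out", result))

for event in events:
    print(event)
```

Here the "out" events are interleaved with the "in" events, so only one group is held in memory at a time. Feeding the same function unsorted input such as `[("b", 1), ("a", 2)]` raises the `ValueError`, which loosely mirrors the 'row out of sequence' failure described above.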