Sort Stage Problem



snassimr
Premium Member
Posts: 281
Joined: Tue May 17, 2005 5:27 am


Post by snassimr »

Hi!

I use a Sort stage followed by an Aggregator.

Why doesn't the Sort stage transfer rows to the Aggregator, even though I defined the sort order in the Aggregator?

How can I get this to work? The Sort followed by the Aggregator won't work serially.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Pre-sorting data prior to Aggregator is common practice, and helps the Aggregator stage to use memory efficiently (provided it's the grouping columns that are sorted).

Are you claiming that no rows are output from the Sort stage? If so, please create a job that consists of your job up to the Sort stage with its output being directed to a text file. Do any rows get written to the text file (and are they sorted)?

Then insert an Aggregator stage between the Sort stage and the Sequential File stage and run again. What are the link row counts now?

Where are the source data coming from? Would you consider a different sort strategy (for example a UNIX sort command - you CAN do this on Windows, because you have MKS toolkit) or an ORDER BY clause in SQL?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
snassimr
Premium Member
Posts: 281
Joined: Tue May 17, 2005 5:27 am

Post by snassimr »

I tried to do the same thing with a sequential file.

After all rows had entered the Sort stage (all 100000), the job ended immediately. It seems that the Sort stage actually transfers the rows but doesn't show it, and the row count from the Aggregator stays at 0.

With ORDER BY I get 2500 rows per second. I want more because I have a very large source table.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

snassimr,

with a sequential source file you really should pre-sort it at the DOS/UNIX level before the job runs, to increase the effective speed. 2500 rows per second is not necessarily fast; change your job so that it doesn't write anything (put a constraint into a Transformer stage that passes no rows) and see if the speed goes up.
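To illustrate what pre-sorting a sequential file on the grouping column means, here is a minimal Python sketch. It is not DataStage code; the function name `presort_rows`, the comma delimiter, and the sample data are all assumptions made for the example. It also sorts in memory, so for a very large file an external sort (such as the operating system's sort command) is the more realistic choice.

```python
import csv
import io


def presort_rows(lines, key_index, delimiter=","):
    """Return the rows of a delimited text source sorted by one column.

    Hypothetical helper for illustration: everything is held in memory,
    so this only suits files that fit in RAM.
    """
    rows = list(csv.reader(lines, delimiter=delimiter))
    # list.sort is stable, so rows with equal keys keep their input order.
    rows.sort(key=lambda row: row[key_index])
    return rows


# Made-up sample: column 0 is the grouping column.
sample = io.StringIO("b,2\na,1\nb,3\na,4\n")
sorted_rows = presort_rows(sample, key_index=0)
print(sorted_rows)
```

The point is only that the sort key must be the column (or columns) the aggregation groups on; sorting on anything else does not help the downstream stage.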
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

snassimr wrote:
I tried to do the same thing with a sequential file.

After all rows had entered the Sort stage (all 100000), the job ended immediately. It seems that the Sort stage actually transfers the rows but doesn't show it, and the row count from the Aggregator stays at 0.

With ORDER BY I get 2500 rows per second. I want more because I have a very large source table.

Was the output link row count reported in the job log?
Was the output link row count reported when the Aggregator stage was not used, which I also asked you to do?

The Aggregator stage may not report rows output until it finishes. This is a function of how it works. It is possible that an Aggregator stage will "soak up" all its input rows, building the grouped and aggregated data in memory, before releasing them all with a rush at the end.
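The "soak up" behaviour can be pictured with a small Python sketch. This is only a conceptual model, not how the Aggregator stage is actually implemented; the function name `aggregate_unsorted` and the sample rows are assumptions for the example. With no knowledge of input order, a group can never be declared finished until the last row has been read, so no output appears before the end.

```python
def aggregate_unsorted(rows):
    """Sum values per key when the input order is unknown.

    Conceptual model only: every group stays in memory because any
    later row might still belong to it.
    """
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    # Only here, after the whole input is consumed, are the groups complete.
    return list(totals.items())


# Made-up sample rows of (grouping key, value).
result = aggregate_unsorted([("b", 2), ("a", 1), ("b", 3), ("a", 4)])
print(result)
```

While the loop is running, an observer watching the output would see a row count of zero, which matches the symptom described.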

The actual row counts processed should be reported in "active stage finishing" events in the job log.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you are expecting rows to 'flow through' the Sort stage, then you are mistaken in how you think it works. The combination of a Sort followed by an Aggregator will never work 'serially'.

The Sort stage will take all rows from the source and sort them before producing any output. The Aggregator may do something similar unless you've done things right to set it up:

* Presort your data in a manner that supports the aggregation being done.
* Assert that same sort order in the Aggregator stage.

Then - and only then - will the Aggregator pass rows through as the sort group changes. Otherwise, it too will take and process all rows before any output is produced.

Lie to the Aggregator stage about the sort order and it will crash with a 'row out of sequence' error. Get the sort order right but do it in a manner that does not support the aggregation being done and it will still hang on to all rows before it produces any output. In that case you have (in essence) wasted the time spent sorting.
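A rough Python sketch of the idea, for readers who want to see why an asserted sort order lets rows pass through as the group changes. It is a conceptual model and not DataStage code; the name `aggregate_sorted`, the ascending single-key order, and the wording of the error are assumptions for the example.

```python
def aggregate_sorted(rows):
    """Yield (key, total) as soon as the key changes.

    Assumes the input is sorted ascending on the key. If that assertion
    is false, the function raises instead of producing wrong totals.
    """
    current_key = None
    total = 0
    started = False
    for key, value in rows:
        if started and key < current_key:
            # The asserted order was wrong (hypothetical message text).
            raise ValueError(f"row out of sequence: {key!r} after {current_key!r}")
        if started and key != current_key:
            # Key changed, so the previous group is complete: release it now.
            yield current_key, total
            total = 0
        current_key = key
        total += value
        started = True
    if started:
        # Flush the final group once the input is exhausted.
        yield current_key, total


# Made-up sample rows, already sorted on the key.
streamed = list(aggregate_sorted([("a", 1), ("a", 4), ("b", 2), ("b", 3)]))
print(streamed)
```

Only one group is held at a time, which is the memory benefit of pre-sorting on the grouping columns. Feeding it rows such as ("a", 1), ("b", 2), ("a", 4) raises the error after the first group has already been emitted.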
-craig

"You can never have too many knives" -- Logan Nine Fingers