Sort Stage, Stage Collector sort

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

wblack
Premium Member
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Sort Stage, Stage Collector sort

Post by wblack »

I have some general questions regarding the sort stage the the collector sort properties of stages. Is it safe to assume that using the sort stage to sort on a key is equivalent to using the collector stable sort of lets say the transformer stage? What are the main differences/similarities of the two?

Let's assume we are running an 8-node configuration. Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?

From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage. Does this sound about right?

Lets also say we have a join coming out of a DB2 stage that's being joined with another table. Does the join run faster if the data is sorted on the same field? I assume so. This goes back to the original question about using a sort stage verses the collector stable sort of using the DB2 query to do a group by.
William Black
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Re: Sort Stage, Stage Collector sort

Post by qt_ky »

wblack wrote:Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?
No. Unless you go out of your way to force sequential sorting (like using a sort merge collector), then the sorting is done in parallel. Each node sorts its own partition or subgroup of the data set. Therefore, choosing how to partition the data is important for getting correct results.
wblack wrote:From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage.
That should depend on the processing power of your database server vs. your DataStage server and whether or not DB2 can do a parallel sort.
wblack wrote:Does the join run faster if the data is sorted on the same field?
Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted.
Choose a job you love, and you will never have to work a day in your life. - Confucius
wblack
Premium Member
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

You say "A Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted. "

Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?

Is there an environment variable you can use to see if a sort was added by DataStage?
William Black
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yes, add $APT_DUMP_SCORE to the job and set to true.
-craig

"You can never have too many knives" -- Logan Nine Fingers
wblack
Premium Member
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?
William Black
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If there is no indication that the inputs of a Join stage are sorted, then a tsort operator will be inserted. There is no requirement for this to be sequential; each node is managed independently. It is vital, of course, that the data are correctly partitioned (on at least the first join key).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.
Choose a job you love, and you will never have to work a day in your life. - Confucius
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

ray.wurlod wrote:It is vital, of course, that the data are correctly partitioned (on at least the first join key).
Why the first? If the join key is HOUSENM and POSTCODE, why can't it be partitioned on POSTCODE?
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

ray.wurlod wrote:It is vital, of course, that the data are correctly partitioned (on at least the first join key).
Why the first? If the join key is HOUSENM and POSTCODE, why can't it be partitioned on POSTCODE?

Or, it could be partitioned by any column that is a superset of POSTCODE, e.g. POSTTOWN, or POSTCODE_AREA (e.g. the "W12" part of "W12 8QT").
Phil Hibbs | Capgemini
Technical Consultant
wblack
Premium Member
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

qt_ky wrote:Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.
I don't have it in my head that everything needs to run on a single node. I'm working on a job that is extracting a large dataset (12M) and I was was trying to better understand where the bottleneck is happening.

I have an 8-node (4 local, 4 remote) configuration. When my job runs it's worse performance is 8 nodes and as I back the nodes down 7,6,5 the performance improves. When I run 4 local it's the best performance. I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.
William Black
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

wblack wrote:I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.
I think that this is not true, that repartitioning pulls the data back to a single node.
Phil Hibbs | Capgemini
Technical Consultant
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

@Phil:
Yes it can be partitioned on any of the keys but need to make sure that the data is evenly partitioned. For that you can use multiple keys.

You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.

Also, you are right about repartitioning.

@wblack
Its never true that the max number of nodes will give the best result, you need to find out the best suited number of nodes for your server. Also can there be a network delay causing the performance to degrade when using the remote machines?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

priyadarshikunal wrote:...You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.
I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?
Phil Hibbs | Capgemini
Technical Consultant
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

PhilHibbs wrote:I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?
Yeah, forgot that one. :oops:
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
wblack
Premium Member
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

I have a rather elementary question. If a join expects sorted data and the data isn't sorted it adds a tsort. Is this correct? Also, if a sort stage is added before a join where the data isn't sorted will this make a join perform better?
William Black
Post Reply