Page 1 of 2

Sort Stage, Stage Collector sort

Posted: Sat Feb 25, 2012 11:26 am
by wblack
I have some general questions regarding the sort stage the the collector sort properties of stages. Is it safe to assume that using the sort stage to sort on a key is equivalent to using the collector stable sort of lets say the transformer stage? What are the main differences/similarities of the two?

Let's assume we are running an 8-node configuration. Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?

From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage. Does this sound about right?

Lets also say we have a join coming out of a DB2 stage that's being joined with another table. Does the join run faster if the data is sorted on the same field? I assume so. This goes back to the original question about using a sort stage verses the collector stable sort of using the DB2 query to do a group by.

Re: Sort Stage, Stage Collector sort

Posted: Sat Feb 25, 2012 12:27 pm
by qt_ky
wblack wrote:Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?
No. Unless you go out of your way to force sequential sorting (like using a sort merge collector), then the sorting is done in parallel. Each node sorts its own partition or subgroup of the data set. Therefore, choosing how to partition the data is important for getting correct results.
wblack wrote:From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage.
That should depend on the processing power of your database server vs. your DataStage server and whether or not DB2 can do a parallel sort.
wblack wrote:Does the join run faster if the data is sorted on the same field?
Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted.

Posted: Sat Feb 25, 2012 3:18 pm
by wblack
You say "A Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted. "

Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?

Is there an environment variable you can use to see if a sort was added by DataStage?

Posted: Sat Feb 25, 2012 3:49 pm
by chulett
Yes, add $APT_DUMP_SCORE to the job and set to true.

Posted: Sun Feb 26, 2012 12:01 pm
by wblack
Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?

Posted: Sun Feb 26, 2012 1:35 pm
by ray.wurlod
If there is no indication that the inputs of a Join stage are sorted, then a tsort operator will be inserted. There is no requirement for this to be sequential; each node is managed independently. It is vital, of course, that the data are correctly partitioned (on at least the first join key).

Posted: Sun Feb 26, 2012 10:17 pm
by qt_ky
Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.

Posted: Mon Feb 27, 2012 5:36 am
by PhilHibbs
ray.wurlod wrote:It is vital, of course, that the data are correctly partitioned (on at least the first join key).
Why the first? If the join key is HOUSENM and POSTCODE, why can't it be partitioned on POSTCODE?

Posted: Mon Feb 27, 2012 5:38 am
by PhilHibbs
ray.wurlod wrote:It is vital, of course, that the data are correctly partitioned (on at least the first join key).
Why the first? If the join key is HOUSENM and POSTCODE, why can't it be partitioned on POSTCODE?

Or, it could be partitioned by any column that is a superset of POSTCODE, e.g. POSTTOWN, or POSTCODE_AREA (e.g. the "W12" part of "W12 8QT").

Posted: Mon Feb 27, 2012 6:08 am
by wblack
qt_ky wrote:Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.
I don't have it in my head that everything needs to run on a single node. I'm working on a job that is extracting a large dataset (12M) and I was was trying to better understand where the bottleneck is happening.

I have an 8-node (4 local, 4 remote) configuration. When my job runs it's worse performance is 8 nodes and as I back the nodes down 7,6,5 the performance improves. When I run 4 local it's the best performance. I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.

Posted: Mon Feb 27, 2012 8:07 am
by PhilHibbs
wblack wrote:I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.
I think that this is not true, that repartitioning pulls the data back to a single node.

Posted: Mon Feb 27, 2012 8:34 am
by priyadarshikunal
@Phil:
Yes it can be partitioned on any of the keys but need to make sure that the data is evenly partitioned. For that you can use multiple keys.

You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.

Also, you are right about repartitioning.

@wblack
Its never true that the max number of nodes will give the best result, you need to find out the best suited number of nodes for your server. Also can there be a network delay causing the performance to degrade when using the remote machines?

Posted: Mon Feb 27, 2012 8:40 am
by PhilHibbs
priyadarshikunal wrote:...You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.
I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?

Posted: Mon Feb 27, 2012 8:46 am
by priyadarshikunal
PhilHibbs wrote:I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?
Yeah, forgot that one. :oops:

Posted: Mon Feb 27, 2012 6:24 pm
by wblack
I have a rather elementary question. If a join expects sorted data and the data isn't sorted it adds a tsort. Is this correct? Also, if a sort stage is added before a join where the data isn't sorted will this make a join perform better?