Sort Stage, Stage Collector sort
Moderators: chulett, rschirm, roy
Sort Stage, Stage Collector sort
I have some general questions regarding the sort stage the the collector sort properties of stages. Is it safe to assume that using the sort stage to sort on a key is equivalent to using the collector stable sort of lets say the transformer stage? What are the main differences/similarities of the two?
Let's assume we are running an 8-node configuration. Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?
From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage. Does this sound about right?
Lets also say we have a join coming out of a DB2 stage that's being joined with another table. Does the join run faster if the data is sorted on the same field? I assume so. This goes back to the original question about using a sort stage verses the collector stable sort of using the DB2 query to do a group by.
Let's assume we are running an 8-node configuration. Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?
From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage. Does this sound about right?
Lets also say we have a join coming out of a DB2 stage that's being joined with another table. Does the join run faster if the data is sorted on the same field? I assume so. This goes back to the original question about using a sort stage verses the collector stable sort of using the DB2 query to do a group by.
William Black
Re: Sort Stage, Stage Collector sort
No. Unless you go out of your way to force sequential sorting (like using a sort merge collector), then the sorting is done in parallel. Each node sorts its own partition or subgroup of the data set. Therefore, choosing how to partition the data is important for getting correct results.wblack wrote:Is it true that when you perform a sort that all nodes are repartitioned back to a single node the collector and then once sorted the data is sent back out the original 8-node configuration?
That should depend on the processing power of your database server vs. your DataStage server and whether or not DB2 can do a parallel sort.wblack wrote:From my experience if you are pulling from a table that has an index on the field your going to sort on its better to just so an order by in the DB2 stage.
Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted.wblack wrote:Does the join run faster if the data is sorted on the same field?
Choose a job you love, and you will never have to work a day in your life. - Confucius
You say "A Join stage requires sorted inputs. If the join inputs are DB2 ordered outputs, then you can set the properties to tell it don't sort; it's already sorted. "
Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?
Is there an environment variable you can use to see if a sort was added by DataStage?
Ok a join requires sorted input but does that mean if it's not sorted a sort is entered for you? Also does a join require that the sorted input be on a single node? In other words when a join happens does it get repartitioned to a single node?
Is there an environment variable you can use to see if a sort was added by DataStage?
William Black
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If there is no indication that the inputs of a Join stage are sorted, then a tsort operator will be inserted. There is no requirement for this to be sequential; each node is managed independently. It is vital, of course, that the data are correctly partitioned (on at least the first join key).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.
Choose a job you love, and you will never have to work a day in your life. - Confucius
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
Why the first? If the join key is HOUSENM and POSTCODE, why can't it be partitioned on POSTCODE?ray.wurlod wrote:It is vital, of course, that the data are correctly partitioned (on at least the first join key).
Or, it could be partitioned by any column that is a superset of POSTCODE, e.g. POSTTOWN, or POSTCODE_AREA (e.g. the "W12" part of "W12 8QT").
Phil Hibbs | Capgemini
Technical Consultant
Technical Consultant
I don't have it in my head that everything needs to run on a single node. I'm working on a job that is extracting a large dataset (12M) and I was was trying to better understand where the bottleneck is happening.qt_ky wrote:Try to get it out of your head that everything must happen on a single node. Are you seeing something that leads you to believe these things are running on a single node? It is possible to force sequential behavior in a parallel job, but it's not often going to be desirable.
I have an 8-node (4 local, 4 remote) configuration. When my job runs it's worse performance is 8 nodes and as I back the nodes down 7,6,5 the performance improves. When I run 4 local it's the best performance. I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.
William Black
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
I think that this is not true, that repartitioning pulls the data back to a single node.wblack wrote:I was thinking the problem is either with the copying of the datasets across the network or the job is getting repartitioned which pulls the data back to a single node.
Phil Hibbs | Capgemini
Technical Consultant
Technical Consultant
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
@Phil:
Yes it can be partitioned on any of the keys but need to make sure that the data is evenly partitioned. For that you can use multiple keys.
You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.
Also, you are right about repartitioning.
@wblack
Its never true that the max number of nodes will give the best result, you need to find out the best suited number of nodes for your server. Also can there be a network delay causing the performance to degrade when using the remote machines?
Yes it can be partitioned on any of the keys but need to make sure that the data is evenly partitioned. For that you can use multiple keys.
You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.
Also, you are right about repartitioning.
@wblack
Its never true that the max number of nodes will give the best result, you need to find out the best suited number of nodes for your server. Also can there be a network delay causing the performance to degrade when using the remote machines?
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Genius may have its limitations, but stupidity is not thus handicapped.
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?priyadarshikunal wrote:...You also need to see the processing logic, datastage checks for the partitioning based on the keys defined and validates the partition mechanism. and to avoid any repartitioning either set APT_NO_PART_INSERTION or partition with the keys in same order as join. I think.
Phil Hibbs | Capgemini
Technical Consultant
Technical Consultant
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
Yeah, forgot that one.PhilHibbs wrote:I thought that if I just set the partitioning to "Same", that DataStage would not automatically repartition for any reason. Surely it only does that if you let it with "Auto"?
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Genius may have its limitations, but stupidity is not thus handicapped.