SQLSrvr1 SQLSrvr2
| |
| |
SRC_SQLSrvr-->TFM-->Join1------------------->Join2--------------->Dataset
SRC_SQLSrvr -- Auto Partition
TFM -- Auto
Join1 & SQLSrvr1 -- Hash , BK sort
Join2 & SQLSrvr2 -- SAME , BK sort
Dataset -- SAME, BK Sort (Which is going to be the INPUT KEY for next job)
I used 4 node Dataset file.
When I ran this job, it completed without errors, but in Dataset Management I can see data in only one node; the other nodes have no data at all. Can anyone spot what might be wrong?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
Source is SQL Server - AUTO
TFM - AUTO
First Join - HASH, SORT on BK for that join
First REF SQL Server - HASH, SORT on BK for that join
2nd Join - SAME, SORT on BK for that join
2nd REF SQL Server - HASH, SORT on BK for that join
TARGET DATASET - SAME, SORT on BK for the NEXT JOB
Here, "BK for that join" means each join stage has a different BK.
Thanks
Ram
----------------------------------
A key-based partitioning algorithm doesn't guarantee that data will be evenly distributed among the partitions, or that data will be sent to all partitions. It merely guarantees that all records containing the same (identical) value for a key will be sent to the same partition.
Do you have large numbers of records which have the same value for the key?
Is there a reason you are NOT partitioning by Column B for Join 2? It is your primary key for that join, after all.
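To see why this happens, here is a toy sketch (plain Python, not DataStage itself; the key values and node count are made up) showing that a keyed partitioner co-locates equal keys but does nothing to balance the load when the key values are skewed:

```python
# Toy model of key-based (Hash) partitioning. Guarantee: identical key
# values always land on the same node. No guarantee about balance.
import zlib
from collections import Counter

NODES = 4

def hash_partition(key, nodes=NODES):
    # Deterministic stand-in for a hash partitioner: same key -> same node.
    return zlib.crc32(key.encode()) % nodes

# Heavily skewed business keys: 97 of 100 rows share the value "BK1".
rows = [("BK1", i) for i in range(97)] + [("BK2", 97), ("BK3", 98), ("BK4", 99)]

counts = Counter(hash_partition(bk) for bk, _ in rows)
for node in range(NODES):
    print(f"node {node}: {counts.get(node, 0)} rows")
```

With data this skewed, one node receives nearly everything and the others sit almost (or entirely) empty, which is exactly the symptom described in the original post.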
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
One more option could be round-robin partitioning. Of course it degrades performance in this scenario, but it makes sure that all the rows are distributed equally across all nodes. Try this once.
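For contrast, a toy sketch of round-robin partitioning (again plain Python, not DataStage; the row data is invented). It deals rows to nodes in turn, so distribution is even regardless of key values, but note that rows sharing a key can land on different nodes, which is why it is unsuitable immediately before a Join:

```python
# Toy model of round-robin partitioning: deal rows to nodes in turn.
from collections import Counter
from itertools import cycle

NODES = 4
rows = [("BK1", i) for i in range(100)]  # extreme skew: one key value

# Pair each row with the next node in rotation.
assignments = list(zip(cycle(range(NODES)), rows))
counts = Counter(node for node, _ in assignments)
print(counts)  # every node receives the same number of rows
```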
IMP NOTE: If the issue is resolved, please mark the topic as resolved; this helps others and saves time.
Ulab
----------------------------------------------------
Help, it helps you today or tomorrow.
If a downstream job needs the data to be key-partitioned, then having 0 rows in one or more partitions of a dataset is not necessarily a bad situation. It's sometimes the price to pay at a point in the process to meet the business rule/logic requirements. If a downstream job doesn't require a particular partitioning scheme, then you can look at round-robin or random partitioning prior to writing out the dataset. Don't round-robin for the joins (I hope that wasn't your suggestion)--that would go against the requirements for the join stage.
Still, I question the use of SAME partitioning prior to the second join...WHY?? The framework could still be inserting a hash partition on column B at runtime to meet the requirement of the join key, when the correct partitioning should really be reflected in the job design instead.
Regards,
- james wiles
jwiles wrote:Still, I question the use of SAME partitioning prior to the second join...WHY?? The framework could still be inserting a hash partition on column B at runtime to meet the requirement of the join key, when the correct partitioning should really be reflected in the job design instead.
Regards,
Thanks for the reply.
Do you mean I need to use the HASH partition explicitly rather than SAME on Join2?
When I check the DUMP_SCORE, it is internally inserting a HASH.
What is the difference between using SAME vs. explicitly choosing the partitioning?
Thanks
Ram
----------------------------------
Read this documentation about Same partitioning. You should then understand what Same partitioning does and how it differs from other partitioning.
Had partition insertion been disabled in your job or project, your second join would not have produced the results you required because your data would not have been partitioned correctly for that join operation.
Regards,
- james wiles
You CAN use same partitioning, but you don't have to. You could leave it as Auto instead and DataStage would more than likely not repartition the data because it already meets the needs of the downstream stage.
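The difference can be illustrated with another toy sketch (a simplified model, not DataStage internals; node count and keys are assumptions). SAME keeps every row on whatever partition it already occupies, while a keyed partitioner may move rows so that equal keys are co-located:

```python
# Toy contrast: SAME vs. Hash partitioning.
# SAME: no data movement -- rows stay on their current node.
# Hash: rows move so that equal keys always share a node.
import zlib

NODES = 4

def hash_partition(key, nodes=NODES):
    return zlib.crc32(key.encode()) % nodes

# Rows currently spread round-robin: (current_node, key).
rows = [(i % NODES, f"BK{i % 3}") for i in range(12)]

same = [(node, key) for node, key in rows]                   # SAME: unchanged
rehashed = [(hash_partition(key), key) for _, key in rows]   # Hash: by key

# After SAME, rows with key "BK0" can sit on several nodes;
# after Hash, they are all on exactly one node.
print(sorted({n for n, k in same if k == "BK0"}))
print(sorted({n for n, k in rehashed if k == "BK0"}))
```

This is why, if the upstream data is not already partitioned on the join key, SAME can leave a Join with wrong inputs, and why the framework inserts a hash partitioner at runtime when partition insertion is enabled.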
Regards,
- james wiles
This question was already answered earlier in the thread:
Posted: Thu Aug 30, 2012 7:03 am
A key-based partitioning algorithm doesn't guarantee that data will be evenly distributed among the partitions, or that data will be sent to all partitions. It merely guarantees that all records containing the same (identical) value for a key will be sent to the same partition.
Regards,
- james wiles