Hi,
I searched the forum and sifted through the results, but couldn't find the exact answer I was looking for, so I'm hoping to get some discussion going on my question.
I want to see if my thinking is correct here. I have some files being read using a Complex Flat File stage. Each is then loaded into a DataSet for use in a later job. Because that later job will use the Join stage heavily, I wanted to hash partition and sort the data on the join key when writing these initial DataSets.
What I'm looking to confirm is...
If I write 5 different DataSets in one Job, all using a sorted hash partition (sorting on the same key), can I read these DataSets in another job and use 'same' partitioning on a Join stage to bring some of these DataSets back together?
Thanks in advance
Jason
Sorted Hashed Partition in DataSet
-----
Jason Simon
Consultant / Developer
Attevo, Inc.
http://www.attevo.com/
http://www.projectjlm.com/
http://www.elefoo.com/
Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.
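To make the point concrete, here is a minimal Python sketch of the idea, not DataStage code. It assumes both DataSets are hash partitioned on the same key with the same number of partitions. `zlib.crc32` stands in for the engine's hash function, and the names `hash_partition`, `join_same`, `customers`, and `orders` are made up for illustration.

```python
import zlib

def hash_partition(rows, key, n_parts):
    """Place each row in a partition chosen by a hash of its key,
    then sort every partition on that key (a 'sorted hash')."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n_parts
        parts[idx].append(row)
    for part in parts:
        part.sort(key=lambda r: r[key])
    return parts

def join_same(left_parts, right_parts, key):
    """Inner join that only pairs partition i with partition i,
    mimicking 'same' partitioning on the join inputs."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        index = {}
        for r in rp:
            index.setdefault(r[key], []).append(r)
        for l in lp:
            for r in index.get(l[key], []):
                out.append({**l, **r})
    return out

# Hypothetical sample data
customers = [{"id": i, "name": f"cust{i}"} for i in range(1, 9)]
orders = [{"id": i, "amount": 10 * i} for i in (2, 3, 5, 8, 13)]

n = 4
joined = join_same(hash_partition(customers, "id", n),
                   hash_partition(orders, "id", n), "id")
# A single partition on each side behaves like a global join
reference = join_same([customers], [orders], "id")

print(sorted(r["id"] for r in joined))
print(sorted(r["id"] for r in reference))
```

Because equal key values always hash to the same partition number on both sides, the partition-wise join finds the same matches as the single-partition join. That only holds while the key, the hash, and the partition count are identical for every DataSet involved.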
ArndW wrote: Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.
I guess that's really what my question was. If I use a sorted hash, will this avoid the scenario you described? I was under the assumption that it would, but again, I'm here because I'm not confident in that assumption.
Thanks ArndW
If you have the "same" partitioning, then a join will only match rows within like partitions. In most cases the data isn't organized that way, and it will lead to data loss during joins. An easy test is to run the same job with a 1-node and an n-node configuration and compare the outputs; if they are not identical, you have discovered a logic error.
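The 1-node versus n-node check can be sketched in Python as a simulation. This is not DataStage code: `run_job` is a hypothetical stand-in for a job whose join inputs use "same" partitioning, a simple modulus on an integer key stands in for key-based partitioning, and the second input is deliberately spread round robin to represent data that was not partitioned on the join key.

```python
# Hypothetical sample data
CUSTOMERS = [{"id": i} for i in range(1, 9)]
ORDERS = [{"id": i} for i in (2, 3, 5, 8)]

def partition_by_key(rows, key, n_parts):
    """Key-based partitioning: equal keys always land together."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[row[key] % n_parts].append(row)
    return parts

def partition_round_robin(rows, n_parts):
    """Round robin ignores the key, so matching keys can be split up."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(rows):
        parts[i % n_parts].append(row)
    return parts

def run_job(n_parts):
    """Join partition i with partition i only and return the matched ids."""
    left = partition_by_key(CUSTOMERS, "id", n_parts)
    right = partition_round_robin(ORDERS, n_parts)  # the deliberate mistake
    matched = []
    for lp, rp in zip(left, right):
        right_ids = {r["id"] for r in rp}
        matched.extend(l["id"] for l in lp if l["id"] in right_ids)
    return sorted(matched)

one_node = run_job(1)
four_node = run_job(4)
print("1-node:", one_node)
print("4-node:", four_node)
print("identical:", one_node == four_node)
```

With one partition everything sits together, so all four orders match. With four partitions, none of the sample orders land in the partition that holds their customer, and the outputs differ, which is the signal that the partitioning logic is wrong.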
I'll test it and post the results.
Don't forget that the subsequent jobs will need to use a configuration file that is the same as, or compatible with, the configuration file used to write the Data Set. In particular, all the nodes in the writer must be available in the reader.
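As a rough pre-flight check, something like the following Python sketch could compare the node names in two configuration files. It is only an approximation: it assumes node entries appear as `node "name"` and compares names only, ignoring fastname, pools, and resource disks. The two configuration texts are abbreviated, hypothetical examples, and `node_names` and `missing_nodes` are made-up helper names.

```python
import re

# Matches entries of the form: node "node1"
NODE_RE = re.compile(r'\bnode\s+"([^"]+)"')

def node_names(config_text):
    """Return the set of node names declared in a configuration file's text."""
    return set(NODE_RE.findall(config_text))

def missing_nodes(writer_cfg, reader_cfg):
    """Nodes used when writing the Data Set that the reader's config lacks."""
    return sorted(node_names(writer_cfg) - node_names(reader_cfg))

# Abbreviated, hypothetical configurations (real files also list resources)
writer_cfg = '''
{
  node "node1" { fastname "etl_host" pools "" }
  node "node2" { fastname "etl_host" pools "" }
}
'''
reader_cfg = '''
{
  node "node1" { fastname "etl_host" pools "" }
}
'''

print(missing_nodes(writer_cfg, reader_cfg))
```

An empty result means every writer node name is also declared for the reader. A non-empty result lists the nodes the reading job would not find.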
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.