Sorted Hashed Partition in DataSet

JSimon · Post by **JSimon** » Wed Mar 12, 2008 6:18 am

Hi,

After searching, I couldn't find the exact answer I was looking for, so I'm posting here to get some discussion on my question.

I want to see if my thinking is correct here. I have some files being read using a complex flat file stage. These are each then being loaded into a DataSet for use in a later job. Because this job will be utilizing the Join stage heavily, I wanted to perform a sorted hash when writing these initial DataSets.

What I'm looking to confirm is...

If I write 5 different DataSets in one Job, all using a sorted hash partition (sorting on the same key), can I read these DataSets in another job and use 'same' partitioning on a Join stage to bring some of these DataSets back together?

Thanks in advance

Jason

ArndW · Post by **ArndW** » Wed Mar 12, 2008 6:34 am

Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.

JSimon · Post by **JSimon** » Wed Mar 12, 2008 6:37 am

ArndW wrote: Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.

I guess that's really what my question was. If I use a sorted hash, will this avoid the scenario you described? I was under the assumption that it would, but again, I'm here because I'm not confident in that assumption.

Thanks ArndW

ccatania · Post by **ccatania** » Wed Mar 12, 2008 6:39 am

If you read a persistent dataset using SAME partitioning, the
downstream stage runs with the degree of parallelism used
to create the dataset, regardless of the current
$APT_CONFIG_FILE / specified node pool.

I hope this helps.

ArndW · Post by **ArndW** » Wed Mar 12, 2008 6:52 am

With "same" partitioning, a join will only match rows between like partitions. In most cases the data isn't partitioned that way, and the result is data loss during joins. An easy test is to run the same job with a 1-node and an n-node configuration and compare the outputs; if they are not identical, you have discovered a logic error.
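To make the 1-node versus n-node comparison concrete, here is a rough Python simulation. It is not DataStage code: `hash_partition`, `round_robin_partition`, `partitionwise_join` and the sample rows are all made up for illustration. It only models the idea that a join under "same" partitioning sees like-numbered partitions and nothing else.

```python
import zlib


def hash_partition(rows, key, n):
    """Place each row in one of n partitions by a deterministic hash of its key,
    then sort each partition on that key (a stand-in for a sorted hash partition)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    for part in parts:
        part.sort(key=lambda r: r[key])
    return parts


def round_robin_partition(rows, n):
    """Deal rows out in arrival order, ignoring the key entirely."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def partitionwise_join(left_parts, right_parts, key):
    """Inner join that only matches rows held in like-numbered partitions."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for r in right:
            lookup.setdefault(r[key], []).append(r)
        for l in left:
            for r in lookup.get(l[key], []):
                out.append({**l, **r})
    return sorted(out, key=lambda r: r[key])


# Hypothetical sample data: the same 8 keys, arriving in different orders.
customers = [{"id": i, "name": f"cust{i}"} for i in range(1, 9)]
orders = [{"id": i, "total": i * 10} for i in range(8, 0, -1)]

one_node = partitionwise_join(
    hash_partition(customers, "id", 1), hash_partition(orders, "id", 1), "id")
four_node = partitionwise_join(
    hash_partition(customers, "id", 4), hash_partition(orders, "id", 4), "id")
mismatched = partitionwise_join(
    round_robin_partition(customers, 4), round_robin_partition(orders, 4), "id")

print(len(one_node), len(four_node), len(mismatched))
```

When both inputs are hashed on the join key with the same partition count, the 4-node result matches the 1-node result. When the rows are dealt out without regard to the key, matching rows land in different partitions and silently drop out of the join, which is the kind of discrepancy the comparison test is meant to expose.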

JSimon · Post by **JSimon** » Wed Mar 12, 2008 7:25 am

I'll test it and post the results.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 12, 2008 5:35 pm

Don't forget that the subsequent jobs will need to use a configuration file that is the same as, or compatible with, the configuration file used to write the Data Set. In particular, all the nodes used by the writer must be available to the reader.
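As a minimal sketch of that compatibility rule, the check amounts to confirming that the writer's nodes are a subset of the reader's. The node names and the `missing_nodes` helper below are hypothetical, and the snippet does not parse real configuration files; it assumes you have already extracted the node names from each one.

```python
def missing_nodes(writer_nodes, reader_nodes):
    """Return the writer's node names that the reader's configuration lacks."""
    return sorted(set(writer_nodes) - set(reader_nodes))


# Hypothetical node names taken from the two configuration files.
writer = ["node1", "node2", "node3", "node4"]
reader_ok = ["node1", "node2", "node3", "node4", "node5"]
reader_bad = ["node1", "node2"]

print(missing_nodes(writer, reader_ok))   # nothing missing
print(missing_nodes(writer, reader_bad))  # nodes the reader cannot see
```

An empty result means every node that wrote the Data Set is visible to the reading job.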