
Sorted Hashed Partition in DataSet

Posted: Wed Mar 12, 2008 6:18 am
by JSimon
Hi,

After doing a search, I sifted through the results and couldn't come up with the exact answer I was looking for, so here I am seeing if I can get some discourse about my question.

I want to see if my thinking is correct here. I have some files being read using a complex flat file stage. These are each then being loaded into a DataSet for use in a later job. Because this job will be utilizing the Join stage heavily, I wanted to perform a sorted hash when writing these initial DataSets.

What I'm looking to confirm is...

If I write 5 different DataSets in one Job, all using a sorted hash partition (sorting on the same key), can I read these DataSets in another job and use 'same' partitioning on a Join stage to bring some of these DataSets back together?


Thanks in advance ;)

Jason

Posted: Wed Mar 12, 2008 6:34 am
by ArndW
Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.

Posted: Wed Mar 12, 2008 6:37 am
by JSimon
ArndW wrote: Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.
I guess that's really what my question was. If I use a sorted hash, will this avoid the scenario you described? I was under the assumption that it would, but again, I'm here because I'm not confident in that assumption.

Thanks ArndW

Posted: Wed Mar 12, 2008 6:39 am
by ccatania
If you read a persistent dataset using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the dataset, regardless of the current $APT_CONFIG_FILE / specified node pool.

I hope this helps.

Posted: Wed Mar 12, 2008 6:52 am
by ArndW
If you have the "same" partitioning, then a join will only match rows between like-numbered partitions. In most cases the data isn't partitioned that way, and rows whose matching key sits in a different partition will be lost from the join output. An easy test is to run the same job with a 1-node and an n-node configuration and compare the outputs; if they are not identical, you have discovered a logic error.
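
To make the 1-node versus n-node comparison concrete, here is a minimal Python sketch of the idea. It is not DataStage code: the CRC32-modulo hash, the helper names (hash_partition, round_robin_partition, partitionwise_join) and the sample rows are all assumptions made for illustration, and the real partitioner may work differently. It only models a join that, like "same" partitioning, matches rows within like-numbered partitions.

```python
import zlib

def hash_partition(rows, key, n_parts):
    """Place each row in a partition chosen by a deterministic hash of its key,
    then sort each partition on the key (a stand-in for a 'sorted hash')."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n_parts
        parts[idx].append(row)
    return [sorted(p, key=lambda r: r[key]) for p in parts]

def round_robin_partition(rows, n_parts):
    """Spread rows evenly with no regard to the key value."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(rows):
        parts[i % n_parts].append(row)
    return parts

def partitionwise_join(left_parts, right_parts, key):
    """Inner join that only compares rows within the same partition number."""
    out = []
    for left_part, right_part in zip(left_parts, right_parts):
        lookup = {}
        for r in right_part:
            lookup.setdefault(r[key], []).append(r)
        for l in left_part:
            for r in lookup.get(l[key], []):
                out.append({**l, **r})
    return sorted(out, key=lambda r: r[key])

# Hypothetical sample data: two inputs sharing the join key "id".
left = [{"id": i, "name": f"cust{i}"} for i in range(1, 9)]
right = [{"id": i, "amount": i * 10} for i in range(1, 9)]

# Both inputs hash-partitioned on the key: 1 node and 4 nodes agree.
one_node = partitionwise_join(hash_partition(left, "id", 1),
                              hash_partition(right, "id", 1), "id")
four_node = partitionwise_join(hash_partition(left, "id", 4),
                               hash_partition(right, "id", 4), "id")

# One input partitioned differently: matching keys can land in
# different partitions and drop out of the join.
mismatched = partitionwise_join(hash_partition(left, "id", 4),
                                round_robin_partition(right, 4), "id")

print(len(one_node), len(four_node), len(mismatched))
print("identical:", one_node == four_node)
```

When both inputs are hash-partitioned on the join key with the same partition count, the 1-node and 4-node results are identical. When one input is partitioned some other way, the partition-local join returns fewer rows, which is exactly the difference the 1-node versus n-node comparison is meant to expose.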

Posted: Wed Mar 12, 2008 7:25 am
by JSimon
I'll test it and post the results.

Posted: Wed Mar 12, 2008 5:35 pm
by ray.wurlod
Don't forget that the subsequent jobs will need to use a configuration file that is the same as, or compatible with, the configuration file used to write the Data Set. In particular, all the nodes in the writer must be available in the reader.
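
A related point, shown as a rough Python sketch rather than anything DataStage-specific: if two Data Sets were hash-partitioned with different numbers of partitions, the same key will generally not sit in the same partition number in both. The CRC32-modulo scheme and the partition_of helper below are assumptions for illustration only; the actual hash function used by the engine is not something this thread states.

```python
import zlib

def partition_of(key, n_parts):
    """Hypothetical hash-mod-N placement of a key (illustration only)."""
    return zlib.crc32(str(key).encode()) % n_parts

keys = list(range(1, 21))

# Keys whose partition number differs between a 4-partition
# and a 3-partition layout of the same data.
moved = [k for k in keys if partition_of(k, 4) != partition_of(k, 3)]

for k in keys:
    print(k, partition_of(k, 4), partition_of(k, 3))
print("keys placed differently:", len(moved), "of", len(keys))
```

Under this assumed scheme, any key in the moved list would be missed by a join that relies on "same" partitioning across the two layouts, so the writer and reader layouts have to line up.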