Sorted Hashed Partition in DataSet

JSimon
Premium Member
Posts: 47
Joined: Fri Dec 14, 2007 8:47 am

Post by JSimon »

Hi,

After doing a search, I sifted through the results and couldn't come up with the exact answer I was looking for, so here I am seeing if I can get some discourse about my question.

I want to see if my thinking is correct here. I have some files being read using a Complex Flat File stage. Each is then loaded into a DataSet for use in a later job. Because that later job will use the Join stage heavily, I wanted to hash partition and sort on the join key when writing these initial DataSets.

What I'm looking to confirm is...

If I write 5 different DataSets in one Job, all using a sorted hash partition (sorting on the same key), can I read these DataSets in another job and use 'same' partitioning on a Join stage to bring some of these DataSets back together?


Thanks in advance ;)

Jason
-----
Jason Simon
Consultant / Developer
Attevo, Inc.

http://www.attevo.com/
http://www.projectjlm.com/
http://www.elefoo.com/
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.
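To make the condition concrete, here is a minimal Python sketch of the idea, not DataStage code. It assumes a deterministic hash (CRC32, chosen only for illustration; the engine's real hash function is not shown in this thread), the same key column, and the same number of partitions for every Data Set. The row layouts and the names `partition_of` and `hash_partition` are hypothetical. Under those assumptions a given key value always lands in the same partition number, whichever Data Set it is in.

```python
import zlib

def partition_of(key, num_partitions):
    """Map a key to a partition number with a deterministic hash (CRC32, illustrative only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def hash_partition(rows, key_field, num_partitions):
    """Split rows into partitions by key hash, then sort each partition on the key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_of(row[key_field], num_partitions)].append(row)
    for part in partitions:
        part.sort(key=lambda r: r[key_field])
    return partitions

def keys_by_partition(parts, key_field):
    """Return {key value: partition number} for every row in the partitioned data."""
    return {row[key_field]: p for p, part in enumerate(parts) for row in part}

# Two hypothetical "Data Sets" written with the same key and the same partition count.
customers = [{"id": i, "name": f"cust{i}"} for i in range(1, 9)]
orders = [{"id": i, "amount": 10 * i} for i in (2, 2, 5, 7, 8)]

cust_parts = hash_partition(customers, "id", 4)
order_parts = hash_partition(orders, "id", 4)

cust_loc = keys_by_partition(cust_parts, "id")
order_loc = keys_by_partition(order_parts, "id")

# Every order key sits in the same partition number as the matching customer key.
aligned = all(cust_loc[k] == order_loc[k] for k in order_loc)
print(aligned)  # True
```

If either the partition count or the key differs between the two writes, the alignment above is no longer guaranteed.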
JSimon

Post by JSimon »

ArndW wrote: Yes, it will keep the same partitions - but you need to ensure that a join on Group 1 from DataSet 1 doesn't link to a value in Group 2 on another DataSet.
I guess that's really what my question was. If I use a sorted hash, will this avoid the scenario you described? I was under the assumption that it would, but again, I'm here because I'm not confident in that assumption.

Thanks ArndW
ccatania
Premium Member
Posts: 68
Joined: Thu Sep 08, 2005 5:42 am
Location: Raleigh

Post by ccatania »

If you read a persistent dataset using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the dataset, regardless of the current $APT_CONFIG_FILE / specified node pool.

I hope this helps.
ArndW

Post by ArndW »

If you use "Same" partitioning, a join only matches rows between like-numbered partitions; in most cases the data isn't laid out that way, and rows will be lost in the join. An easy test is to run the same job with a 1-node and an n-node configuration and compare the outputs; if they are not identical, you have discovered a logic error.
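The 1-node versus n-node comparison can be simulated outside DataStage. The Python sketch below is only an illustration under stated assumptions: rows are `(key, value)` tuples, CRC32 stands in for the real hash, and `join_same` is a hypothetical helper that joins partition i of one input only with partition i of the other. Hash partitioning on the key reproduces the 1-node result; round-robin partitioning of the same data does not.

```python
import zlib

def hash_part(rows, n):
    """Partition (key, value) rows by a deterministic hash of the key."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[zlib.crc32(str(key).encode("utf-8")) % n].append((key, value))
    return parts

def round_robin_part(rows, n):
    """Partition rows by arrival order, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def join_same(left_parts, right_parts):
    """Inner join that only matches rows within like-numbered partitions."""
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        for lk, lv in lpart:
            for rk, rv in rpart:
                if lk == rk:
                    out.append((lk, lv, rv))
    return sorted(out)

left = [(k, f"L{k}") for k in range(1, 9)]
right = [(k, f"R{k}") for k in range(8, 0, -1)]  # same keys, different arrival order

one_node = join_same(hash_part(left, 1), hash_part(right, 1))
hashed = join_same(hash_part(left, 4), hash_part(right, 4))
rr = join_same(round_robin_part(left, 4), round_robin_part(right, 4))

print(hashed == one_node)  # True: key-based partitioning keeps matching rows together
print(rr == one_node)      # False: matching keys ended up in different partitions
```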
JSimon

Post by JSimon »

I'll test it and post the results.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't forget that the subsequent jobs will need to use a configuration file that is the same as, or compatible with, the configuration file used to write the Data Set. In particular all the nodes in the writer must be available in the reader.
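As a rough illustration of the "all the nodes in the writer must be available in the reader" condition, the Python sketch below compares two lists of node names. It does not parse a real configuration file; the node names and the `missing_nodes` helper are hypothetical, and a full compatibility check may involve more than node names.

```python
def missing_nodes(writer_nodes, reader_nodes):
    """Return the writer node names that the reader configuration does not provide."""
    return sorted(set(writer_nodes) - set(reader_nodes))

# Hypothetical node names taken from the writer's and the readers' configurations.
writer = ["node1", "node2", "node3", "node4"]
reader_ok = ["node1", "node2", "node3", "node4", "node5"]
reader_bad = ["node1", "node2"]

print(missing_nodes(writer, reader_ok))   # []
print(missing_nodes(writer, reader_bad))  # ['node3', 'node4']
```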
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.