Partitioning in Datasets

SwathiCh · Post by **SwathiCh** » Wed Jun 23, 2010 6:59 am

Hi All,

I have a requirement to join two datasets created in two other jobs. One of the jobs runs on a single node and the other runs on multiple nodes.

In both jobs, the data is sorted and hash-partitioned on the key before the dataset is created.

In a third job, I join these two datasets on the same key, with SAME partitioning on the join inputs, and I run that job on multiple nodes.

Will this give the correct join results?

My confusion is that I am using a dataset created on a single node in a multi-node job with SAME partitioning.

Any ideas please?

chulett · Post by **chulett** » Wed Jun 23, 2010 7:02 am

You'd be the only one that knows if the "join results" are correct or not. Mixing partitioned datasets like that shouldn't be an issue as far as I know.

SwathiCh · Post by **SwathiCh** » Wed Jun 23, 2010 7:12 am

Thanks Chulett,

I am testing this job and it seems that I am getting the correct results, but I want to confirm with the experts here.

If I create a hashed dataset on a single node, all the records go into one partition on that node, while in the other dataset the records are spread across multiple nodes. How will DataStage take care of this join?

chulett · Post by **chulett** » Wed Jun 23, 2010 7:16 am

Automatically.

The number of nodes used to create a dataset does not restrict the number of nodes you can use to read it.

priyadarshikunal · Post by **priyadarshikunal** » Wed Jun 23, 2010 7:23 am

A dataset created by a job on a single node is not really partitioned at all, because all the keys are on the same node. By default, DataStage inserts a sort and hash partitioning (for the join) unless you set the environment variable that forces it not to.

In this case, just to be on the safe side, hash-partition the data to match the dataset created on multiple nodes, IMHO.
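
To illustrate why matching the partitioning matters, here is a rough sketch in plain Python. It is only a conceptual model, not DataStage code: `hash_partition`, `join_partitionwise`, the sample rows, and the use of modulo on an integer key in place of a real hash function are all assumptions made for the example. It joins partition i of one input with partition i of the other, first with the single-node data left in one partition and then with it re-hashed across four.

```python
def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen from its key.

    Modulo on an integer key is a stand-in for a real hash function.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key] % num_partitions].append(row)
    return partitions


def join_partitionwise(left_parts, right_parts, key):
    """Inner-join partition i of the left input with partition i of the right only."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        right_by_key = {}
        for row in right:
            right_by_key.setdefault(row[key], []).append(row)
        for row in left:
            for match in right_by_key.get(row[key], []):
                joined.append({**row, **match})
    return joined


NODES = 4  # hypothetical node count for the multi-node job
left_rows = [{"id": i, "name": f"name{i}"} for i in range(8)]
right_rows = [{"id": i, "amount": i * 10} for i in range(8)]

# Dataset written by the single-node job: everything sits in one partition.
# Padded with empty partitions to model using it unchanged on 4 nodes.
left_single = hash_partition(left_rows, "id", 1) + [[] for _ in range(NODES - 1)]
# Dataset written by the multi-node job: keys spread over 4 partitions.
right_multi = hash_partition(right_rows, "id", NODES)

kept_as_is = join_partitionwise(left_single, right_multi, "id")

# Re-hash the single-node data with the same function and partition count.
left_rehashed = hash_partition(left_rows, "id", NODES)
repartitioned = join_partitionwise(left_rehashed, right_multi, "id")

print(len(kept_as_is), len(repartitioned))  # prints "2 8"
```

In this model only the keys that happen to land in the first partition of the multi-node data find a match when the single partition is kept as it is, while re-hashing both inputs the same way recovers all eight rows. Whether a real job behaves like the first or the second case depends on what the engine actually does with the partitioning at run time, so it is worth checking rather than assuming.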

SwathiCh · Post by **SwathiCh** » Wed Jun 23, 2010 7:54 am

Thanks Chulett, Priyadarshi,

So even though the dataset is created on a single node, when it is read in a multi-node job, DataStage automatically inserts the tsort and hash operators internally and repartitions the single-node data to fit the multi-node configuration.

Thanks, Good point.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 23, 2010 5:07 pm

While that is true, it may introduce inefficiencies. For example, is the sort really necessary? (Data Sets preserve sorted order among other things.)
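
As a rough illustration of that point, the plain Python sketch below spreads an already sorted single partition across four partitions and then sorts a partition only if it is actually out of order. It is a conceptual model, not DataStage code: `is_sorted_on`, `redistribute`, `sort_if_needed`, and the sample keys are hypothetical, and modulo on an integer key stands in for a real hash function.

```python
def is_sorted_on(rows, key):
    """Return True when rows are in non-decreasing order of the key."""
    return all(a[key] <= b[key] for a, b in zip(rows, rows[1:]))


def redistribute(rows, key, num_partitions):
    """Send rows to partitions by key, keeping their arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key] % num_partitions].append(row)
    return partitions


def sort_if_needed(rows, key):
    """Sort only when the rows are not already ordered; report whether a sort ran."""
    if is_sorted_on(rows, key):
        return rows, False
    return sorted(rows, key=lambda row: row[key]), True


# One partition, already sorted on the key, as written by the single-node job.
single_partition = [{"id": i} for i in (1, 2, 3, 5, 8, 9, 12, 14)]

parts = redistribute(single_partition, "id", 4)
results = [sort_if_needed(part, "id") for part in parts]
sorts_performed = sum(1 for _, did_sort in results if did_sort)

print(sorts_performed)  # prints 0
```

Here every partition comes out still sorted, because the rows arrive in order from a single source, so an extra sort would be wasted work. Data gathered from several source partitions may not keep its order this way, which is why the real job is worth inspecting before removing or keeping a sort.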