Partitioning in Datasets

Posted: Wed Jun 23, 2010 6:59 am
by SwathiCh
Hi All,

I have a requirement to join two datasets created in two other jobs. One of the jobs runs on a single node and the other runs on multiple nodes.

In both jobs, the data is sorted on the key and hash partitioned on the key before the datasets are created.

In a third job I am joining these two datasets on the same key, with SAME partitioning before the join, and I am running that job on multiple nodes.

Will this give the correct join results?

My confusion is that I am using a dataset created on a single node in a multi-node job with SAME partitioning.

Any ideas please?

Posted: Wed Jun 23, 2010 7:02 am
by chulett
You'd be the only one that knows if the "join results" are correct or not. Mixing partitioned datasets like that shouldn't be an issue as far as I know.

Posted: Wed Jun 23, 2010 7:12 am
by SwathiCh
Thanks Chulett,

I am testing this job and it seems that I am getting the correct results, but I want to confirm with the experts here.

If I create a hashed dataset on a single node, all the records go into one partition on that node, while in the other dataset the records are scattered across multiple nodes. How will DataStage take care of this join?

Posted: Wed Jun 23, 2010 7:16 am
by chulett
Automatically. :wink:

The number of nodes used to create a dataset does not restrict the number of nodes you can use to read it.

Posted: Wed Jun 23, 2010 7:23 am
by priyadarshikunal
A dataset created by a job on a single node will not be partitioned at all, as all the keys will be on the same node. By default DataStage inserts a sort and hash partitioning (for the join) unless you set the environment variable that tells DataStage not to do so.

In this case, just to be on the safe side, hash partition the data to match the dataset created on multiple nodes, IMHO.
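
To illustrate why matching the partitioning matters, here is a minimal Python sketch of a partition-wise join. It is a conceptual model only, not DataStage code: `hash_partition`, `partitionwise_join`, the sample rows, and the use of CRC32 as the hash are all assumptions made for the example. It compares joining a one-partition input against a four-way hashed input as-is with joining after both inputs are hashed the same way.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen by a deterministic hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[idx].append(row)
    return parts


def partitionwise_join(left_parts, right_parts, key):
    """Inner join that only compares rows sitting in the same partition number."""
    out = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        lookup = {}
        for r in right_rows:
            lookup.setdefault(r[key], []).append(r)
        for l in left_rows:
            for r in lookup.get(l[key], []):
                out.append({**l, **r})
    return out


# Hypothetical sample data: eight rows per input, keyed on "id".
left = [{"id": i, "name": f"n{i}"} for i in range(1, 9)]
right = [{"id": i, "amount": i * 10} for i in range(1, 9)]

# Model of the single-node dataset: every row in partition 0, the rest empty.
left_single = [left, [], [], []]
right_multi = hash_partition(right, "id", 4)

# Joining without repartitioning only sees matches that happen to be in partition 0.
mismatched = partitionwise_join(left_single, right_multi, "id")

# Hashing the single-node data the same way lines the keys up again.
left_multi = hash_partition(left, "id", 4)
aligned = partitionwise_join(left_multi, right_multi, "id")

print(len(mismatched), "matches without repartitioning")
print(len(aligned), "matches after hashing both inputs on the key")
```

In this model the aligned join finds every key, while the unaligned one finds only the keys that the hash placed in partition 0 of the multi-node input. Whether a real job is exposed to this depends on what the engine inserts automatically, as described above.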

Posted: Wed Jun 23, 2010 7:54 am
by SwathiCh
Thanks Chulett, Priyadarshi,

So even though the dataset is created on a single node, when it is read in a multi-node job DataStage automatically inserts the tsort and hash operators internally and repartitions the single-node data as the multi-node job requires.

Thanks, Good point.

Posted: Wed Jun 23, 2010 5:07 pm
by ray.wurlod
While that is true, it may introduce inefficiencies. For example, is the sort really necessary? (Data Sets preserve sorted order among other things.)
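
The point about the sort can be sketched in Python. This is a simplified model, not DataStage behaviour: `hash_partition`, the sample keys, and the CRC32 hash are assumptions for illustration. It checks that when one already-sorted stream is split by hash, each resulting partition is still in sorted order, so a second sort would add nothing in that case.

```python
import zlib


def hash_partition(keys, num_partitions):
    """Split keys into partitions by hash, keeping the input order within each."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[zlib.crc32(str(k).encode()) % num_partitions].append(k)
    return parts


# Hypothetical keys, sorted once as they would be when the dataset was written.
sorted_keys = sorted([42, 7, 19, 3, 88, 61, 25, 70, 11, 54])

parts = hash_partition(sorted_keys, 4)

# Each partition is a subsequence of a sorted list, so it is itself sorted.
still_sorted = all(p == sorted(p) for p in parts)
print(still_sorted)
```

This holds here because a single ordered stream is being split. If rows from several sorted partitions were merged into one output partition in arbitrary arrival order, the order inside that partition would no longer be guaranteed in this model.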