Partitioning in Datasets


Moderators: chulett, rschirm, roy

SwathiCh
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Partitioning in Datasets

Post by SwathiCh »

Hi All,

I have a requirement to join two datasets created in two other jobs. One of the jobs runs on a single node and the other runs on multiple nodes.

In both jobs, the data is sorted on the key and hash partitioned on the key before the datasets are created.

In a third job I join these two datasets on the same key, with SAME partitioning on the Join inputs, and I run the job on multiple nodes.

Will it give the correct join results?

My confusion is that I am using a dataset created on a single node in a multi-node job with SAME partitioning.

Any ideas please?
--
Swathi Ch
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You'd be the only one that knows if the "join results" are correct or not. Mixing partitioned datasets like that shouldn't be an issue as far as I know.
-craig

"You can never have too many knives" -- Logan Nine Fingers
SwathiCh
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

Thanks Chulett,

I am testing this job and it seems that I am getting the correct results, but I want to confirm with the experts here.

If I create a hashed dataset on a single node, all the records go into one partition on that node, while in the other dataset the records are scattered across multiple nodes. How will DataStage take care of this join?
--
Swathi Ch
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Automatically. :wink:

The number of nodes used to create a dataset does not restrict the number of nodes you can use to read it.
-craig

"You can never have too many knives" -- Logan Nine Fingers
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

A dataset created by a job on a single node will not be partitioned at all, as all the keys will be on the same node. DataStage by default inserts a sort and hash partitioning (for a Join) unless you set the environment variable that forces DataStage not to do so.

In this case, just to be on the safe side, hash partition the data to match the dataset created on multiple nodes, IMHO.
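
To illustrate why matching the hash partitioning matters, here is a minimal Python sketch. It is a toy model and not DataStage code: the function names, the `key % num_partitions` hash, and the idea that an unpartitioned dataset lands entirely in partition 0 are all assumptions made for illustration. It only shows that a partition-wise join is complete when both inputs are hashed on the key into the same number of partitions.

```python
from collections import defaultdict

def hash_partition(rows, num_partitions):
    """Place each (key, value) row in partition key % num_partitions (toy hash)."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[key % num_partitions].append((key, value))
    return parts

def join_partitionwise(left_parts, right_parts):
    """Inner-join partition i of the left only with partition i of the right."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, lvalue in left:
            for rvalue in lookup.get(key, []):
                out.append((key, lvalue, rvalue))
    return sorted(out)

left = [(k, f"L{k}") for k in range(8)]    # stands in for the single-node dataset
right = [(k, f"R{k}") for k in range(8)]   # stands in for the multi-node dataset

right_parts = hash_partition(right, 4)

# Hypothetical worst case: the single-node data stays in one partition.
mismatched = join_partitionwise([left, [], [], []], right_parts)

# Safe case: re-hash the single-node data to match the other input.
rehashed = join_partitionwise(hash_partition(left, 4), right_parts)

print(len(mismatched), len(rehashed))
```

In this toy model the mismatched layout only matches the keys that happen to hash to partition 0, while the re-hashed layout matches every key.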
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
SwathiCh
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

Thanks Chulett, Priyadarshi,

So even though the dataset is created on a single node, when it is read in a multi-node job DataStage automatically inserts the tsort and hash operators internally and repartitions the single-node data as the multi-node job requires.

Thanks, Good point.
--
Swathi Ch
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

While that is true, it may introduce inefficiencies. For example, is the sort really necessary? (Data Sets preserve sorted order among other things.)
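
As a rough illustration of the point about the sort, here is a small Python sketch. It is not how DataStage works internally; `is_sorted_on_key`, `prepare` and `merge_join` are hypothetical names. It shows a merge-style join that relies on sorted input and re-sorts only when the input is not already in key order.

```python
def is_sorted_on_key(rows):
    """True if the (key, value) rows are already in ascending key order."""
    return all(a[0] <= b[0] for a, b in zip(rows, rows[1:]))

def prepare(rows):
    """Re-sort only when needed; already-sorted input is passed through as-is."""
    return rows if is_sorted_on_key(rows) else sorted(rows, key=lambda r: r[0])

def merge_join(left, right):
    """Inner join of two key-sorted lists of (key, value), duplicates allowed."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rvalue in right[j:j_end]:
                    out.append((lk, left[i][1], rvalue))
                i += 1
            j = j_end
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]      # already sorted
right = [(4, "w"), (2, "x"), (2, "y"), (3, "z")]     # not sorted

left_ready = prepare(left)     # passed through, no sort performed
right_ready = prepare(right)   # sorted because it was out of order
joined = merge_join(left_ready, right_ready)
print(joined)
```

The sorted input is handed to the join untouched, which is the saving being pointed at: if the Data Set already carries the order, a second sort is wasted work.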
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.