Dataset on 2 nodes read with 4 nodes

deepthi
Participant
Posts: 56
Joined: Thu Apr 28, 2005 9:52 am

Post by deepthi »

Hi
A dataset is generated on 2 nodes. Can I read the same dataset using 4 nodes? What kind of repartitioning would I have to implement?

How can I view the dataset that is created?

Thanks
Deepthi
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The dataset will be read and repartitioned according to whatever method you declare in your job.

[edit]
I was curious as to what would happen so I tested it.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... you're saying a dataset created on two nodes can be read under a four node configuration? Interesting... honestly didn't think that would work.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Yes, it works. What I was curious about was going the other way and how DS would handle it. I did a quickie 2-node write of 10K records, then did a 4-node read and it did round-robin of 25K records per node.
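
The even split reported here is what a plain round-robin deal would produce. Below is a minimal sketch of that arithmetic, assuming 100K records in total (the figure ArndW corrects to later in the thread) and 4 reader nodes. The function name is made up for illustration, and this is a toy model, not DataStage's actual partitioner.

```python
from collections import Counter
from itertools import cycle


def round_robin_counts(total_records: int, reader_nodes: int) -> Counter:
    """Deal records to reader nodes one at a time, in rotation."""
    counts = Counter()
    targets = cycle(range(reader_nodes))
    for _ in range(total_records):
        counts[next(targets)] += 1
    return counts


# Assumed figures: 100,000 records written on 2 nodes, read on 4 nodes.
counts = round_robin_counts(100_000, 4)
for node in sorted(counts):
    print(f"node {node}: {counts[node]} records")
```

In this model the number of partitions the data was written with does not affect the result; only the total record count and the number of reader nodes do.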
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Arnd,

I know it is possible to read a 2-node dataset with 4 nodes, but there may be issues doing the reverse.

But can you explain how 10K records written on 2 nodes became 25K per node on 4?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Oops, I meant 100K records in my original post (I won't edit it).

The other direction, reading a 4-node file with a 2-node configuration, also works correctly. I switched my read repartitioning to random, and a 4-node file with 25K records per node was read on 2 nodes with 50008 and 49992 records respectively.

The score shows how the repartitioning got done:

main_program: This step has 4 datasets:
ds0: {/tmp/aw.ds
      eAny=>eCollectAny
      op0[4p] (parallel input repartition(0))}
ds1: {op0[4p] (parallel input repartition(0))
      eAny#>eCollectAny
      op1[2p] (parallel Data_Set_11)}
ds2: {op1[2p] (parallel Data_Set_11)
      eRandom#>eCollectAny
      op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)}
ds3: {op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
      eAny=>eCollectAny
      op3[2p] (parallel Peek_13)}
It has 4 operators:
op0[4p] {(parallel input repartition(0))
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
      node1[op0,p2]
      node1[op0,p3]
    )}
op1[2p] {(parallel Data_Set_11)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
op3[2p] {(parallel Peek_13)
    on nodes (
      node1[op3,p0]
      node2[op3,p1]
    )}
It runs 10 processes on 2 nodes.
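
As a cross-check on the score above, the process count on its last line matches the sum of the per-operator partition counts (4 + 2 + 2 + 2). A small sketch that pulls those counts out of the operator lines follows. The regular expression and function name are illustrative assumptions based only on the layout of this one dump, not on any documented score format.

```python
import re

# Operator header lines copied from the score above.
SCORE_OPERATORS = """\
op0[4p] {(parallel input repartition(0))
op1[2p] {(parallel Data_Set_11)
op2[2p] {(parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
op3[2p] {(parallel Peek_13)
"""

# Matches "opN[Kp] {" at the start of a line: operator index N with K partitions.
OPERATOR_RE = re.compile(r"^op(\d+)\[(\d+)p\] \{", re.MULTILINE)


def partitions_per_operator(score_text: str) -> dict[int, int]:
    """Map each operator index to the partition count shown in its header."""
    return {int(op): int(parts) for op, parts in OPERATOR_RE.findall(score_text)}


parts = partitions_per_operator(SCORE_OPERATORS)
print(parts)                # {0: 4, 1: 2, 2: 2, 3: 2}
print(sum(parts.values()))  # 10
```

The 4-partition operator is the one that reads the file as it was written; everything downstream of it runs with the 2 partitions of the current configuration.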
deepthi
Participant
Posts: 56
Joined: Thu Apr 28, 2005 9:52 am

Post by deepthi »

Thank you all for your inputs.

As Sainath mentioned, if we go from 4 nodes to 2 nodes, is the issue that may arise an irregular size of partitions?

Another question: how can I view datasets using a command?

Thanks
Deepthi
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

DataStage will automagically repartition from any number of source partitions to any number of runtime nodes. You can use the UNIX "orchadmin" command to view partitioning information from the command line.
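
For the command-line question, here is a hedged sketch of how the calls might be assembled. It assumes the "describe" and "dump" subcommands of orchadmin, that the parallel engine environment is already set up (including APT_CONFIG_FILE), and it reuses the /tmp/aw.ds path from the score earlier in the thread. The helper names are made up, and the available options differ between versions, so check the usage text on your own install.

```python
import os
import shutil
import subprocess


def orchadmin_command(action: str, descriptor: str) -> list[str]:
    """Build an orchadmin command line; only 'describe' and 'dump' are handled."""
    if action not in ("describe", "dump"):
        raise ValueError(f"unsupported action: {action}")
    return ["orchadmin", action, descriptor]


def run_if_available(cmd: list[str]) -> None:
    """Run the command only when orchadmin and APT_CONFIG_FILE are both present."""
    if shutil.which(cmd[0]) is None or "APT_CONFIG_FILE" not in os.environ:
        print("would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)


# Assumed usage: "describe" for dataset metadata, "dump" to print the records.
run_if_available(orchadmin_command("describe", "/tmp/aw.ds"))
run_if_available(orchadmin_command("dump", "/tmp/aw.ds"))
```

On a machine without the parallel engine this only prints the two command lines, which is enough to copy them into a shell on the DataStage server.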
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Yes, it works. The formal way is to clear the partitioning at the input to avoid strange problems, uneven partitioning, performance problems (most likely), and so on. But it works without that too.

I created a dataset of 20 million records on 2 nodes and tried to read it on 1 node. I waited an hour for the magic to happen, but no luck: not a single record passed through the output link. I then created the dataset on one node, and the job took 3 minutes to complete.

Then I created the dataset on 2 nodes again, set clear partitioning on the input, and ran the job. This time it worked, taking 4 minutes, but I am still not sure it will work every time.

Or the first approach may work on a second try, but I can't test that now as I am not at a DataStage box.

It may also be down to single-node versus multi-node, but I am not sure. In fact, I am confused after looking at my test results as well as Arnd's, and need more time to analyze them.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Priyadarshi,

That is the problem I was referring to with preserving the partitioning, because of which you cannot read a dataset written on more nodes in a configuration with fewer nodes.

Thanks for putting it in practical terms.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Data Set descriptor file includes information about the configuration with which it was written. It is this information that is used when reading the file (though orchadmin dump has an option not to do this).

It gets even more interesting in grid configurations, but they have figured out a way to manage it even when the configuration is dynamic, essentially by using the Data Set's own configuration as a "read only" set of nodes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

But I guess this only works when the resource disk is common to the 2-node and 4-node configuration files. :roll: