Dataset on 2 nodes read with 4 nodes

deepthi · Post by **deepthi** » Thu Jul 02, 2009 8:07 am

Hi
A dataset is generated on 2 nodes. Can I read the same dataset using 4 nodes? What kind of repartitioning would I have to implement?

How can I view the dataset that is created?

Thanks
Deepthi

chulett · Post by **chulett** » Thu Jul 02, 2009 8:17 am

ArndW · Post by **ArndW** » Thu Jul 02, 2009 8:17 am

The dataset will be read and repartitioned according to whatever method you declare in your job.

[edit]
I was curious as to what would happen so I tested it.

chulett · Post by **chulett** » Thu Jul 02, 2009 8:18 am

So... you're saying a dataset created on two nodes can be read under a four node configuration? Interesting... honestly didn't think that would work.

ArndW · Post by **ArndW** » Thu Jul 02, 2009 8:22 am

Yes, it works. What I was curious about was going the other way and how DS would handle it. I did a quickie 2-node write of 10K records, then did a 4-node read and it did round-robin of 25K records per node.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Thu Jul 02, 2009 8:25 am

Arnd,

I know it is possible to read a 2-node dataset with 4 nodes, but there may be issues going the other way.

But can you explain how 10K records on 2 nodes became 25K per node on 4?

ArndW · Post by **ArndW** » Thu Jul 02, 2009 8:32 am

Oops, I meant 100K records in my original post (I won't edit it).

The other direction, reading a 4 node file with a 2 node configuration also works correctly. I switched my read repartitioning to random and have a 4-node file with 25K records per node read with 2 nodes @ 50008 and 49992 records respectively.

The score shows how the repartitioning got done:

main_program: This step has 4 datasets:
ds0: {/tmp/aw.ds
      eAny=>eCollectAny
      op0[4p] (parallel input repartition(0))}
ds1: {op0[4p] (parallel input repartition(0))
      eAny#>eCollectAny
      op1[2p] (parallel Data_Set_11)}
ds2: {op1[2p] (parallel Data_Set_11)
      eRandom#>eCollectAny
      op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)}
ds3: {op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
      eAny=>eCollectAny
      op3[2p] (parallel Peek_13)}
It has 4 operators:
op0[4p] {(parallel input repartition(0))
    on nodes (
      node1[op0,p0]
      node2[op0,p1]
      node1[op0,p2]
      node1[op0,p3]
    )}
op1[2p] {(parallel Data_Set_11)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
op2[2p] {(parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
    )}
op3[2p] {(parallel Peek_13)
    on nodes (
      node1[op3,p0]
      node2[op3,p1]
    )}
It runs 10 processes on 2 nodes.
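
The record counts reported in this post can be sanity-checked with a small simulation. This is only a sketch of the arithmetic, not of how the parallel engine works internally: the function names are made up for illustration, and it assumes the 2-node write left 50K records in each partition, which the thread does not state.

```python
import random


def round_robin_counts(source_counts, n_targets):
    """Deal each source partition's records out to the targets in turn."""
    targets = [0] * n_targets
    for count in source_counts:
        base, extra = divmod(count, n_targets)
        for t in range(n_targets):
            # The first `extra` targets receive one leftover record each.
            targets[t] += base + (1 if t < extra else 0)
    return targets


def random_counts(source_counts, n_targets, seed=0):
    """Send every record to a uniformly chosen target partition."""
    rng = random.Random(seed)
    targets = [0] * n_targets
    for count in source_counts:
        for _ in range(count):
            targets[rng.randrange(n_targets)] += 1
    return targets


if __name__ == "__main__":
    # 100K records written on 2 nodes (assumed 50K each), read on 4 nodes.
    print(round_robin_counts([50_000, 50_000], 4))
    # 100K records in a 4-node file (25K each), read randomly on 2 nodes.
    print(random_counts([25_000] * 4, 2))
```

Round robin gives exactly 25K per node on the 4-node read, while the random case lands near, but not exactly on, 50K per node, which is the same pattern as the 50008 and 49992 reported above.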

deepthi · Post by **deepthi** » Thu Jul 02, 2009 8:37 am

Thank you all for your inputs.

As mentioned by Sainath, if we go from 4 nodes to 2 nodes, is the issue that may arise an irregular size of partitions?

My other question is: how can I view datasets using a command?

Thanks
Deepthi

ArndW · Post by **ArndW** » Thu Jul 02, 2009 8:46 am

DataStage will automagically repartition from any number of source partitions to any number of runtime nodes. You can use the UNIX "orchadmin" command to view partitioning information from the command line.
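
As a rough sketch of what calling orchadmin could look like from a script: the wrapper below only builds the command line for the `describe` and `dump` subcommands and runs it if `orchadmin` is found on the PATH. The helper names are hypothetical, `/tmp/aw.ds` is simply the descriptor path from the score above, and no option flags are included because those should be checked against the orchadmin documentation for your release. The engine environment (including `APT_CONFIG_FILE`) typically has to be set up before orchadmin will run.

```python
import shutil
import subprocess


def orchadmin_argv(subcommand, descriptor):
    """Build the argument list for an orchadmin call (no option flags added)."""
    if subcommand not in ("describe", "dump"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    return ["orchadmin", subcommand, descriptor]


def run_orchadmin(subcommand, descriptor):
    """Run orchadmin if it is on PATH; return its stdout, or None if absent."""
    if shutil.which("orchadmin") is None:
        return None  # not on an engine host, or the environment is not sourced
    result = subprocess.run(
        orchadmin_argv(subcommand, descriptor),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the commands that would be run for the dataset in the score.
    for sub in ("describe", "dump"):
        print(" ".join(orchadmin_argv(sub, "/tmp/aw.ds")))
```

Here `describe` is the one to reach for when you want information about the dataset itself, and `dump` when you want to see the records.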

priyadarshikunal · Post by **priyadarshikunal** » Thu Jul 02, 2009 9:17 am

Yes, it works, and the formal way is to clear the partitioning at the input to avoid any strange problems, uneven partitioning, performance problems (most likely), etc. But it works without that too.

I ran a job with 20 million records on 2 nodes, then tried to read the dataset on 1 node and waited 1 hour for the magic to happen, but no luck. Not a single record passed through the output link. I then created the dataset on one node and the job took 3 minutes to complete.

Then I created the dataset on 2 nodes once again, set clear partitioning on the input and ran the job. This time it worked, taking 4 minutes, but I am still not sure that it will work every time.

The first approach might also work on a second attempt, but I can't test that now as I am not on a DataStage box.

It may also be down to single node versus multi-node, but I am not sure. In fact, I am confused after looking at my own as well as Arnd's test results, and need more time to analyze them.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Thu Jul 02, 2009 9:23 am

Priyadarshi,

That was the problem I was trying to describe: with preserved partitioning, you cannot run a dataset written on more nodes in a configuration with fewer nodes.

Thanks for putting it in practical terms.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 02, 2009 5:35 pm

The Data Set descriptor file includes information about the configuration with which it was written. It is this information that is used when reading the file (though orchadmin dump has an option not to do this).

It gets even more interesting in grid configurations, but they have figured out a way to manage it even when the configuration is dynamic, essentially by using the Data Set's own configuration as a "read only" set of nodes.

balajisr · Post by **balajisr** » Thu Jul 02, 2009 9:34 pm

But I guess this only works when the resource disk is common to both the 2-node and 4-node configuration files.
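
One way to check that guess before running a job is to compare the `resource disk` entries of the two configuration files. The sketch below is an assumption-laden illustration: the helper names, the host name `etlhost` and the disk path `/data/datasets` are all hypothetical, and it only compares path strings, so it says nothing about whether the disks are actually mounted on every host.

```python
import re

# Matches lines such as: resource disk "/data/datasets" {pools ""}
# ("resource scratchdisk" lines do not match this pattern).
RESOURCE_DISK = re.compile(r'resource\s+disk\s+"([^"]+)"')


def make_config(node_names, disk, host="etlhost"):
    """Build a hypothetical configuration file text for the given nodes."""
    blocks = []
    for name in node_names:
        blocks.append(
            f'  node "{name}" {{\n'
            f'    fastname "{host}"\n'
            f'    pools ""\n'
            f'    resource disk "{disk}" {{pools ""}}\n'
            f'    resource scratchdisk "/tmp" {{pools ""}}\n'
            f'  }}'
        )
    return "{\n" + "\n".join(blocks) + "\n}\n"


def resource_disks(config_text):
    """Return the set of resource disk paths named in a configuration."""
    return set(RESOURCE_DISK.findall(config_text))


def missing_disks(write_config, read_config):
    """Disks used when writing that the reading configuration does not list."""
    return resource_disks(write_config) - resource_disks(read_config)


if __name__ == "__main__":
    two_node = make_config(["node1", "node2"], "/data/datasets")
    four_node = make_config(["node1", "node2", "node3", "node4"], "/data/datasets")
    print(missing_disks(two_node, four_node))  # empty set: disks are shared
```

An empty result means every disk path named in the writing configuration also appears in the reading one. A non-empty result lists the paths worth checking first.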