Dataset on 2 nodes read with 4 nodes
Hi
A dataset is generated on 2 nodes. Can I read the same dataset using 4 nodes? What kind of repartitioning would I have to implement?
How can I view the dataset that is created?
Thanks
Deepthi
Oops, I meant 100K records in my original post (I won't edit it).
The other direction, reading a 4-node file with a 2-node configuration, also works correctly. I switched my read repartitioning to random: a 4-node file with 25K records per node, read with 2 nodes, gives 50008 and 49992 records respectively.
The score shows how the repartitioning was done:
main_program: This step has 4 datasets:
ds0: {/tmp/aw.ds
eAny=>eCollectAny
op0[4p] (parallel input repartition(0))}
ds1: {op0[4p] (parallel input repartition(0))
eAny#>eCollectAny
op1[2p] (parallel Data_Set_11)}
ds2: {op1[2p] (parallel Data_Set_11)
eRandom#>eCollectAny
op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)}
ds3: {op2[2p] (parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
eAny=>eCollectAny
op3[2p] (parallel Peek_13)}
It has 4 operators:
op0[4p] {(parallel input repartition(0))
on nodes (
node1[op0,p0]
node2[op0,p1]
node1[op0,p2]
node1[op0,p3]
)}
op1[2p] {(parallel Data_Set_11)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[2p] {(parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
on nodes (
node1[op2,p0]
node2[op2,p1]
)}
op3[2p] {(parallel Peek_13)
on nodes (
node1[op3,p0]
node2[op3,p1]
)}
It runs 10 processes on 2 nodes.
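To read the partition counts out of a score dump like the one above without counting by eye, here is a minimal Python sketch. It assumes operator headers always look like `opN[Mp] {(...)`, as they do in this dump; the regular expression and the function name `operator_partitions` are made up for illustration and are not part of any DataStage tooling. The sample text is a shortened excerpt of the score above.

```python
import re

# Operator headers in the dump look like: op0[4p] {(parallel input repartition(0))
# Dataset lines (ds0: {...) and node lines (node1[op0,p0]) do not match this pattern.
OP_HEADER = re.compile(r"^(op\d+)\[(\d+)p\]\s*\{\((.*)\)\s*$")


def operator_partitions(score_text):
    """Return {operator id: (partition count, description)} for each operator header."""
    ops = {}
    for line in score_text.splitlines():
        match = OP_HEADER.match(line.strip())
        if match:
            ops[match.group(1)] = (int(match.group(2)), match.group(3))
    return ops


# Shortened excerpt of the score posted above.
SAMPLE_SCORE = """\
It has 4 operators:
op0[4p] {(parallel input repartition(0))
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel Data_Set_11)
op2[2p] {(parallel APT_TransformOperatorImplV0S7_Read2NodeAs4_Transformer_7 in Transformer_7)
op3[2p] {(parallel Peek_13)
"""

ops = operator_partitions(SAMPLE_SCORE)
for op_id, (partitions, description) in ops.items():
    print(op_id, partitions, description)
print("total partitions:", sum(p for p, _ in ops.values()))
```

In this particular dump the partition counts of the four operators (4 + 2 + 2 + 2) add up to the 10 processes reported on the last line; whether that holds for every score is not something this sketch establishes.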
Thank you all for your inputs.
As Sainath mentioned, if we go from 4 nodes to 2 nodes, is the issue that may arise unevenly sized partitions?
My other question is: how can I view datasets using a command?
Thanks
Deepthi
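On the question of uneven partition sizes, a toy model may help show why the random repartitioning reported earlier in the thread (50008 and 49992 records) came out nearly even. This is only a sketch: it assumes each record is sent to a uniformly random target partition, which is a guess at the idea behind a random partitioner and not DataStage's actual implementation. The function name `random_repartition` and the seed are hypothetical.

```python
import random
from collections import Counter


def random_repartition(partition_sizes, target_partitions, seed=0):
    """Send every record of every source partition to a uniformly random target partition.

    Returns the number of records that land in each target partition.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for size in partition_sizes:
        for _ in range(size):
            counts[rng.randrange(target_partitions)] += 1
    return [counts[i] for i in range(target_partitions)]


# 4 source partitions of 25K records each, read with 2 partitions.
sizes = random_repartition([25_000] * 4, 2)
print(sizes, "total:", sum(sizes))
```

Under this assumption the two targets differ by only a small fraction of the total, so skew is unlikely with random repartitioning. Other partitioning methods can behave differently, and this model says nothing about them.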
Sainath.Srinivasan wrote: Arnd,
I know it is possible to read a 2-node ds via 4, but there may be issues doing vice versa.
But can you explain how 10k via 2 nodes became 25k in 4? ...
Yes, it works. The formal way is to clear the partitioning at the input to avoid strange problems, uneven partitioning and (most likely) performance problems. But it works without that too.
I ran a job that wrote 20 million records on 2 nodes and then tried to read them on 1 node. I waited 1 hour for magic to happen, but no luck: not a single record passed through the output link. I then created the dataset on one node, and it took 3 minutes to complete.
Then I created the dataset on 2 nodes once again, set clear partitioning on the input, and ran the job. This time it worked, taking 4 minutes, but I am still not sure that it will work every time.
The first approach might also work on a second attempt, but I can't test that now as I am not on a DataStage box.
It may also be due to single-node versus multi-node, but I am not sure. In fact, I am confused after looking at my own test results as well as Arnd's, and I need more time to analyze them.
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
The Data Set descriptor file includes information about the configuration with which it was written. It is this information that is used when reading the file (though orchadmin dump has an option not to do this).
It gets even more interesting in grid configurations, but they have figured out a way to manage it even when the configuration is dynamic, essentially by using the Data Set's own configuration as a "read only" set of nodes.
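For the earlier question about viewing a Data Set from the command line, here is a hedged sketch of how `orchadmin` might be invoked. The `describe` and `dump` subcommands, the `-n` option (limit the number of records) and the `-x` option (use the current configuration instead of the one stored in the Data Set) are given from memory of the orchadmin documentation and should be checked against your installation. The helper `orchadmin_cmd` is hypothetical, and `/tmp/aw.ds` is simply the descriptor path from the score posted above. The engine environment, including `APT_CONFIG_FILE`, is assumed to be set up already.

```python
import shutil
import subprocess


def orchadmin_cmd(action, descriptor, *options):
    """Build an argv list such as ['orchadmin', 'dump', '-n', '10', '/tmp/aw.ds']."""
    return ["orchadmin", action, *options, descriptor]


# Show the Data Set's layout (assumed subcommand: describe).
describe = orchadmin_cmd("describe", "/tmp/aw.ds")
# Print the first 10 records (assumed option: -n).
dump_head = orchadmin_cmd("dump", "/tmp/aw.ds", "-n", "10")
# Dump using the current configuration, not the stored one (assumed option: -x).
dump_current_cfg = orchadmin_cmd("dump", "/tmp/aw.ds", "-x")

if shutil.which("orchadmin"):
    # Only runs on a machine where orchadmin is on the PATH.
    subprocess.run(describe, check=False)
else:
    # Elsewhere, just show the command lines that would be run.
    for cmd in (describe, dump_head, dump_current_cfg):
        print(" ".join(cmd))
```

Building the argument list separately from running it keeps the sketch usable on a machine without DataStage installed, where it only prints the commands.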
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.