Reading & writing using different configuration files
Posted: Wed Jan 18, 2012 4:32 pm
by nagarjuna
I have 2 jobs:
Job 1: I created a dataset, dataset_8node, using an 8-node configuration file. While creating it, I sorted and hash-partitioned on Key1.
Job 2: I am joining dataset_8node with an Oracle table.
Code:
dataset_8node ----> Same partitioning, Sort ("Don't sort, previously sorted")
                        |
                        V
                 JOIN STAGE (Same partitioning on both input links) ------> output
                        ^
                        |
Oracle table ----> hash partition & sort on Key1
Suppose I run Job 2 on 4 nodes. Will there be any unexpected results?
Looking forward to your inputs. Thanks in advance.
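For anyone following along, a 4-node parallel configuration file has roughly this shape. This is only a minimal sketch; the fastname and the disk/scratchdisk paths are placeholders you would replace with your own server name and filesystem locations.

```
{
  node "node1" {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2" {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  /* ...node3 and node4 defined the same way... */
}
```

The job picks up whichever file APT_CONFIG_FILE points to at run time, which is what makes the 8-node-written / 4-node-read situation in this thread possible.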
Posted: Wed Jan 18, 2012 4:37 pm
by nagarjuna
Here I created dataset_8node using an 8-node configuration file and am reading it in a job running on 4 nodes. However, I specified Same partitioning while reading it in Job 2. So here we have 2 input links to the Join: one input link was created on 8 nodes and the other is created on 4 nodes. I am curious to know how this works.
Posted: Wed Jan 18, 2012 4:59 pm
by ray.wurlod
When you read a Data Set and the configuration file you're using is not the one with which the Data Set was written, a temporary configuration file is used by the copy operator that reads from the Data Set.
You can achieve the same with the -x option of the orchadmin command.
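For reference, an orchadmin session for inspecting a dataset might look like the sketch below. This is a command sketch only; exact subcommands and flags vary by PX version, so check the usage output on your own installation before relying on it.

```
# Inspect the dataset: schema, segments, and the configuration
# file it was written with (syntax may differ on your version).
orchadmin describe dataset_8node.ds

# Per Ray's note, the -x option relates to the configuration file
# used when operating on the dataset; consult your local orchadmin
# usage text for the exact form on your release.
orchadmin -x ...
```

The key point from the thread stands regardless of syntax: the copy operator that reads the dataset uses a temporary configuration file matching the one the dataset was written with.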
Posted: Wed Jan 18, 2012 6:40 pm
by nagarjuna
Thanks for your response, Ray. I understand that the dataset will be read on 8 nodes because of the temporary config file you mentioned.
Now my question is: how does the other link of the Join work? The other link reads from an Oracle table and hash-partitions on Key1. Does that operation take place on 4 nodes or 8?
Posted: Wed Jan 18, 2012 8:41 pm
by ray.wurlod
That partitioning is occurring on the input link of the Join stage and therefore will use the job's current configuration file.
Posted: Wed Jan 18, 2012 11:49 pm
by nagarjuna
I executed the job, and it looks like the other input link to the Join stage (the read from the Oracle table) is also running on 8 nodes, as opposed to 4. The stages downstream of the Join are running on 4 nodes. Any idea why this is happening?
Posted: Thu Jan 19, 2012 2:43 am
by ray.wurlod
How do you know that?
Dump the score to learn definitively which operators are processing on which nodes.
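To dump the score, the usual approach is to set the APT_DUMP_SCORE environment variable as a job parameter; the score then appears in the job log and shows exactly which operators run on which nodes, plus any inserted tsort/partitioner operators. A minimal job-parameter fragment (assuming the standard PX environment variable is available on your install):

```
$APT_DUMP_SCORE = True
```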
Posted: Thu Jan 19, 2012 5:52 am
by nagarjuna
I checked the Job Monitor, and in it I found 8 instances of the Sort feeding the Join's second input link (reading from the Oracle table and sorting).
Suppose it should run on 4 nodes; then I think the Join operation won't work properly, since one input link (the dataset) runs on 8 nodes and the other runs on 4. Please note that I specified Same partitioning on both input links of the Join stage.
Any thoughts on this?
Posted: Thu Jan 19, 2012 3:01 pm
by ray.wurlod
You can always use node pools to force things to run on the nodes you require.
Posted: Thu Jan 19, 2012 3:53 pm
by nagarjuna
Ray, I understand that we can constrain a stage to run on particular nodes by defining node pool constraints. My question here is how input 2 of the Join can run on 8 nodes even after APT_CONFIG_FILE is set to the 4-node file in the job parameters.
Posted: Thu Jan 19, 2012 5:56 pm
by ray.wurlod
Was the Data Set that feeds it written with an 8-node configuration file?
Posted: Thu Jan 19, 2012 7:31 pm
by nagarjuna
Yes, the dataset was created on 8 nodes. While reading from that dataset, the job reads on 8 nodes (even though the job executes on 4 nodes, because of the temporary config file you mentioned).
I am wondering why the other link to the Join also executes on 8 nodes up to the Join stage (Oracle --> Sort (8 instances) --> second input of the Join stage).