Reading & writing using different configuration files

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Reading & writing using different configuration files

Post by nagarjuna »

I have 2 jobs

Job 1: I created a dataset, dataset_8node, using an 8-node configuration file. While creating it, I sorted and hash-partitioned the data on Key1.

Job 2: I am joining dataset_8node with an Oracle table.


      dataset_8node ---> Same partitioning, Sort (don't re-sort previously sorted data)
                                         |
                                         v
                                    JOIN STAGE (Same partitioning on both input links) ---> output
                                         ^
                                         |
      Oracle table ---> hash partition & sort on Key1
Suppose I run Job 2 on 4 nodes; will there be any unexpected results?
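As a side note, the partition count recorded in the dataset's descriptor can be checked from the command line. A sketch only, assuming orchadmin is on the path of the engine tier; the dataset path here is a placeholder:

```shell
# Sketch only -- the dataset path is a placeholder; run on the engine
# tier with APT_CONFIG_FILE set. "describe" reports the schema and the
# partitions recorded when dataset_8node was written.
orchadmin describe /data/work/dataset_8node.ds
```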

Looking for your inputs. Thanks in advance.
Nag
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Here I created dataset_8node using an 8-node configuration file and am reading it in a job running on 4 nodes. However, I specified Same partitioning while reading it in Job 2. So here we have 2 input links to the Join: one input link was created on 8 nodes and the other is created on 4 nodes. I am curious to know how this works.
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

When you read a Data Set and the configuration file you're using is not the one with which the Data Set was written, a temporary configuration file is used by the copy operator that reads from the Data Set.

You can achieve the same with the -x option of the orchadmin command.
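A sketch of the shape of that command. The -x flag is the one mentioned above; the subcommand and paths are placeholders, to be checked against the orchadmin reference for your version:

```shell
# Sketch only: per the post above, -x makes orchadmin use the current
# APT_CONFIG_FILE rather than the configuration recorded in the
# dataset's descriptor. Subcommand and paths are placeholders.
orchadmin copy -x /data/work/dataset_8node.ds /data/work/dataset_4node.ds
```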
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Thanks for your response, Ray. I understand that the dataset will be read on 8 nodes because of the temporary config file you mentioned.
Now my question is how the other link of the Join works. The other link is reading from an Oracle table and hash-partitioning on Key1. Does this operation take place on 4 or 8 nodes?
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That partitioning is occurring on the input link of the Join stage and therefore will use the job's current configuration file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

I have executed the job, and it looks like the other input link to the Join stage (the read from the Oracle table) is also running on 8 nodes as opposed to 4. The stages downstream of the Join are running on 4 nodes. Any idea why this is happening?
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How do you know that?

Dump the score to learn definitively which operators are processing on which nodes.
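For reference, the score dump is switched on with an environment variable; a minimal fragment, assuming it is set as a job parameter or in the environment of the job run:

```shell
# Config fragment: with APT_DUMP_SCORE set, the parallel engine writes
# the job score (which operators run on which nodes, and the datasets
# between them) into the job log at startup.
export APT_DUMP_SCORE=True
```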
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

I checked the Job Monitor, and in it I found 8 instances of the Sort before the Join's 2nd input link (reading from the Oracle table and sorting).

Suppose it should run on 4 nodes: then I think the Join won't work properly, as one input link (the dataset) runs on 8 nodes and the other runs on 4. Please note that I specified Same partitioning on both input links of the Join stage.

Any thoughts on this ?
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can always use node pools to force things to run on the nodes you require.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Ray, I understand that we can constrain a stage to run on particular nodes by defining node pool constraints. My question here is how input 2 to the Join is running on 8 nodes even after setting APT_CONFIG_FILE to a 4-node file in the job parameters.
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Was the Data Set that feeds it written with an 8-node configuration file?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Yes, the dataset was created on 8 nodes. While reading from that dataset, it is read on 8 nodes (even though the job executed on 4 nodes, because of the temporary config file you mentioned).

I am wondering why the other link to the Join is also executed on 8 nodes up to the Join stage (Oracle --> Sort (8 instances) --> 2nd input of the Join stage).
Nag