Configuration-N-partitioning questions

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Configuration-N-partitioning questions

Post by zaino22 »

Some questions have been bothering me, and after spending some time on them I thought I should ask you guys for enlightenment. It would help me immensely.
No, these are not interview stumpers.

I am told that the new configuration might produce different output (assuming the developer used Auto partitioning) even if we run on ONE node (with 4 resource and scratch disks in the APT file), since the data is partitioned across the 4 scratch and resource disk spaces.
Notice that the new configuration has separate scratch and resource disks, while the legacy system has one directory (ISAPP) for both scratch and disk space.

My questions are:
-------------------
1) Is it true that even with one node we may get different results, since the data is partitioned across four resource and scratch disks (assuming Auto partitioning)? In other words, I thought having separate resource and scratch disks was for efficiency and did not affect data partitioning, and that it is the number of nodes that usually affects results. So if we update the APT file to only ONE node but still use 4 resource/scratch disks, execution will be sequential regardless of the number of resource and scratch disks used. Am I correct?


2) The IBM best practice guide says to use Auto to let the system decide the best partitioning option, but from experience I know this option doesn't always produce the right results. What do you guys recommend?

3) Which of the following jobs is more efficient and follows best practice?
Job A is designed with a Join stage; inside the Join stage we use Hash partitioning on the key and then sort on the input link before joining.

Job B is designed with a Sort stage just before the Join stage, where we sort the data before it goes into the Join stage; in the Join stage we Hash partition it.

4) When we look at the dump score, we see that a tsort operator has been inserted at some stages. One experienced fellow told me that even though the score says so, not to trust it, because the data may not be sorted properly; he recommended not using Auto partitioning, but instead choosing a partitioning method based on the data and then sorting it myself.
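(For context on question 4: the score in question is produced by setting the environment variable APT_DUMP_SCORE before the run. A minimal sketch of the relevant job environment settings follows; these are standard PX variable names, but check their availability and exact behaviour in your release:)

```
APT_DUMP_SCORE=True                  # write the generated score, including inserted tsort and repartition operators, to the job log
APT_NO_SORT_INSERTION=True           # optional: suppress automatic tsort insertion entirely
APT_SORT_INSERTION_CHECK_ONLY=True   # optional: only verify sort order at runtime instead of sorting
```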


Following relates to question# 1

Legacy configuration:
2 node one server environment running AIX and DS 8.1

Code:

{
        node "node0"
        {
                fastname "primary"
                pools ""
                resource disk "/isApp0/dataset/project#" {pools ""}
                resource disk "/isApp1/dataset/project#" {pools ""}

                resource scratchdisk "/isApp0/Scratch/project#" {pools ""}
                resource scratchdisk "/isApp1/Scratch/project#" {pools ""}
        }

        node "node1"
        {
                fastname "primary"
                pools ""
                resource disk "/isApp0/dataset/project#" {pools ""}
                resource disk "/isApp1/dataset/project#" {pools ""}

                resource scratchdisk "/isApp0/Scratch/project#" {pools ""}
                resource scratchdisk "/isApp1/Scratch/project#" {pools ""}
        }
}
New cluster configuration:
Two servers with 4 nodes each; the config file may not use all nodes. One is the primary server and has local scratch space, resource disk, WAS, and DB2; the other is a compute server with only the IIS engine and local scratch space, with its disk space mounted. Running AIX and DS 8.7.

Code:

{ 
        node "node1" 
        { 
                fastname "primary" 
                pools "" 
                resource disk "/isdataset0/dataset/project#" {pools ""} 
                resource disk "/isdataset1/dataset/project#"  {pools ""} 
                resource disk "/isdataset2/dataset/project#"  {pools ""} 
                resource disk "/isdataset3/dataset/project#"  {pools ""} 
            
                resource scratchdisk "/isscratch0/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch1/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch2/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch3/Scratch/project#" {pools ""} 

        } 
       
        node "node2" 
        { 
                fastname "compute" 
                pools "" 
                resource disk "/isdataset0/dataset/project#" {pools ""} 
                resource disk "/isdataset1/dataset/project#"  {pools ""} 
                resource disk "/isdataset2/dataset/project#"  {pools ""} 
                resource disk "/isdataset3/dataset/project#"  {pools ""} 
            
                resource scratchdisk "/isscratch0/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch1/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch2/Scratch/project#" {pools ""} 
                resource scratchdisk "/isscratch3/Scratch/project#" {pools ""} 

        } 
}
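For comparison with question 1, here is a sketch of a single-node version of the new configuration (same paths as above). With only one logical node, the job runs in a single partition regardless of how many resource and scratch disks are listed, because multiple disks only stripe storage; they do not create extra partitions:

```
{
        node "node1"
        {
                fastname "primary"
                pools ""
                resource disk "/isdataset0/dataset/project#" {pools ""}
                resource disk "/isdataset1/dataset/project#" {pools ""}
                resource disk "/isdataset2/dataset/project#" {pools ""}
                resource disk "/isdataset3/dataset/project#" {pools ""}

                resource scratchdisk "/isscratch0/Scratch/project#" {pools ""}
                resource scratchdisk "/isscratch1/Scratch/project#" {pools ""}
                resource scratchdisk "/isscratch2/Scratch/project#" {pools ""}
                resource scratchdisk "/isscratch3/Scratch/project#" {pools ""}
        }
}
```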
Last edited by zaino22 on Wed Dec 12, 2012 6:36 pm, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

First point - Auto will always yield correct results (though it may not perform optimally).

Second point - Data Sets are read using the configuration with which they were written, even if this needs to be created on the fly, and mapped onto the configuration with which the job is running.

In short, it will work.

Your "experienced fellow" is wrong about the inserted tsort operators. We'd be happy to hear his evidence.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Thank you, Ray, for the quick response. Good catch; I corrected the spelling. :)

In DS 7.5, a job that had a Join stage with Auto partitioning was working fine, but when we migrated the job to DS 8.1 the results were not the same. Once we hash partitioned the inputs on the keys and then sorted them before joining, the results matched DS 7.5. I remember another experienced fellow telling me not to use Auto after DS 7.5 because the results may not be correct. Hence my question.

Would you please also let me know the answer to #3: which job is the better and more efficient design?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I prefer to use explicit Join stage.

By default the job designs are identical. The score will show a tsort operator in each case, with 20MB memory allocated per node.
Environment variable APT_TSORT_STRESS_BLOCKSIZE affects every sort in the job.

The Join stage allows me to allocate more memory to the sort, and to generate key change columns. It also allows me to specify "don't sort (previously sorted)", which can save quite a deal of processing. The in-link sort does not provide any of these benefits.

Inserted tsort operators only occur where sorted data are needed and there is no explicit sort on that input link.
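The mechanics being discussed (join inputs hash-partitioned and then sorted on the same keys) can be illustrated outside DataStage. The following is a plain-Python sketch with made-up data, not PX code: hash partitioning guarantees that equal keys land in the same partition, and the per-partition sort provides what an inserted tsort operator would otherwise supply before a merge-style join:

```python
from collections import defaultdict

def hash_partition(rows, key, n):
    """Distribute rows into n partitions by hashing the join key,
    so rows with equal keys always land in the same partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    # Sort each partition on the key, as a tsort operator would
    return [sorted(parts[i], key=lambda r: r[key]) for i in range(n)]

def merge_join(left, right, key):
    """Inner join of two partitions already sorted on the key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit a joined row for every matching right-side row
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out

# Hypothetical data: join runs independently per partition,
# which is only correct because both sides were hash partitioned
# and sorted on the same key.
orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
custs = [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
lp = hash_partition(orders, "id", 2)
rp = hash_partition(custs, "id", 2)
joined = [row for p in range(2) for row in merge_join(lp[p], rp[p], "id")]
```

If the two inputs were partitioned on different keys (or round-robin), matching rows could land in different partitions and the per-partition join would silently drop them, which is the failure mode described earlier in this thread.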
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

Thank you Ray for sharing the insight.

I think you meant to say "I prefer to use an explicit Sort stage." and "The Sort stage allows me to..."
I don't mean to be rude, but I hope someone else doesn't get confused, since we value the experts' opinions.

Thanks again for your detailed and timely response.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You are correct. I meant to say "explicit Sort stage".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

ray.wurlod wrote:First point - Auto will always yield correct results (though it may not perform optimally).
I recall an instance in V7.5 where I had multiple (20) Join stages running on 2 nodes. While testing I came across a data discrepancy, and rerunning on a single node produced the correct data. I didn't have time to dig deep at the time, but I was convinced it was all due to Auto partitioning, and since then I have decided to take charge of which partitioning is implemented at every stage. Not sure if anyone else has noticed such a thing.
- Zulfi
zaino22
Premium Member
Posts: 81
Joined: Thu Mar 22, 2007 7:10 pm

Post by zaino22 »

@zulfi123786: The issue you had is similar to mine.
The IBM manuals and best practice docs all suggest using the Auto option to let DataStage decide the partitioning. However, I read Ray's response to another post where he recommended using an environment variable (the score dump, APT_DUMP_SCORE, I believe) to see what DataStage actually does behind the scenes. If it doesn't use Hash partitioning and a tsort operator, and instead uses some other partitioning such as Round Robin, then you know the results won't be correct. I wish I could go back and check what partitioning DataStage chose when I selected the Auto option.

Also, I like to be the one controlling the output when I am developing DataStage code, so selecting the partitioning myself ensures consistent results every time.

Thanks for your comments.