Need to know more about performance tuning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsedi
Participant
Posts: 220
Joined: Wed Jun 02, 2004 12:38 am

Need to know more about performance tuning

Post by dsedi »

Hi all

I am working on PX 7.0, 4 processors AIX system.
I used 4 configuration files (1-node, 2-node, 3-node and 4-node) for a particular job, one by one, to see which of them would give me the best rows/sec performance.
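For reference, a minimal 2-node configuration file of the kind being compared here might look roughly like this (the fastname and the disk/scratchdisk paths are placeholders, not the actual values from this system):

{
  node "node1"
  {
    fastname "aix_host"
    pools ""
    resource disk "/data/ds/pds1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "aix_host"
    pools ""
    resource disk "/data/ds/pds2" {pools ""}
    resource scratchdisk "/data/ds/scratch2" {pools ""}
  }
}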

I was expecting the result to be best for the 2-node file, as it is often mentioned that one should go for a config file with the number of nodes equal to half the number of processors.

But the results were best with the 1-node file.
I tried Auto partitioning as well as Round Robin partitioning, but the results remained unchanged.

My job has a source (Oracle stage), a Transformer stage and a target Data Set stage.

Are there any other environment variables which I am supposed to set, other than APT_CONFIG_FILE?

Thanks in advance.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

In most PX jobs the best performance may well be with 1-node configurations. The process overhead of starting and coordinating many parallel processes usually outweighs the performance benefit for jobs that only process thousands instead of millions of records.

As a rule of thumb I would default all PX jobs to use a 1-node configuration regardless of your system's CPU count and memory size. Those jobs that run more than 5 or 10 minutes should use more nodes, but not necessarily the maximum recommended 1-node-per-2-CPU count that you've already alluded to.
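A simple way to follow that rule of thumb (just a sketch - the file names and paths below are examples, not files that already exist on your system) is to keep one configuration file per node count and point APT_CONFIG_FILE at the appropriate one per job, for instance via a job parameter or the environment:

export APT_CONFIG_FILE=/opt/ds/configs/1node.apt   # default for short, simple jobs
export APT_CONFIG_FILE=/opt/ds/configs/4node.apt   # only for long-running, CPU-bound jobs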
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

As mentioned, it depends upon the number of rows available. Is it MPP or SMP?
An SMP architecture should benefit from more nodes for large-file processing most of the time.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Kumar,

that is not necessarily the case. Here with a 24-CPU P-Series it turns out that a 1-node configuration is fastest for many of the jobs.

Just as an example, take outputting to a database that resides on a remote machine. If the bandwidth between the DataStage server and the database server is limited, then you can add processors, processing nodes, memory and whatever else to the DataStage server to your heart's content and it won't make the data load any faster. But it will fill up your server with lots of processes and interprocess communication, so any other jobs running concurrently will be slowed down and your system's context-switching rates will go up.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

I do accept that there are some rare cases like that on the E and L parts of ETL. But as far as the T part is concerned, the maximum number of nodes should be the right choice, and that is what makes choosing PX make sense.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

A good measurement there is to use dump_score and look at the T-type stages (i.e. a Modify stage or a Transformer).
As an example, if with a 2-node configuration those stages are at only 80% CPU usage, and with a 4-node configuration they drop to 45%, you are getting the same throughput but are using extra processes. If the "active" stages are at close to 100% CPU then it makes sense to add another level of parallelism in that job.
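To collect those CPU figures, a minimal sketch of the usual setup (the values shown are assumptions - set them wherever your project normally manages environment variables, e.g. dsenv or the job properties):

export APT_DUMP_SCORE=1          # write the runtime score (operators, nodes, processes) to the job log
export APT_PM_PLAYER_TIMING=1    # report the CPU time consumed by each player process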

I think my worry is that by using a single job as an example or a sample for a test you can get another 10% performance increase by adding virtual nodes. But in a real multi-user environment it is important to look at overall performance - the additional 10% might have used up so many resources that the rest of the system becomes slower. Just using the maximum or a high number of parallel nodes in PX without giving thought to why is wasting system resources.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

[quote="ArndW"]I think my worry is that by using a single job as an example or a sample for a test you can get another 10% performance increase by adding virtual nodes.[/quote]
Absolutely. In a multi-user environment, multiple processes running simultaneously, if they were designed perfectly, can reach 100% CPU usage.
In general, E and L will be more I/O-constrained than CPU-bound, and T will have more CPU usage while the other constraints matter least.
Now there are many stages and database arrangements where the E and L have less I/O, with the DataStage and database partitioning in seamless sync.
There may be some circumstances where all the CPU-hungry jobs get triggered simultaneously and have to wait for the CPU to become free.
So it all again depends on the way we design and the available resources.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
dsedi
Participant
Posts: 220
Joined: Wed Jun 02, 2004 12:38 am

Post by dsedi »

Hi Kumar

I tried to keep my job's design simple, then used the config files one after the other; at the same time I tried changing the various types of partitioning.

But the best result I got was for

Oracle stage partition type => Entire
Config file => 1 node

Why is it that even though I am partitioning the data and have 2 nodes working on the individual data partitions, the job still takes more time?

I am not sure whether I am trying to optimize my job in the right way :roll:

Could you please guide me

Thanks in advance.
atul sharma
Premium Member
Posts: 17
Joined: Thu Jun 30, 2005 6:52 am
Location: United States

Post by atul sharma »

As mentioned, I tried using the dump_score environment variable to figure this out.

Below is the info which it provided:

main_program: This step has 3 datasets:
ds0: {op0[1p] (sequential Oracle_Enterprise_0)
eAny<>eCollectAny
op1[3p] (parallel APT_TransformOperatorImplV0S2_OT_Transformer_2 in Transformer_2)}
ds1: {op2[2p] (parallel delete data files in delete /app/scr/ds/scr/src/testing.ds)
>>eCollectAny
op3[1p] (sequential delete descriptor file in delete /app/scr/ds/scr/src/testing.ds)}
ds2: {op1[3p] (parallel APT_TransformOperatorImplV0S2_OT_Transformer_2 in Transformer_2)
=>
/app/scr/ds/scr/src/testing.ds}
It has 4 operators:
op0[1p] {(sequential Oracle_Enterprise_0)
on nodes (
node1[op0,p0]
)}
op1[3p] {(parallel APT_TransformOperatorImplV0S2_OT_Transformer_2 in Transformer_2)
on nodes (
node1[op1,p0]
node2[op1,p1]
node3[op1,p2]
)}
op2[2p] {(parallel delete data files in delete /app/scr/ds/scr/src/testing.ds)
on nodes (
node1[op2,p0]
node2[op2,p1]
)}
op3[1p] {(sequential delete descriptor file in delete /app/scr/ds/scr/src/testing.ds)
on nodes (
node1[op3,p0]
)}
It runs 7 processes on 3 nodes.


Could you please explain to me what all this means?
I am only able to make out that there are 7 processes running on 3 nodes.

I am also not able to figure out why the Oracle stage is shown as sequential, with only one process assigned to it.

Thanks in advance.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

[quote="dsedi"]Hi Kumar

Oracle stage partition type =>Entire
Config file=>1 node

Why is it that even though I am partitioning the data and then having 2 nodes to work on individual data partions, still the job takes more time.
quote]
As Arnd and I mentioned, with this simple design of input - partition - output there is little for a multi-node configuration to gain from the I/O stages, since the partitioning itself takes extra time - and this is also the only job running on the node.
You can leverage the full power of PX when you have many processes competing for the CPU and the required transformation work is heavier (which in turn needs more CPU).
Again, if you have a sequential file or any database that must be written to sequentially, you may compromise performance by funnelling the data back down to a single node.
For a case like this, where the job runs alone on the server and has a very simple design that is likely to be I/O-heavy, it is preferable to have a single-node config.
If you have heavy transformation, it is always recommended to use as many nodes as possible.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
sribuz
Premium Member
Posts: 19
Joined: Mon Jun 04, 2007 7:26 pm
Location: US

Post by sribuz »

Kumar_s,

I understand you to say: prefer a single-node config for simple-design jobs like the ones below,
text_file -> transformer -> database
or
database -> copy -> database
which don't have much transformation.

How can we increase the performance of jobs with such a simple design, other than playing with the array & transaction size?
Thanks & Regards,
Sri
kandyshandy
Participant
Posts: 597
Joined: Fri Apr 29, 2005 6:19 am
Location: Singapore

Post by kandyshandy »

Make sure that each partition has more or less the same number of records to process. The next question might be: how do you do this? Auto partitioning takes care of this in most of the stages, but we might be forced to use hash partitioning in some cases, and then each partition may not get a "likely" equal number of records.

BTW, what are you trying to achieve? Or are you looking for interview question answers? Just kidding !!
Kandy
_________________
Try and Try again…You will succeed atlast!!
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Actually hash partitioning will almost always introduce data skew... sometimes very extreme... try hash partitioning a set of records where they all have the same key and you'll see 100% data skew. Hash partitioning is normally used for key-based operations where you need to guarantee that records having identical keys will end up in the same partition.

Round robin partitioning is normally used to achieve near equal size partitions (+/- 1 record).
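A quick worked illustration with made-up data: take 10 rows whose key column holds only two distinct values, A (4 rows) and B (6 rows), on a 4-node configuration.

Round robin : 3 | 3 | 2 | 2   (rows dealt out in turn, always near-equal)
Hash on key : 4 | 6 | 0 | 0   (or even 10 | 0 | 0 | 0 if both key values hash to the same partition)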

I personally hate the idea of auto partitioning. To me it is IBM saying "our developers are too dumb to manage partitioning so we'll manage it for them".

Mike
sribuz
Premium Member
Posts: 19
Joined: Mon Jun 04, 2007 7:26 pm
Location: US

Post by sribuz »

[quote="kandyshandy"] Next question might be how to do this? Auto partitioning takes care of this in most of the stages but we might be forced to do hash partitioning for some cases where each partition may not have "likely" same no. of records.

BTW, what are you trying to achieve? or are you trying for interview question answers? Just kidding !! [/quote]

It's always confusing for me when to use a 1-node or a 2-/multi-node config.
I was digging through all the threads and found that multi-node is the way to achieve parallelism, and that you can opt for it for a job running more than 10 minutes.

But this thread was kind of a contrast, as kumar_s and ArndW say it's preferred to go with a single node for a simple design that doesn't have much transformation.
Thanks & Regards,
Sri
kandyshandy
Participant
Posts: 597
Joined: Fri Apr 29, 2005 6:19 am
Location: Singapore

Post by kandyshandy »

A single-node job is equivalent to a server job. No one other than you can decide which best fits your requirements. Everyone suggests that if the number of records is small, go with a server job or a 1-node PX job; if the volume is larger, go with multi-node.

Or we can say it like this: dimensions (mostly) can be developed as server jobs or 1-node PX jobs, whereas facts can be multi-node PX jobs.

The best way is to experiment at your work. Try both 'server or 1-node' and multi-node jobs and test them.
Kandy
_________________
Try and Try again…You will succeed atlast!!