Partitioning.....

Adam_Clone
Participant
Posts: 26
Joined: Fri Apr 08, 2005 12:58 am

Post by Adam_Clone »

Hi all!
Does specifying partitioning help in any way on a single-processor system? I mean, can the co-processor available with Pentiums/AMDs be used as a second node for better processing through partitioning and pipelining? If so, what is the best way to specify the partitioning? Will auto partitioning do the trick for the best configuration?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Adam,

the partitioning of data is more a function of the data, its entry and subsequent access, than it is of the number of CPUs. The goal is to make some function faster, i.e. to take fewer CPU cycles to execute the query or write statement. Partitioning can significantly decrease the time an optimized query takes to execute in a DWH - imagine a nice date query on a table partitioned by MONTH that contains terabytes of data; stay within a partition and even a full table scan on that partition will be quick. Work on a non-partitioned table with the same query and it will execute as long as Douglas Adams's query which resulted in an answer of 42.
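To make the MONTH example concrete, here is a minimal Python sketch of the idea, not of any database's actual partitioning engine. The row layout (`day`, `amount`) and the helper names `partition_by_month` and `query_month` are made up for illustration. It shows that a query confined to one month only has to scan that month's partition and still returns the same rows as a scan of the whole table.

```python
from collections import defaultdict
from datetime import date, timedelta

def partition_by_month(rows):
    """Group rows into partitions keyed by the (year, month) of their date."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["day"]
        partitions[(d.year, d.month)].append(row)
    return partitions

def query_month(partitions, year, month):
    """Scan only the single partition that can contain matching rows."""
    return partitions.get((year, month), [])

# Hypothetical data: one row per day of 2005.
start = date(2005, 1, 1)
rows = [{"day": start + timedelta(days=i), "amount": i} for i in range(365)]

parts = partition_by_month(rows)

# Partitioned access: touches only the April partition.
april = query_month(parts, 2005, 4)

# Non-partitioned access: every row has to be inspected.
full_scan = [r for r in rows if r["day"].year == 2005 and r["day"].month == 4]

assert april == full_scan
print(len(april), "rows scanned in one partition versus", len(rows), "rows in the full table")
```

The saving grows with the table: the partitioned query reads roughly one twelfth of a year of data here, whatever the total volume.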

The partitioning of your DS job's input can be left alone in many cases. If you have a job where you perform lookups and can partition your source data as well as your lookups so that no query goes across partition boundaries, you can get some great performance boosts - but if you get the partitioning wrong, your results will be incorrect.
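A small Python sketch of why the lookup partitioning matters. This is plain Python, not DataStage code; the column names (`cust_id`, `name`), the modulus-based hash and all function names are assumptions chosen for illustration. When source and lookup data are both hash-partitioned on the key, every partition finds its matches locally. When the lookup data is spread round-robin instead, some keys land in a different partition from their source rows and the matches are silently lost.

```python
def hash_partition(rows, key, n):
    """Send each row to partition (key value mod n); equal keys always share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts

def round_robin_partition(rows, n):
    """Deal rows out in turn, ignoring their key values."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def partitioned_lookup(source_parts, lookup_parts, key):
    """Each partition sees only its own slice of the lookup data."""
    matched = []
    for src, ref in zip(source_parts, lookup_parts):
        ref_by_key = {r[key]: r["name"] for r in ref}
        for row in src:
            if row[key] in ref_by_key:
                matched.append((row[key], ref_by_key[row[key]]))
    return sorted(matched)

# Hypothetical data: eight source rows and a reference row for every key.
source = [{"cust_id": i} for i in range(1, 9)]
lookup = [{"cust_id": i, "name": f"cust{i}"} for i in (3, 1, 4, 2, 8, 5, 7, 6)]

# Same key, same partitioning on both inputs: every row finds its match.
good = partitioned_lookup(hash_partition(source, "cust_id", 2),
                          hash_partition(lookup, "cust_id", 2), "cust_id")

# Mismatched partitioning: matches that cross partition boundaries are lost.
bad = partitioned_lookup(hash_partition(source, "cust_id", 2),
                         round_robin_partition(lookup, 2), "cust_id")

print(len(good), "matches with consistent partitioning,", len(bad), "with mismatched partitioning")
```

Nothing fails in the mismatched case; the job simply returns fewer rows, which is what makes this kind of mistake hard to spot.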

Partitioning is one of the boons of Px and a place where you can see real performance boosts; but I think that the Pareto law applies here as well, meaning that 20% of your jobs need 80% of the tuning...
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

:lol: ,
IMHO:
Assuming this also means no more than 1 controller/disk (a basic PC with Linux???),
if it does help in some rare situation, the gain would most likely be negligible.

It all boils down to one or more running processes sharing limited resources, in this case 1 CPU.

You might gain something if the process has some waiting points not using CPU.

You might gain something if you can read the data in parallel, provided you have several disk controllers and the task is mostly I/O bound,
i.e. 1M rows at 1000 rows/second vs. 4 reads of 250,000 rows at 400 rows/second.
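
The arithmetic behind those figures, as a quick Python sketch. It assumes the four reads run fully concurrently and that each stream really sustains its 400 rows/second; the function name `elapsed_seconds` is made up for illustration.

```python
def elapsed_seconds(total_rows, streams, rows_per_second_per_stream):
    """Wall-clock time when the rows are split evenly across concurrent streams."""
    rows_per_stream = total_rows / streams
    return rows_per_stream / rows_per_second_per_stream

# One reader at 1000 rows/second.
single = elapsed_seconds(1_000_000, 1, 1000)

# Four concurrent readers, each slower at 400 rows/second.
parallel = elapsed_seconds(1_000_000, 4, 400)

print(f"single read: {single:.0f} s, four parallel reads: {parallel:.0f} s, "
      f"speedup: {single / parallel:.1f}x")
```

So even though each stream is slower, the parallel read finishes in 625 seconds against 1000 seconds for the single read, but only while the controllers, not the single CPU, are the bottleneck.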

Since no one uses this kind of configuration for a standard production system, this is all academic and of no real interest to most people.
Roy R.
Time is money but when you don't have money time is all you can afford.
