Parallel database extract stages

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Parallel database extract stages

Post by PhilHibbs »

How are parallel database source stages executed? Are they really parallel, or does it just execute the SQL connection from one node? If it does try to multi-node it, how does it decide which node gets which row?

In Director Monitor, "Show Instances" just shows a single row count with no node numbers. Does that mean that the stage is always compiled in Sequential mode regardless of the Advanced setting on the stage?

And, is anyone else confused by the use of the term "Show Instances" here, where in other contexts an "instance" is a separate run of a job with a different Invocation ID?
Phil Hibbs | Capgemini
Technical Consultant
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

From my experience (on an Oracle source): the type of parallelism for a database source depends on how you configure it (enable partitioned reads) => round robin on rowid / hash on rowid / modulus on a numeric column / database partitioning.

For example, if you use rowid hash partitioning on a two-node configuration, it will open two DB sessions that run the query with a WHERE clause added as follows:

Code: Select all

AND (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 0) --for first partition
AND (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 1) --for second partition
For a rowid round robin, it will behave like this:

Code: Select all

AND (MOD (DBMS_ROWID.ROWID_ROW_NUMBER (TABLE_NAME.ROWID), 2) = 0) --for first partition
AND (MOD (DBMS_ROWID.ROWID_ROW_NUMBER (TABLE_NAME.ROWID), 2) = 1) --for second partition
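For modulus on a numeric column, I would expect predicates of the same shape (NUM_COL here is just a placeholder for your numeric key column):

Code: Select all

AND (MOD (TABLE_NAME.NUM_COL, 2) = 0) --for first partition
AND (MOD (TABLE_NAME.NUM_COL, 2) = 1) --for second partition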
Without partitioned reads enabled, the stage will run sequentially.

So what you see as instances in the monitor are the DB sessions opened in parallel to perform these queries, "automatically" tuned by DS.

When you use a specific SQL statement, you must specify a table for the partitioned reads. Be aware that, depending on the joins in your SQL statement, choosing the right table can dramatically affect the performance of the query. Testing with an execution plan would be beneficial in that case.
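In Oracle, one quick way to compare plans for candidate tables is EXPLAIN PLAN (the table name and predicate below are only placeholders):

Code: Select all

EXPLAIN PLAN FOR
SELECT * FROM TABLE_NAME
WHERE (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 0);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);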

Regards,
Eric
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

eph wrote:From my experience (on an Oracle source): the type of parallelism for a database source depends on how you configure it (enable partitioned reads) => round robin on rowid / hash on rowid / modulus on a numeric column / database partitioning.
Does it look ahead at the partitioning on the input to the next stage? Because you can only specify partitioning on an input, not on a stage or an output.
Phil Hibbs | Capgemini
Technical Consultant
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

The problem with partitioned reads is that the selection of partitioning methods is limited. If a modulus on a numeric column suits the needs of your job design, you are fine. Using database partitioning is less flexible, because you need to make sure you have the same number of partitions in the database and in DataStage to avoid repartitioning directly after input. This limits the ability to scale DataStage by adding compute nodes.
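As a rough sketch of what database partitioning amounts to on an Oracle source, each session reads one database partition via Oracle's PARTITION extension clause (the partition name P1 is hypothetical):

Code: Select all

SELECT * FROM TABLE_NAME PARTITION (P1);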
PhilHibbs wrote:you can only specify partitioning on an input, not on a stage or an output
Remember that reading data from a database is an import operation populating a virtual dataset. You could actually import data using a round-robin partitioning algorithm and immediately repartition at the first active stage in DataStage. But this is almost certain to be less effective than reading the data sequentially from the database and partitioning only once.
"It is not the lucky ones who are grateful. It is the grateful ones who are happy." Francis Bacon
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Personally, I only use partitioned reads on huge tables, and only after a good talk with a DBA to ensure that opening N sessions running the query won't affect the source DB negatively, and to validate the query/indexes/hints used.
BI-RMA wrote:You could actually import data using a round-robin partitioning algorithm and immediately repartition at the first active stage in DataStage. But this is almost certain to be less effective than reading the data sequentially from the database and partitioning only once.
I'm no specialist, but I usually find re-partitioning extremely fast, even on huge datasets, so in my previous cases it wasn't an issue. Testing that functionality is pretty simple, so in any case you can easily validate (or rule out) the use of partitioned reads.