Parallel database extract stages

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Parallel database extract stages

Post by PhilHibbs »

How are parallel database source stages executed? Are they really parallel, or does it just execute the SQL connection from one node? If it does try to multi-node it, how does it decide which node gets which row?

In Director Monitor, "Show Instances" just shows a single row count with no node numbers. Does that mean that the stage is always compiled in Sequential mode regardless of the Advanced setting on the stage?

And, is anyone else confused by the use of the term "Show Instances" here, where in other contexts an "instance" is a separate run of a job with a different Invocation ID?
Phil Hibbs | Capgemini
Technical Consultant
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

From my experience (on an Oracle source): the type of parallelism for a database source depends on how you configure it (enable partitioned reads) => round robin on rowid / hash on rowid / modulus on a numeric column / database partitioning.

For example, if you use rowid hash partitioning on a two-node configuration, it will open two DB sessions that run the query with a WHERE clause added as follows:

Code: Select all

AND (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 0) --for first partition
AND (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 1) --for second partition
For a rowid round robin, it will behave like this:

Code: Select all

AND (MOD (DBMS_ROWID.ROWID_ROW_NUMBER (TABLE_NAME.ROWID), 2) = 0) --for first partition
AND (MOD (DBMS_ROWID.ROWID_ROW_NUMBER (TABLE_NAME.ROWID), 2) = 1) --for second partition
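For modulus on a numeric column, I would expect predicates of the same shape (NUM_COL here is just a placeholder for your numeric key column):

Code: Select all

AND (MOD (TABLE_NAME.NUM_COL, 2) = 0) --for first partition
AND (MOD (TABLE_NAME.NUM_COL, 2) = 1) --for second partition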
Without partitioned reads enabled, the stage will run sequentially.

So what you see as instances in the monitor are the DB sessions opened in parallel to perform these queries, "automatically" tuned by DS.

When you use a specific SQL statement, you must specify a table for the partitioned reads. Be aware that, depending on the joins in your SQL statement, choosing the right table can dramatically affect the performance of the query. Testing with an execution plan would be beneficial in that case.
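In Oracle, one quick way to compare plans for candidate tables is EXPLAIN PLAN (the table name and predicate below are only placeholders):

Code: Select all

EXPLAIN PLAN FOR
SELECT * FROM TABLE_NAME
WHERE (MOD (ORA_HASH (TABLE_NAME.ROWID), 2) = 0);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);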

Regards,
Eric
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

eph wrote:From my experience (on an Oracle source): the type of parallelism for a database source depends on how you configure it (enable partitioned reads) => round robin on rowid / hash on rowid / modulus on a numeric column / database partitioning.
Does it look ahead at the partitioning on the input to the next stage? Because you can only specify partitioning on an input, not on a stage or an output.
Phil Hibbs | Capgemini
Technical Consultant
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

The problem with partitioned reads is that the selection of partitioning methods is limited. If a modulus on a numeric column suits the needs of your job design, you are fine. Using database partitioning is less flexible, because you need to make sure you have the same number of partitions in the database and in DataStage to avoid repartitioning directly after input. This limits the ability to scale DataStage by adding compute nodes.
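As a rough sketch of what database partitioning amounts to on an Oracle source, each session reads one database partition via Oracle's PARTITION extension clause (the partition name P1 is hypothetical):

Code: Select all

SELECT * FROM TABLE_NAME PARTITION (P1);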
PhilHibbs wrote:you can only specify partitioning on an input, not on a stage or an output
Remember that reading data from a database is an import operation populating a virtual dataset. You could actually import data using a round-robin partitioning algorithm and immediately repartition at the first active stage in DataStage. But this is almost certain to be less effective than reading the data sequentially from the database and partitioning only once.
"It is not the lucky ones who are grateful. It is the grateful ones who are happy." Francis Bacon
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Personally, I only use partitioned reads on huge tables, and only after a good talk with a DBA to ensure that opening N sessions running the query won't affect the source DB negatively, and to validate the query/indexes/hints used.
BI-RMA wrote:You could actually import data using a round-robin partitioning algorithm and immediately repartition at the first active stage in DataStage. But this is almost certain to be less effective than reading the data sequentially from the database and partitioning only once.
I'm no specialist, but I usually find re-partitioning extremely fast, even on huge datasets, so in my previous cases it wasn't an issue. Testing that functionality is pretty simple, so in any case you can easily validate (or rule out) the use of partitioned reads.