How to make Oracle connector read source data in parallel

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

How to make Oracle connector read source data in parallel

Post by evee1 »

I am having trouble configuring parallel reads in the Oracle connector. The entire data set is read twice - once per node.

Firstly, my source table is not partitioned in any way. My understanding is that I still should be able to read the data in parallel.

I have configured the Oracle connector in the following way:

Read mode: Select
Generate SQL at runtime: No
Select statement: SELECT <many fields> FROM <source_table> WHERE ROWNUM <= 10000
(I have restricted the number of rows for testing purposes.)
Enable partitioned reads: Yes
Partitioned reads method: Modulus
(I'm not sure which method I should choose here.)
Column name for partitioned reads: ITEM_KEY
(The field ITEM_KEY is Decimal(10).)

When I run the job the following entries appear in the log:

Code: Select all

The connector will use table <table_name> as input for the partitioned reads method.
The connector will run in parallel on 2 processing nodes.
The connector will use the following SELECT statement at runtime: SELECT <many fields> FROM <source_table> WHERE ROWNUM <= 10000
main_program: APT_SortedGroup2Operator::describeOperator nkeys: 7
...,0: Number of rows fetched on the current node: 10000.
...,1: Number of rows fetched on the current node: 10000.
I have read the documentation, but I'm not sure what I am doing wrong.
I'm not sure whether the SQL statement should be modified in some way, or whether that only happens when the Generate SQL at runtime option is set to Yes (which I'm planning to use in the future).

I would appreciate some help, and also a bit more explanation of how partitioning works in the case of the Oracle connector.
Thanks.
ppgoml
Participant
Posts: 58
Joined: Mon Aug 20, 2007 11:00 pm

Post by ppgoml »

Each node generates its own SQL and sends it to Oracle to execute. In this case DataStage appends a "MOD(your_column, number_of_nodes) = node_number" condition to your WHERE clause. Since the ROWNUM <= 10000 predicate is evaluated separately for each node's query, each node returns up to 10000 rows. That's the expected behaviour.
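For illustration, with the Modulus method on a 2-node configuration the statements sent to Oracle would look roughly like this - a sketch based on the log in the first post; the exact rewriting the connector performs may differ:

Code: Select all

-- Node 0: ROWNUM is evaluated per query, so this fetches up to 10000 rows
SELECT <many fields> FROM <source_table>
WHERE ROWNUM <= 10000 AND MOD(ITEM_KEY, 2) = 0

-- Node 1: also fetches up to 10000 rows from its own partition
SELECT <many fields> FROM <source_table>
WHERE ROWNUM <= 10000 AND MOD(ITEM_KEY, 2) = 1

That is why the log shows 10000 rows fetched on each node, 20000 in total.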
Jack Li
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

Hm, if this is the correct behaviour then it is not what I'm looking for - at the moment I'm reading 20000 records rather than 10000.

I would still like to know whether I can just "split" the rows between the two processing nodes, so they are read faster.
In the future this job needs to be able to process 600 million records in a limited time. The Oracle connector probably won't be my only problem, but I would like to know how to address the parallel read anyway.
ppgoml
Participant
Posts: 58
Joined: Mon Aug 20, 2007 11:00 pm

Post by ppgoml »

If you want to read 10000 rows in total while reading on 2 nodes, you should put ROWNUM <= 5000 in your SQL.
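As a sketch (again assuming the Modulus rewriting shown in the log), with ROWNUM <= 5000 each generated statement would return at most 5000 rows, 10000 in total:

Code: Select all

-- Node 0: first 5000 rows whose ITEM_KEY is even
SELECT <many fields> FROM <source_table>
WHERE ROWNUM <= 5000 AND MOD(ITEM_KEY, 2) = 0

-- Node 1: first 5000 rows whose ITEM_KEY is odd
SELECT <many fields> FROM <source_table>
WHERE ROWNUM <= 5000 AND MOD(ITEM_KEY, 2) = 1

Note that these 10000 rows are not necessarily the same rows a single ROWNUM <= 10000 query would return, since each node limits its own partition independently.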

You are absolutely heading in the right direction: enabling partitioned reads does split the rows between the nodes, and reading the database in parallel improves performance considerably.

If you are loading the data rather than reading it, parallel loading is the default option.
Jack Li
Post Reply