
Performance issue reading data from Oracle connector

Posted: Wed Nov 06, 2013 8:28 pm
by dsuser7
Hi All,

I'm selecting a few columns from a table with around 70 million records (all records are being selected, no where clause).

I'm then joining that data with a data set of around 7 million records.

When 6 fields are selected from the Oracle stage, the job finishes in less than 5 mins, whereas it takes around 30 mins to finish when 30 fields are selected.

Is this something to do with indexes?

When 6 fields are selected, all but 1 have indexes defined on them; when 30 fields are selected, all but 4 have indexes defined.

Thanks for any help!

Posted: Thu Nov 07, 2013 1:14 am
by ArndW
Can you make a copy of the job with no join that just reads your Oracle source and writes to a PEEK stage, measuring the time for 6 and for 30 fields? Also, can you estimate the average record length in bytes for the 6-field and 30-field options?

Posted: Thu Nov 07, 2013 7:27 am
by dsuser7
Thank you for your reply.

I have run the jobs as suggested, and here are the results:

1. Job with 6 columns runs in 3 min 3 sec [148,720 rows/sec] --- Avg record length in bytes = 57
2. Job with 30 columns runs in 7 min 26 sec [368,061 rows/sec] --- Avg record length in bytes = 143

Is the difference due to indexes?

Posted: Mon Nov 11, 2013 10:37 am
by dsuser7
Would creating indexes on the other columns help performance of the job?

Posted: Mon Nov 11, 2013 11:03 am
by chulett
With no where clause? Nope.

Posted: Mon Nov 11, 2013 11:16 am
by dsuser7
There is no where clause in the select query. So the increase in the time taken for the job to finish is just because more columns (more bytes) are being selected?

Posted: Mon Nov 11, 2013 11:19 am
by dsuser7
Yikes, I just realized I inverted the rows/sec stats for the jobs. It should be:

1. Job with 6 columns runs in 3 min 3 sec [368,061 rows/sec] --- Avg record length in bytes = 57
2. Job with 30 columns runs in 7 min 26 sec [148,720 rows/sec] --- Avg record length in bytes = 143

Posted: Mon Nov 11, 2013 11:21 am
by chulett
Yes. At a very high level, more bytes = more time.

Posted: Mon Nov 11, 2013 11:25 am
by dsuser7
This really breaks my heart, because the time difference between 3.3 min and 7 min may not seem like much; but the actual job I'm concerned with joins the 70 million records with the 7 million record dataset, and it takes around 40 min.

Posted: Mon Nov 11, 2013 11:26 am
by chulett
In response to your stats - rows per second is a fairly meaningless metric. You're better off comparing bytes per second, using the total actual bytes rather than the ARL (average record length).
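
As a rough check using your corrected numbers (with the ARL standing in for actual bytes, so treat these as approximate):

6 columns: 368,061 rows/sec x 57 bytes ≈ 21.0 MB/sec
30 columns: 148,720 rows/sec x 143 bytes ≈ 21.3 MB/sec

Both jobs are pulling roughly the same bytes per second, which is what you'd expect when the read is throughput-bound rather than index-bound.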

Re: Performance issue reading data from Oracle connector

Posted: Mon Nov 11, 2013 11:41 am
by chulett
dsuser7 wrote:When 6 fields are selected from oracle stage, the job finishes in less than 5 mins whereas the job takes around 30 mins to finish when 30 fields are selected.
FYI - The time differential comes from the join and the amount of data it is throwing around, not any perceived 'performance issue' with the select from the Oracle connector.

Re: Performance issue reading data from Oracle connector

Posted: Mon Nov 11, 2013 12:07 pm
by dsuser7
Thanks Craig, I was just thinking the same!

Re: Performance issue reading data from Oracle connector

Posted: Mon Nov 11, 2013 7:56 pm
by kwwilliams
Have you tried enabling partitioned reads? If the table is a partitioned table, you can have it read from each Oracle partition. If the table is not partitioned, you can still read in parallel:

1. Enable partitioned reads = Yes
2. Partitioned reads method = Rowid round robin
3. Table name for partitioned reads = the table name you are extracting data from

Doing this will create an Oracle process for each node in your configuration file, which should increase the number of bytes read per second and reduce the overall runtime of the job.
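
Conceptually, rowid round robin deals each row out to a node based on its ROWID. As a sketch (not the connector's exact generated SQL; my_table and the column names are placeholders), node 2 of a 4-node configuration would effectively run something like:

-- Each node gets the rows whose row number modulo the node count
-- equals its own node number.
SELECT col_a, col_b
FROM my_table
WHERE MOD(DBMS_ROWID.ROWID_ROW_NUMBER(ROWID), 4) = 2;

Since each ROWID satisfies the MOD predicate for exactly one node number, every source row should be read exactly once across the nodes.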

Re: Performance issue reading data from Oracle connector

Posted: Tue Nov 12, 2013 2:51 am
by prasson_ibm
Hi,
I have tested the parallel read with this logic, but it seems the data is duplicated in the other partitions. :(

Posted: Tue Nov 12, 2013 3:18 am
by ray.wurlod
If that is true with rowid round robin partitioning, then you have duplicate rows in your source.
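
A quick way to check (a sketch; substitute the actual columns you are selecting for col_a and col_b):

-- Any rows returned here are combinations that occur more than
-- once in the source table.
SELECT col_a, col_b, COUNT(*)
FROM my_table
GROUP BY col_a, col_b
HAVING COUNT(*) > 1;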