Performance issue reading data from Oracle connector

Post questions here related to DataStage Enterprise/PX Edition, for such areas as Parallel job design, Parallel data sets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Performance issue reading data from Oracle connector

Post by dsuser7 »

Hi All,

I'm selecting a few columns from a table with around 70 million records (all records are selected, no WHERE clause).

I'm then joining that data with a data set of around 7 million records.

When 6 fields are selected from the Oracle stage, the job finishes in less than 5 minutes, whereas it takes around 30 minutes when 30 fields are selected.

Is this something to do with indexes?

When 6 fields are selected, all but 1 have indexes defined; when 30 fields are selected, all but 4 have indexes defined on them.

Thanks for any help!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Can you make a copy of the job with no join, just reading your Oracle source and writing to a Peek stage, and measure the time for 6 and for 30 fields? Also, can you estimate the average record length in bytes for the 6-field and 30-field options?
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Post by dsuser7 »

Thank you for your reply.

I have run jobs as suggested and here are the results:

1. Job with 6 columns runs in 3 min 3 sec [148,720 rows/sec] --- avg record length in bytes = 57
2. Job with 30 columns runs in 7 min 26 sec [368,061 rows/sec] --- avg record length in bytes = 143

Is the difference due to indexes?
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Post by dsuser7 »

Would creating indexes on the other columns help performance of the job?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

With no where clause? Nope.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Post by dsuser7 »

There is no WHERE clause in the select query. So the increase in run time is just because more columns are selected (more bytes)?
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Post by dsuser7 »

Yikes, I just realized I inverted the rows/sec stats for the two jobs. It should be:

1. Job with 6 columns runs in 3 min 3 sec [368,061 rows/sec] --- avg record length in bytes = 57
2. Job with 30 columns runs in 7 min 26 sec [148,720 rows/sec] --- avg record length in bytes = 143
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yes. At a very high level, more bytes = more time.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Post by dsuser7 »

This really breaks my heart. The difference between roughly 3 minutes and 7.5 minutes may not seem like much, but the actual job I'm concerned with joins the 70 million records with the 7 million record data set, and it takes around 40 minutes.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In response to your stats: rows per second is a fairly meaningless metric. You're better off comparing bytes per second, using the total actual bytes rather than the average record length.
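To illustrate with the corrected figures posted earlier in the thread (a quick sketch; the average record lengths are the posted estimates, so treat the result as approximate):

```python
# Compare the two runs by bytes/sec instead of rows/sec, using the
# corrected stats from the thread (avg record length in bytes).
runs = {
    "6 columns":  {"rows_per_sec": 368_061, "arl_bytes": 57},
    "30 columns": {"rows_per_sec": 148_720, "arl_bytes": 143},
}

for name, r in runs.items():
    mb_per_sec = r["rows_per_sec"] * r["arl_bytes"] / 1_000_000
    print(f"{name}: {mb_per_sec:.1f} MB/sec")
```

Both runs come out at roughly 21 MB/sec, which is exactly the point: the read moves bytes at a near-constant rate, so more bytes per row means proportionally more elapsed time, not a slower connector.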
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Performance issue reading data from Oracle connector

Post by chulett »

dsuser7 wrote:When 6 fields are selected from oracle stage, the job finishes in less than 5 mins whereas the job takes around 30 mins to finish when 30 fields are selected.
FYI - The time differential comes from the join and the amount of data it is throwing around, not any perceived 'performance issue' with the select from the Oracle connector.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dsuser7
Participant
Posts: 41
Joined: Sat Nov 29, 2008 2:21 am

Re: Performance issue reading data from Oracle connector

Post by dsuser7 »

Thanks Craig, I'd just come to the same conclusion!
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Performance issue reading data from Oracle connector

Post by kwwilliams »

Have you tried enabling partitioned reads? If the table is partitioned, the connector can read from each Oracle partition. If the table is not partitioned, you can still read in parallel:

1. Enable partitioned reads = Yes
2. Partitioned reads method = Rowid round robin
3. Table name for partitioned reads = the table you are extracting data from

Doing this will create an Oracle process for each node in your configuration file, which should increase the number of bytes being read per second to help you reduce the overall runtime of the job.
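For the unpartitioned case, the rowid round robin method conceptually gives each reader a predicate along the lines of `MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID), <node count>) = <node index>` (the exact SQL the connector generates may differ). A small Python sketch of the idea, with made-up block numbers:

```python
# Round-robin assignment of rowid blocks to reader nodes: each node
# reads the blocks whose number modulo the node count equals its index.
# Block numbers here are invented purely for illustration.
node_count = 4
blocks = list(range(100))  # pretend the table occupies blocks 0..99

partitions = {
    node: [b for b in blocks if b % node_count == node]
    for node in range(node_count)
}

# The partitions are disjoint and together cover every block, so each
# source row is read exactly once; any duplicates seen downstream must
# already exist in the source table.
all_assigned = sorted(b for part in partitions.values() for b in part)
assert all_assigned == blocks
```

This is why the duplicate-rows report later in the thread points back at the source data rather than at the partitioning method.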
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Re: Performance issue reading data from Oracle connector

Post by prasson_ibm »

Hi,
I have tested the parallel read with this logic, but it seems the data is duplicated across partitions. :(
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If that is true with rowid round robin partitioning, then you have duplicate rows in your source.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.