
Best Ideas to solve a specific performance bottleneck needed

Posted: Tue Oct 02, 2007 8:25 am
by dcguuenther
Here is my performance bottleneck.

Input = 2 billion records.
Reference Data = 10 million records.

We are currently using a Lookup stage rather than a Join stage to avoid having to sort the 2 billion input records coming from the mainframe file.

The Lookup stage is set to run sequentially so that the 10 million reference records are not loaded into memory once for each partition we run on. The fact that it runs sequentially is killing me. The job currently looks up against the Oracle stage, which does a full table read (i.e. a Normal lookup rather than a Sparse lookup).

What ideas do you have for doing this a different way? The only thing I have left to try is a Lookup File Set. Would that help? Any other ideas?

Posted: Tue Oct 02, 2007 1:51 pm
by ray.wurlod
Identically partition the two inputs, based on the "join key" field(s).

The lookup data will be loaded into memory in every case except a sparse lookup; partitioning the data minimizes the amount of reference data each node has to hold.
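
To illustrate the idea (this is a minimal plain-Python sketch, not DataStage syntax; the key name "acct_id", the node count and the sample rows are all invented), hash-partitioning both links with the same function on the join key guarantees that an input row and its matching reference row land on the same node, so each node only needs its own slice of the reference data in memory:

import zlib

def node_for(key, num_nodes):
    # Deterministic hash partitioning; the same function must be applied to
    # both the input link and the reference link.
    return zlib.crc32(str(key).encode()) % num_nodes

def partition(rows, key_field, num_nodes):
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[node_for(row[key_field], num_nodes)].append(row)
    return parts

num_nodes = 4
input_rows = [{"acct_id": i % 10, "amount": i} for i in range(100)]          # stands in for the 2 billion input rows
reference_rows = [{"acct_id": i, "acct_name": "name%d" % i} for i in range(10)]  # stands in for the 10 million reference rows

input_parts = partition(input_rows, "acct_id", num_nodes)
reference_parts = partition(reference_rows, "acct_id", num_nodes)

for n in range(num_nodes):
    # Each node builds a lookup table from only its share of the reference
    # data (roughly total rows / num_nodes) instead of the full set.
    lookup = {r["acct_id"]: r for r in reference_parts[n]}
    for row in input_parts[n]:
        match = lookup.get(row["acct_id"])   # if a match exists, it is on this node

The same principle is what the parallel engine gives you when both links use identical Hash partitioning on the join key: the parallel lookup no longer needs a full copy of the reference data per partition.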

Use as few fields (columns) as possible on the reference input link.
Use as few rows as possible on the reference input link - that is, include a WHERE clause in the Oracle query so as to return only "current" rows.
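
As a rough sketch of those last two points (the table and column names ref_accounts, acct_id, acct_name and end_date are invented for illustration, not from the original post):

# Hypothetical user-defined SQL for the Oracle reference link: select only the
# key plus the columns actually needed downstream, and only "current" rows.
reference_sql = """
    SELECT acct_id, acct_name
    FROM   ref_accounts
    WHERE  end_date IS NULL
"""

# The fewer columns and rows the query returns, the smaller the lookup table
# that must be held in memory on each node. Sample rows stand in for the
# query result set.
reference_rows = [
    {"acct_id": 1, "acct_name": "ACME"},
    {"acct_id": 2, "acct_name": "Globex"},
]
lookup = {r["acct_id"]: r["acct_name"] for r in reference_rows}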