Best Ideas to solve a specific performance bottleneck needed

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dcguuenther
Participant
Posts: 13
Joined: Wed Feb 08, 2006 3:06 pm
Location: Chicago

Best Ideas to solve a specific performance bottleneck needed

Post by dcguuenther »

Here is my performance bottleneck.

Input = 2 billion records.
Reference Data = 10 million records.

We currently are using a lookup stage as opposed to a join stage to avoid having to sort the input 2 billion records from the mainframe file.

The Lookup Stage is set to sequential to not load 10 million records into memory for each partition we are running on. The fact that this is sequential is killing me. The job is currently looking up to the the Oracle Stage, which is doing a table read (aka Normal Lookup instead of Sparse)

What ideas do you have in order to perform this an alternative way. The only idea I can have left to try is to try using a lookupfileset. Should this help. Any other ideas?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Identically partition the two inputs, based on the "join key" field(s).

The lookup data will be loaded into memory in every case except a sparse lookup; partitioned data allow the amount of data per node to be minimized.

Use as few fields (columns) as possible on the reference input link.
Use as few rows as possible on the reference input link - that is, include a WHERE clause in the Oracle query so as to return only "current" rows.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply