Lookup Stage VS Outer Join

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Lookup Stage VS Outer Join

Post by my_stm »

Hi all,

I created a job with an estimated 8-10 million records coming from the source, which does a lookup against a lookup file set containing just over 1 million records. However, I was told that if the data volume on both sides (source and lookup reference) is huge, it is better to use a Join stage (in my case an outer join) to get better performance.

I tried comparing the two approaches for the above scenario, a lookup job versus an outer join job, and found that the lookup job still runs faster than the outer join.

I was wondering why so many users recommend going for a join instead of a lookup. Can anyone enlighten me? Or did I miss something?

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Obviously one million rows is still below the "large" threshold.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

ray.wurlod wrote:Obviously one million rows is still below the "large" threshold.
Hi Ray,

I've just created a job that has a source of 8 million records doing a lookup against lookup file sets that also contain 8 million records.

And another job with a source of 8 million records performing an outer join against a reference data set that also contains 8 million records.

Still, the job that does the lookup against the lookup file sets runs faster than the one doing the outer join.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How many processing nodes? For example with 8 nodes that is still only one million rows per node. DataStage parallel jobs are capable of some real "heavy lifting" when properly configured.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France
Contact:

Post by gbusson »

Is it an SMP system?

If yes, setting the reference link partitioning to Entire does not physically duplicate the values on every node; the reference data is placed in shared memory instead.

If your data is not too big (be aware that the number of rows matters, but so does the record length), it can work well!
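
To make the partitioning difference concrete, here is a rough Python sketch (purely illustrative: the node count and data are invented, and DataStage of course does all of this internally). Hash partitioning gives each node only its share of the reference rows, while Entire logically gives every node the whole set.

    # Illustrative sketch only; node count and data are invented.
    NODES = 4
    reference = [{"cust_id": i, "name": f"cust_{i}"} for i in range(100_000)]

    # Hash partitioning: each node holds roughly 1/NODES of the reference,
    # so the source stream must be partitioned the same way to find matches.
    hash_partitions = [[] for _ in range(NODES)]
    for row in reference:
        hash_partitions[row["cust_id"] % NODES].append(row)

    # Entire partitioning: every node logically sees the full reference set.
    # On an SMP box a single physical copy can sit in shared memory, so this
    # need not mean NODES physical copies of the data.
    entire_partitions = [reference] * NODES

    print([len(p) for p in hash_partitions])    # roughly 25,000 rows each
    print([len(p) for p in entire_partitions])  # 100,000 rows each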
mcs_giri
Participant
Posts: 14
Joined: Sat Sep 22, 2007 8:44 am
Location: chennai

Post by mcs_giri »

It's always preferable to go for a Join or a Merge when you are handling a large volume of records, because the Lookup operation is carried out in virtual memory (RAM). If you use a large volume of reference records it will definitely degrade your performance in terms of memory.
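
To illustrate the idea, here is a rough Python sketch of the underlying behaviour (not DataStage code, and it assumes unique keys on the reference side). A lookup behaves like building a hash table over the entire reference set, so memory grows with the reference volume; a join reads two pre-sorted streams and only needs the current key in memory, moving the cost into the sort instead.

    # Rough sketch of the two access patterns; assumes unique reference keys.

    def hash_lookup(source_rows, reference_rows):
        # Lookup-stage style: the whole reference table lives in memory.
        ref = {r["key"]: r for r in reference_rows}
        for row in source_rows:
            match = ref.get(row["key"])         # None means a lookup failure
            yield {**row, **(match or {})}

    def merge_join(source_sorted, reference_sorted):
        # Join-stage style: both inputs are pre-sorted on the key, so memory
        # stays small, but sorting large inputs spills to scratch disk.
        ref_iter = iter(reference_sorted)
        ref_row = next(ref_iter, None)
        for row in source_sorted:
            while ref_row is not None and ref_row["key"] < row["key"]:
                ref_row = next(ref_iter, None)
            if ref_row is not None and ref_row["key"] == row["key"]:
                yield {**row, **ref_row}
            else:
                yield row                       # outer-join behaviour: keep the row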
GIRIDHARANJ
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

Hi everyone,

Thanks very much for the feedback. =)

Apologies for my absence; I was away.

There have been some developments on the lookup-versus-join testing front.

I noticed that, yes, a lookup does consume a lot of memory if the number of records is excessive, which causes the job to abort. A join does not consume that much memory, but it comes at a high price in terms of scratch space, that is, the physical disk space allocated to the directory where scratch resides.

For example, when I perform a join between 3 sources, each holding millions of records with at least 60+ fields, the physical space usage skyrockets to the point where the job aborts with an error that the scratch space threshold has been reached.

So it's a matter of evaluating which is the best option, and sadly I have yet to grasp the technicalities of performance pre-tuning/post-tuning.
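
As a rough, purely illustrative back-of-envelope (the row counts and record lengths below are invented round numbers, not measurements from my job), the two costs scale with different things:

    # Invented sizing figures, for illustration only.
    rows_per_source = 8_000_000
    avg_record_bytes = 60 * 20          # assume ~60 fields at ~20 bytes each
    sources_being_joined = 3

    # Lookup: the reference set has to be held in memory on the nodes.
    lookup_memory_gb = rows_per_source * avg_record_bytes / 1024**3

    # Join: each input is sorted and the sort runs spill to scratch, so
    # scratch demand grows with the combined volume of all the inputs
    # (and the merge phase can need a multiple of this temporarily).
    join_scratch_gb = sources_being_joined * rows_per_source * avg_record_bytes / 1024**3

    print(f"approx lookup memory: {lookup_memory_gb:.1f} GB")
    print(f"approx join scratch:  {join_scratch_gb:.1f} GB")

The point is only that the lookup cost grows with the reference set held in memory, while the join cost grows with everything that has to be sorted to disk.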

I was hoping there is some documentation available that explains, and cautions developers about, performance tuning: the do's and don'ts, and the preliminary planning in terms of hardware/memory before developing a job.

Could anyone point me to such documents, other than approaching IBM support?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No documentation is available other than the manuals, the training classes and the conference presentations. You have to attend to get the last two.

It's supply and demand - if you keep running out of resources, supply more (physical memory, disk space, etc.) or reduce demand.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vijay.rajendran
Participant
Posts: 6
Joined: Thu Sep 29, 2005 8:39 pm
Location: Sydney, Australia

Post by vijay.rajendran »

As Ray mentioned, there is no document available for this. The tuning is purely dependent on the volume of data and the hardware/resources available.

In your scenario (i.e. the input being 8 million+ and the lookup reference being roughly 1 million+) you are better off using a lookup. Something you have to look into is how much memory you have on your machine and how much your lookup data set will take.

If the record length is small and the number of records is approximately 1 million or so, I would go for the Lookup stage.

Good luck.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You should also scrutinize (with an intense scruit) whether you really need all those columns and all those rows in your reference data set.

For example, if all you are determining is existence, all you need is the key columns. If you're mapping a business key to a surrogate key, all you need are those two keys. Only load current rows: don't load any row into the reference data set whose effective_end_date is in the past or whose effective_start_date is in the future. And so on.
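
For instance, a rough sketch of that kind of pre-filter (the column names cust_key, surrogate_key, effective_start_date and effective_end_date are invented here for illustration):

    # Illustrative pre-filter for a reference extract; field names are invented.
    from datetime import date

    def trim_reference(rows):
        today = date.today()
        for r in rows:
            # Keep only currently effective rows...
            if r["effective_start_date"] <= today <= r["effective_end_date"]:
                # ...and only the columns the lookup actually needs.
                yield {"cust_key": r["cust_key"], "surrogate_key": r["surrogate_key"]}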
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

Hi all,

Thanks again for the feedback. I will make sure to use only the fields that are necessary.

In any case, here are my server specs:

ETL Development Server ( 10.0.110.8 ) Spec.
Operating System : AIX 5.3
CPU : 2 CPU
Memory : 6 GB
Disk Space : 100 GB

For the etldev1 user:
File size limit = unlimited

Any tips on how I can properly optimize the job, other than limiting the use of active stages (Transformers, Sort stages, etc.) and keeping an eye on the number of fields/records?


Is there a way for me to manually determine how evenly the entire process is split between the 2 CPUs/nodes I have?
Last edited by my_stm on Tue Nov 13, 2007 11:47 pm, edited 3 times in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Use Monitor in Director. This can show what's happening on each partition. Then relate this to the score (you need $APT_DUMP_SCORE set) to relate the work to the node(s) on which it is being performed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.