Lookup Stage VS Outer Join

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Lookup Stage VS Outer Join

Post by my_stm »

Hi all,

I created a job with an estimated 8-10 million records coming from the source, which does a lookup against a lookup file set containing just over 1 million records. However, I was told that if the data volume on both sides (source and lookup reference) is huge, it is better to use a Join stage (in my case an outer join) to get better performance.

I tried comparing the two approaches for the above scenario, a lookup job versus an outer join job, and found that the lookup job still runs faster than the outer join.

I was wondering why so many users recommend going for a join instead of a lookup. Can anyone enlighten me? Or did I miss something?

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Obviously one million rows is still below the "large" threshold.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

ray.wurlod wrote:Obviously one million rows is still below the "large" threshold.
Hi Ray,

I've just created a job that has a source of 8 million records doing a lookup against lookup file sets that also contain 8 million records.

And another job with a source of 8 million records performing an outer join against a reference data set that also contains 8 million records.

Still, the job that does the lookup against the lookup file sets runs faster than the one doing the outer join.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How many processing nodes? For example with 8 nodes that is still only one million rows per node. DataStage parallel jobs are capable of some real "heavy lifting" when properly configured.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France
Contact:

Post by gbusson »

Is it an SMP system?

If yes, setting the reference link partitioning to Entire does not physically duplicate the values on every node; the reference data is placed in shared memory instead.

If your data is not too big (be aware that the number of rows matters, but so does the record length), it can work well!
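
To make the partitioning difference concrete, here is a rough Python sketch (purely illustrative: the node count and data are invented, and DataStage of course does all of this internally). Hash partitioning gives each node only its share of the reference rows, while Entire logically gives every node the whole set.

    # Illustrative sketch only; node count and data are invented.
    NODES = 4
    reference = [{"cust_id": i, "name": f"cust_{i}"} for i in range(100_000)]

    # Hash partitioning: each node holds roughly 1/NODES of the reference,
    # so the source stream must be partitioned the same way to find matches.
    hash_partitions = [[] for _ in range(NODES)]
    for row in reference:
        hash_partitions[row["cust_id"] % NODES].append(row)

    # Entire partitioning: every node logically sees the full reference set.
    # On an SMP box a single physical copy can sit in shared memory, so this
    # need not mean NODES physical copies of the data.
    entire_partitions = [reference] * NODES

    print([len(p) for p in hash_partitions])    # roughly 25,000 rows each
    print([len(p) for p in entire_partitions])  # 100,000 rows each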
mcs_giri
Participant
Posts: 14
Joined: Sat Sep 22, 2007 8:44 am
Location: chennai

Post by mcs_giri »

It's always preferable to go for a Join or a Merge when you are handling a large volume of records, because the Lookup operation is carried out in virtual memory (RAM). If you use a large volume of reference records it will definitely degrade your performance in terms of memory.
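
To illustrate the idea, here is a rough Python sketch of the underlying behaviour (not DataStage code, and it assumes unique keys on the reference side). A lookup behaves like building a hash table over the entire reference set, so memory grows with the reference volume; a join reads two pre-sorted streams and only needs the current key in memory, moving the cost into the sort instead.

    # Rough sketch of the two access patterns; assumes unique reference keys.

    def hash_lookup(source_rows, reference_rows):
        # Lookup-stage style: the whole reference table lives in memory.
        ref = {r["key"]: r for r in reference_rows}
        for row in source_rows:
            match = ref.get(row["key"])         # None means a lookup failure
            yield {**row, **(match or {})}

    def merge_join(source_sorted, reference_sorted):
        # Join-stage style: both inputs are pre-sorted on the key, so memory
        # stays small, but sorting large inputs spills to scratch disk.
        ref_iter = iter(reference_sorted)
        ref_row = next(ref_iter, None)
        for row in source_sorted:
            while ref_row is not None and ref_row["key"] < row["key"]:
                ref_row = next(ref_iter, None)
            if ref_row is not None and ref_row["key"] == row["key"]:
                yield {**row, **ref_row}
            else:
                yield row                       # outer-join behaviour: keep the row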
GIRIDHARANJ
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

Hi everyone,

Thanks very much for the feedback. =)

Apologies for my absence; I was away.

There have been some developments on the lookup-versus-join testing front.

I noticed that, yes, a lookup does consume a lot of memory if the number of records is excessive, which causes the job to abort. A join does not consume that much memory, but it comes at a high price in terms of scratch space, that is, the physical disk space allocated to the directory where scratch resides.

For example, when I perform a join between 3 sources, each holding millions of records with at least 60+ fields, the physical space usage skyrockets to the point where the job aborts with an error that the scratch space threshold has been reached.

So it's a matter of evaluating which is the best option, and sadly I have yet to grasp the technicalities of performance pre-tuning/post-tuning.
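
As a rough, purely illustrative back-of-envelope (the row counts and record lengths below are invented round numbers, not measurements from my job), the two costs scale with different things:

    # Invented sizing figures, for illustration only.
    rows_per_source = 8_000_000
    avg_record_bytes = 60 * 20          # assume ~60 fields at ~20 bytes each
    sources_being_joined = 3

    # Lookup: the reference set has to be held in memory on the nodes.
    lookup_memory_gb = rows_per_source * avg_record_bytes / 1024**3

    # Join: each input is sorted and the sort runs spill to scratch, so
    # scratch demand grows with the combined volume of all the inputs
    # (and the merge phase can need a multiple of this temporarily).
    join_scratch_gb = sources_being_joined * rows_per_source * avg_record_bytes / 1024**3

    print(f"approx lookup memory: {lookup_memory_gb:.1f} GB")
    print(f"approx join scratch:  {join_scratch_gb:.1f} GB")

The point is only that the lookup cost grows with the reference set held in memory, while the join cost grows with everything that has to be sorted to disk.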

I was hoping there is some documentation available that explains, and cautions developers about, performance tuning: the do's and don'ts, and the preliminary planning in terms of hardware/memory before developing a job.

Could anyone point me to such documents, other than approaching IBM support?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No documentation is available other than the manuals, the training classes and the conference presentations. You have to attend to get the last two.

It's supply and demand - if you keep running out of resources, supply more (physical memory, disk space, etc.) or reduce demand.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vijay.rajendran
Participant
Posts: 6
Joined: Thu Sep 29, 2005 8:39 pm
Location: Sydney, Australia

Post by vijay.rajendran »

As Ray mentioned, there is no document available for this. The tuning is purely dependent on the volume of data and the hardware/resources available.

In your scenario (i.e. the input being 8 million+ and the lookup reference being roughly 1 million+) you are better off using a lookup. Something you have to look into is how much memory you have on your machine and how much your lookup data set will take.

If the record length is small and the number of records is approximately 1 million or so, I would go for the Lookup stage.

Good luck.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You should also scrutinize (with an intense scruit) whether you really need all those columns and all those rows in your reference data set.

For example, if all you are determining is existence, all you need is the key columns. If you're mapping a business key to a surrogate key, all you need are those two keys. Only load current rows: don't load any row into the reference data set whose effective_end_date is in the past or whose effective_start_date is in the future. And so on.
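
For instance, a rough sketch of that kind of pre-filter (the column names cust_key, surrogate_key, effective_start_date and effective_end_date are invented here for illustration):

    # Illustrative pre-filter for a reference extract; field names are invented.
    from datetime import date

    def trim_reference(rows):
        today = date.today()
        for r in rows:
            # Keep only currently effective rows...
            if r["effective_start_date"] <= today <= r["effective_end_date"]:
                # ...and only the columns the lookup actually needs.
                yield {"cust_key": r["cust_key"], "surrogate_key": r["surrogate_key"]}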
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
my_stm
Premium Member
Posts: 58
Joined: Mon Mar 19, 2007 9:49 pm
Location: MY

Post by my_stm »

Hi all,

Thanks again for the feedback. I will make sure to use only the fields that are necessary.

In any case, here are my server specs:

ETL Development Server ( 10.0.110.8 ) Spec.
Operating System : AIX 5.3
CPU : 2 CPU
Memory : 6 GB
Disk Space : 100 GB

For the etldev1 user:
File size limit = unlimited

Any tips on how I can properly optimize the job, other than limiting the use of active stages (Transformers, Sort stages, etc.) and keeping an eye on the number of fields/records?


Is there a way for me to manually determine how evenly the entire process is split between the 2 CPUs/nodes I have?
Last edited by my_stm on Tue Nov 13, 2007 11:47 pm, edited 3 times in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Use Monitor in Director. This can show what's happening on each partition. Then relate this to the score (you need $APT_DUMP_SCORE set) to relate the work to the node(s) on which it is being performed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.