Look up file set Vs Actual Database lookup

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
yuva010
Participant
Posts: 36
Joined: Thu Apr 24, 2008 7:12 pm

Look up file set Vs Actual Database lookup

Post by yuva010 »

Hi,
We need to look up tables from Oracle, there are two methods -
1. Either we should use direct database table to look up and fetch
2. Load the look up data set files from table and Look up on these files to fetch.

Which one is better?
Thanks,
Yuva.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

You need to define "better". This could mean many things - for instance quickest runtime, or least impact on the database, or using the least system resources, etc.
How big is the table and how big is the source data in comparison? How often does the table change?
yuva010
Participant
Posts: 36
Joined: Thu Apr 24, 2008 7:12 pm

Post by yuva010 »

We are interested in Quick runtime and least resource utilization.
There are different look up tables for diff jobs.
Few tables has millions of records others has very few records for look up.
Source data is always 1 million or so.
Thanks,
Yuva.
swapnilverma
Participant
Posts: 135
Joined: Tue Aug 14, 2007 4:27 am
Location: Mumbai

Post by swapnilverma »

Its based on data volume of input VS reference and the your system configuation.


If input volume is significantly smaller then reference table then go far sparce lookup. ( one SQL query for one record )

if both input are same then go and still the volume is not huge ( based on ur system configuration and apt used) go for direct table lookup.

If both Input and reference are coming from TABLE then you can go for SQL join as well.


Else use lookup file set or JOIN stage ...

You Decide 8) 8) 8)
Thanks
Swapnil

"Whenever you find whole world against you just turn around and Lead the world"
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If your lookup data changes once a month, then it makes sense to create a lookup fileset and use that on each run. If the reference data changes daily or each run of the job you can save the resources required to build the lookup and do it directly. This is just one of several choices you can make.

The goals of minimizing runtime as well as minimizing resource use are usually diametrically opposed. That is similar to expect MPG use on a car to get better with faster speeds - it doesn't work that way.

Mimizing resource use per row processed is a possible goal, but won't necessarily result in maximizing throughput.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

My recollection, when I had a look at the files getting created when the job runs, is that the job runs the same way using both methods but the job startup will be different. If you already have a lookup fileset the job will start processing straight away. If you don't have a lookup fileset and the lookup data cannot all fit into RAM memory the job will create the equivalent of lookup filesets for you in the node scratch space. So for very large lookups either way you get a lookup fileset - it just depends on whether you want to create it every time you run the job or create it less often to a schedule.
yuva010
Participant
Posts: 36
Joined: Thu Apr 24, 2008 7:12 pm

Post by yuva010 »

Hi,
The decision I have to make is - for least runtime and saving utilization of resources - based on volumn of Input Volumn and Look up volumn -

Source/IP Volumn Lookup Volumn
Low Moderate Large
Low Direct table Fileset or DT? Fileset or DT?
Moderate
Large
Thanks,
Yuva.
Post Reply