Database Lookup in Server and Parallel, Hash/Data Set Lookup

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Database Lookup in Server and Parallel, Hash/Data Set Lookup

Post by dsuser1 »

I have been going through some of the postings in this forum on lookups in general and I am trying to draw some conclusions. Please weigh in with your expert advice.

If I have to do a database lookup (e.g. against an Oracle table) in a server job, there is no separate Lookup stage and the Oracle stage can be used directly for the lookup from a Transformer stage. Here each record is looked up against the database using an SQL statement issued with the key condition in the WHERE clause. Is this correct? (Or does Ascential bring the lookup table into memory and do the matching there?)

Now if I have to do a database lookup (e.g. against an Oracle table) in a parallel job, there is a separate Lookup stage and the Oracle stage can be linked to the Lookup stage for the lookup. Here there is no key matching issued to the database in a WHERE clause. The SQL that runs on the database can return a set of records, and I guess the matching is done by the Lookup stage in the memory of the ETL server. Is this correct?

You can also download the table data to a data set or hash file and do a lookup against that. What are the pros and cons of these approaches? Which is the best one in terms of performance? If I have a large database table (UDB) to be looked up in my ETL job, which method should I go for?

T.I.A.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Database Lookup in Server and Parallel, Hash/Data Set Lookup

Post by Teej »

dsuser1 wrote:If I have to do a database lookup (e.g. against an Oracle table) in a server job, there is no separate Lookup stage and the Oracle stage can be used directly for the lookup from a Transformer stage. Here each record is looked up against the database using an SQL statement issued with the key condition in the WHERE clause. Is this correct? (Or does Ascential bring the lookup table into memory and do the matching there?)
Correct. This is why it is well advised, at the very minimum, to put a Hash File stage between the database stage and the lookup link. That reduces the database access to as little as one fetch to populate the hash file, instead of one query per input row. There are limitations to this -- particularly the 2GB size limit for 32-bit hash files.
Now if I have to do a database lookup (e.g. against an Oracle table) in a parallel job, there is a separate Lookup stage and the Oracle stage can be linked to the Lookup stage for the lookup. Here there is no key matching issued to the database in a WHERE clause. The SQL that runs on the database can return a set of records, and I guess the matching is done by the Lookup stage in the memory of the ETL server. Is this correct?
Correct. It is all loaded into memory (or onto local disk) as data sets before the lookup is processed.

A little caveat: there are two different types of lookups you can do (at least that I know of for Oracle to the Lookup stage), a Normal lookup and a Sparse lookup. Each has its own benefits: a Normal lookup pulls the whole reference table across and matches in engine memory, while a Sparse lookup fires a keyed query at the database for each input row.
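To make that concrete, here is roughly what each type sends to Oracle (a sketch only; the table and column names are made up):

Code: Select all

-- Normal lookup: one bulk extract, matching done in engine memory
SELECT cust_id, cust_name, cust_status
FROM ref_customers;

-- Sparse lookup: one keyed query per input row. ORCHESTRATE.cust_id is
-- the placeholder the parallel engine substitutes with the incoming key
SELECT cust_id, cust_name, cust_status
FROM ref_customers
WHERE cust_id = ORCHESTRATE.cust_id;

Roughly speaking, Normal wins when the reference table fits comfortably in memory; Sparse wins when the input stream is small relative to the reference table.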
You can also download the table data to a data set or hash file and do a lookup against that. What are the pros and cons of these approaches? Which is the best one in terms of performance? If I have a large database table (UDB) to be looked up in my ETL job, which method should I go for?
Your best method is to upgrade to 7.0.1; the PX Lookup stage's use of resources and its performance are said to be much improved, especially for large data sets (in the millions of rows).

Honestly, if you have a very large amount of data being shifted, it is well advised to give PX a try for that particular workload. The good ol' pros here have a number of tricks to make DataStage Server more parallel-like, but I have observed better performance with PX on 6.x, and have no doubt that 7.0.1 (I have seen some of its components and discussed it with several Ascential folks) will provide the best PX lookup performance thus far. It should be out this month.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Re: Database Lookup in Server and Parallel, Hash/Data Set Lookup

Post by kcbland »

dsuser1 wrote: If I have to do a database lookup (e.g. against an Oracle table) in a server job, there is no separate Lookup stage and the Oracle stage can be used directly for the lookup from a Transformer stage. Here each record is looked up against the database using an SQL statement issued with the key condition in the WHERE clause. Is this correct?
Yes. At startup time the job establishes a connection to the instance. For each and every row in the transformer primary input stream a SQL statement will be executed against the instance. 100K rows on the input stream equals 100K queries fired against that instance.
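To put that in SQL terms (table and column names made up), every input row fires something like:

Code: Select all

SELECT cust_name, cust_status
FROM ref_customers
WHERE cust_id = :1  -- bind variable filled from the current row's key

That round trip to the instance, repeated once per row, is the cost you are paying.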
dsuser1 wrote: (Or does Ascential bring the lookup table into memory and do the matching there?)
No.
dsuser1 wrote: Now if I have to do a database lookup (e.g. against an Oracle table) in a parallel job, there is a separate Lookup stage and the Oracle stage can be linked to the Lookup stage for the lookup. Here there is no key matching issued to the database in a WHERE clause. The SQL that runs on the database can return a set of records, and I guess the matching is done by the Lookup stage in the memory of the ETL server. Is this correct?
Yes. See my post here for a method of reducing the volume you're going to bring into DataStage (Server or Parallel, doesn't matter) for reference:
viewtopic.php?t=85958
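One common way to cut that volume (a sketch with hypothetical names, not necessarily the exact method in the linked post): land the distinct keys from your source feed in a scratch table and let the database do the restriction, so only reference rows you can actually match come across:

Code: Select all

-- stage_source_keys holds the distinct keys from the source feed
SELECT r.cust_id, r.cust_name, r.cust_status
FROM ref_customers r
WHERE r.cust_id IN (SELECT s.cust_id FROM stage_source_keys s);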
dsuser1 wrote: You can also download the table data to a data set or hash file and do a lookup against that. What are the pros and cons of these approaches?
Hash files are very supportive of a sandbox approach to doing ETL. They have a significant number of benefits over a relational sandbox solution. Parallel jobs have restricted sandbox applicability with their ".ds" datasets. You may want to read this lengthy soapbox I did about sandboxes:
viewtopic.php?t=85822
dsuser1 wrote: Which is the best one in terms of performance? If I have a large database table (UDB) to be looked up in my ETL job, which method should I go for?
Speed kills. I'll use an analogy first. You know those "top fuel dragsters"? Those are those long, skinny race cars with really big wheels on the back and tiny bicycle wheels in the front. They have a 5 gallon fuel tank and burn alcohol. They go about 300 mph in a straight line. Now, if you need to go really fast in a straight line, get yourself one of those. If you need maximum steerability and ruggedness that you can take off-road, go get yourself a Mini Cooper S (the old ones, not those Yuppie BMW knockoffs).

So, Parallel jobs have a lot of overhead (build ops, hardware, node/resource pool configuration, VERY picky source data quality, etc.) but perform the fastest. Hand-written C code gives maximum flexibility and ruggedness, but is pretty crude. Server jobs are middle of the road.

The answer is: code for maintainability as much as possible. Theoretically, Parallel jobs should blow everything else away, but consider what you are sacrificing in your need for speed. We were doing fine before Parallel technology came along; we still loaded terabyte warehouses. Smart, thoughtful coding with Server jobs should be adequate, and lazy coding will not be overcome by Parallel technology. Smart, thoughtful coding with Parallel is the right answer. You have to know what your requirements are before you choose the technology.

Let me give one last poor analogy. Which is the faster database: Oracle, DB2, SQL Server, or Red Brick? The answer is: it depends. It depends on the data model; it depends on lots of things. Are you using it for OLAP or OLTP? What's your hardware? What's your user load?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Database Lookup in Server and Parallel, Hash/Data Set Lookup

Post by bigpoppa »

For server jobs, hash file lookups have been considerably faster for us than looking up directly against Oracle. I had a job with six Oracle lookups that took 9 hours to complete. I replaced the six Oracle lookups with hash files, and the job took 6 minutes.
-BP
dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Post by dsuser1 »

Thanks for all the inputs. Let me add more clarity to what I am looking for.
My requirement is to load around 10 million records from a flat file into a UDB database. Along the way I will need to get some info from a DB2 table on another mainframe machine. This DB2 table to be looked up is quite big; there can be something like 500 million plus records.
I am planning to use a parallel job for this loading process. The lookup is the area of concern, and I can do it in two ways:
1) Directly look up the DB2 table using the Lookup stage in parallel. As per my understanding, the table data will be brought over to the DataStage parallel server and the matching will be done in the Ascential server's memory, so there will be some time needed for Ascential to bring this data across and build it in memory.

2) Have a separate job which downloads the data from the DB2 table into a data set, and use that data set as the lookup in my loading job. Here I will have two jobs.

Given these two approaches, which will give me the best performance? Or is there a better approach?

T.I.A
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

dsuser1 wrote:Thanks for all the inputs. Let me add more clarity to what I am looking for.
My requirement is to load around 10 million records from a flat file into a UDB database. Along the way I will need to get some info from a DB2 table on another mainframe machine. This DB2 table to be looked up is quite big; there can be something like 500 million plus records.
Oh wow. If you use PX, you MUST wait until 7.0.1 before even attempting to deal with data that size as a lookup link.

Are you POSITIVE that the lookup table cannot be narrowed down in any possible way? I shudder to think how long it would take to go through the entire list on a lookup for a brand new record. Maybe using the Change Capture stage would be wiser (I have not used it, so let me know if it's a worse solution than an optimized Lookup stage on data that large).
1) Directly look up the DB2 table using the Lookup stage in parallel.
2) Have a separate job which downloads the data from the DB2 table into a data set, and use that data set as the lookup in my loading job. Here I will have two jobs.
You definitely need to land this data somehow if you plan to use that table more than once. The very first step I would take with this kind of issue is to ask: can this lookup table set be made smaller in any possible way?
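For example, if the table carries status or history columns, even a crude filter in the DB2 extract SQL can shrink the reference set dramatically (hypothetical names; DB2 labeled-duration date arithmetic shown):

Code: Select all

SELECT acct_id, acct_name, acct_balance
FROM big_ref_table
WHERE status = 'A'                            -- active rows only
  AND last_update >= CURRENT DATE - 90 DAYS;  -- recent activity only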

Then you definitely will have to use hash partitioning, and as many nodes as your server can handle.

Then you also need to simplify this job to focus on just this particular lookup, because this would allow you to really pump up the nodes. As long as you hash-partition your streams for the lookup (which is necessary for anything under 7.0.1, and is done by default on 7.0.1), everything should be fine.

I had an assumption which was just invalidated -- it does not matter how many nodes a data set is configured for; if you are using fewer (or more!), the data will still come out (although hash partitioning will help repartition it properly for your needs).

I am sure that others will provide a 3rd and 4th approach, because those two will be tough to handle, system-wide.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
vzoubov
Participant
Posts: 28
Joined: Tue Feb 05, 2002 12:30 pm
Location: Boston, MA

Post by vzoubov »

dsuser1 wrote:Thanks for all the inputs. Let me add more clarity to what I am looking for.
My requirement is to load around 10 million records from a flat file into a UDB database. Along the way I will need to get some info from a DB2 table on another mainframe machine. This DB2 table to be looked up is quite big; there can be something like 500 million plus records.
I am planning to use a parallel job for this loading process. The lookup is the area of concern, and I can do it in two ways:
1) Directly look up the DB2 table using the Lookup stage in parallel. As per my understanding, the table data will be brought over to the DataStage parallel server and the matching will be done in the Ascential server's memory, so there will be some time needed for Ascential to bring this data across and build it in memory.

2) Have a separate job which downloads the data from the DB2 table into a data set, and use that data set as the lookup in my loading job. Here I will have two jobs.

Given these two approaches, which will give me the best performance? Or is there a better approach?

T.I.A
If the lookup data set is large (doesn't fit in memory), consider using the Merge stage instead of the Lookup stage. Merge works on sorted inputs and matches the streams record by record, so it never has to hold the whole reference table in memory.

Vitali.