DataStage functionality to perform a lookup for a type 2 dim

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

DataStage functionality to perform a lookup for a type 2 dim

Post by eldonp »

Does anybody know how to do a lookup against a type 2 dimension within a transformer?

If done in SQL, the WHERE clause would use a BETWEEN, e.g.
where source.tran_date between lookup.from_date and lookup.to_date
and source.id = lookup.id


I have a lookup with a composite primary key on id, from_date and to_date

The main input link has an id, tran_date and the facts.

I need to use this to load history data (we cannot do incremental or delta loads) and need to find the correct wh_key values.

I could email you a dsx file and some sample data. My email address is eldonp@lantic.net
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You could check out this post:
viewtopic.php?t=84998&highlight=scd

A type 2 can be implemented multiple ways in a warehouse. Since your variants have a start and end effective date, you have to be careful that you don't get sucked into the belief that a SQL-style solution is the answer. If you're going to do RDBMS reference lookups, you're going to use some of the worst-performing SQL ever invented (BETWEEN). It ranks right up there with the subselect aggregation used when you run an Inmon atomic-level warehouse doing SCDs on all tables with no END_EFFECTIVE_DT, because you're never allowed to update rows and so can't go back and stamp a row's end effective date. You can use the same solution for both, as long as you agree that the beginning effective date for a variant is always the ending effective date (minus 1 millisecond/second/day) of the previous variant.

If you're running a bunch of jobs doing the same lookups against the same table, you'll run into DB congestion and saturation. If you're doing divide-and-conquer job instantiation to tackle a large amount of source data, then you'll have a bunch of instances all hitting the same reference lookups over and over; and if they're in a database, and you're using the worst-performing SQL, need I go on?

So, you should attempt to utilize hash lookups. The methods and routines/functions are actual production routines in use for over four years by various clients of mine. This is the highest-performing lookup solution for Server jobs that I have found, with the least amount of maintenance. You don't have to use UV/ODBC stages with secondary indexes, which are still slow.
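The shape of such a hash lookup can be sketched outside DataStage. This is a minimal Python illustration, not the actual DS306 routines (the names load_dimension and lookup_wh_key are made up): the dimension is loaded once into an in-memory map keyed by natural key, and a small function does the ranged scan that SQL's BETWEEN would otherwise do per row against the database.

```python
# Sketch of a hash-style ranged lookup: the type 2 dimension is loaded
# once into a dict keyed by natural key, and each fact row scans that
# key's few variants in memory instead of issuing a SQL BETWEEN.
# Function names are illustrative, not DataStage APIs.

def load_dimension(rows):
    """rows: iterable of (natural_key, from_date, to_date, wh_key) tuples."""
    dim = {}
    for natural_key, from_date, to_date, wh_key in rows:
        dim.setdefault(natural_key, []).append((from_date, to_date, wh_key))
    return dim

def lookup_wh_key(dim, natural_key, tran_date):
    """Return the wh_key whose [from_date, to_date] range covers tran_date."""
    for from_date, to_date, wh_key in dim.get(natural_key, ()):
        if from_date <= tran_date <= to_date:
            return wh_key
    return None  # no variant in effect on that date

dim = load_dimension([
    ("CUST1", "2002-01-01", "2002-12-31", 101),
    ("CUST1", "2003-01-01", "9999-12-31", 102),
])
print(lookup_wh_key(dim, "CUST1", "2003-06-15"))  # 102
```

ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed here; in a Server job the same idea lives in a DS BASIC routine reading a hashed file.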
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

Post by eldonp »

Thanks Kenneth. Perhaps I was a bit unclear in my initial post. I have a type 2 dimension with delimiting dates and would like to do a lookup against it for populating a fact table.

Say, for example, I have a hashed file with the data from the dimension. A record is made unique by the combination of natural key, from date and to date. With these values, I need to find the correct warehouse dimension key.

The source data, however, has only a transaction date, natural key, then the factual values.

I am performing the lookup inside a transformer stage. I link from source natural key to lookup natural key, but need to link the source transaction date to both the (lookup)from date and to date using a function similar to the sql between (or >=, <=).

As far as I can see, I am not allowed to do this kind of thing in a transformer without possibly creating a transform or routine...
kcbland wrote:You could check out this post:
viewtopic.php?t=84998&highlight=scd
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, I think you were clear... and Ken answered your question. Check out the post he linked to - it details a highly performant methodology to use Hash Files to do exactly what you asked.

However, that being said - if you would like to fall back on the least efficient way to do this :) you can utilize custom SQL in the lookup stage (OCI, ODBC, whatever), pass in the key variables that you need, and do the "between" SQL back to your source database as you've mentioned.
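That fallback looks roughly like this. The sketch below uses an in-memory sqlite3 database to stand in for OCI/ODBC, and the table and column names are made up; the point is the shape of the user-defined SQL the reference link would fire once per incoming fact row.

```python
# Sketch of the per-row "between" lookup: a parameterized query run
# against the database for every fact row. sqlite3 stands in for
# OCI/ODBC; table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    id TEXT, from_date TEXT, to_date TEXT, wh_key INTEGER)""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?)", [
    ("CUST1", "2002-01-01", "2002-12-31", 101),
    ("CUST1", "2003-01-01", "9999-12-31", 102),
])

# This round trip happens once per input row, which is where the
# congestion Ken warns about comes from.
LOOKUP_SQL = """SELECT wh_key FROM dim_customer
                WHERE id = ? AND ? BETWEEN from_date AND to_date"""

row = conn.execute(LOOKUP_SQL, ("CUST1", "2003-06-15")).fetchone()
print(row[0])  # 102
```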

It's just that there are better, more "DataStage-centric" ways to accomplish the same thing that Ken is trying to point out to you. Yes, it's a little different than just doing a "normal" lookup, and yes, it requires a routine or two and some preparation to set up, but if you are doing anything with any sort of volume it's well worth checking out as the fastest way to do what you need to do.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

eldonp wrote:Thanks Kenneth. Perhaps I was a bit unclear in my initial post. I have a type 2 dimension with delimiting dates and would like to do a lookup against it for populating a fact table.
Yep, I'm with you so far.
eldonp wrote: Say, for example, I have a hashed file with the data from the dimension. A record is made unique by the combination of natural key, from date and to date. With these values, I need to find the correct warehouse dimension key.
Yep, unfortunately, your hashed file can only be queried using a UV/ODBC stage. Your reference will not be a hash reference lookup, but a SQL ranging query.
eldonp wrote:The source data, however, has only a transaction date, natural key, then the factual values.

I am performing the lookup inside a transformer stage. I link from source natural key to lookup natural key, but need to link the source transaction date to both the (lookup)from date and to date using a function similar to the sql between (or >=, <=).
Yes, and I told you to let go of your SQL mindset. You started out giving a functional requirement, but now you've qualified it with a technical gotcha.

Your requirement is to find the surrogate key assigned to a natural key where that surrogate key was effective for the business date on your fact record. I gave you more than you were asking, because I wanted you to realize there are more issues at stake. I could have just tossed off an answer: use an OCI/ODBC reference lookup with user-defined SQL to specify the BETWEEN criteria. However, that goes against just about everything I recommend in ETL. It's not performant, it doesn't scale, it requires a current table, it's not friendly to highly-instantiated jobs because of congestion, blah blah blah.

You don't need an END effective date. SCDs should never have gaps. Therefore, the begin date of a variant is the end date of the previous one. That being true, please go read the link I highlighted. I'm trying to explain an alternative to SQL that still does a ranged lookup (in a function), but uses the native 1-to-1 performance benefit of a hash lookup, not to mention hash read caching, in-memory shared hash caching, and scaling to instantiation (no extra copies in memory, no saturation of network I/O or DB calls).
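The begin-date-only point can be sketched as follows (again hypothetical Python, not the actual DS306 routine): with no gaps between variants, the variant in effect is simply the latest one whose begin date is on or before the transaction date, which a binary search finds without any end-date column at all.

```python
# Sketch: gap-free variants need only a begin effective date. Keep each
# natural key's begin dates sorted and binary-search for the latest one
# on or before the transaction date. Names are illustrative.
import bisect

def build_index(rows):
    """rows: (natural_key, begin_date, wh_key); returns sorted per-key lists."""
    index = {}
    for natural_key, begin_date, wh_key in sorted(rows):
        dates, keys = index.setdefault(natural_key, ([], []))
        dates.append(begin_date)  # sorted begin effective dates
        keys.append(wh_key)       # parallel surrogate keys
    return index

def lookup(index, natural_key, tran_date):
    dates, keys = index.get(natural_key, ([], []))
    i = bisect.bisect_right(dates, tran_date) - 1
    return keys[i] if i >= 0 else None  # None: unknown key or date too early

idx = build_index([
    ("CUST1", "2002-01-01", 101),
    ("CUST1", "2003-01-01", 102),
])
print(lookup(idx, "CUST1", "2002-06-15"))  # 101
print(lookup(idx, "CUST1", "2001-06-15"))  # None
```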

This technique was taught as part of Ascential's DS306 data warehousing ETL course. It's rather simple and extremely fast.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle