Duplicates of logical key loaded into dimension table

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Duplicates of logical key loaded into dimension table

Post by Novak »

Hi All,

I'd appreciate some help with the best approach for the following requirement, please.

We are getting transactional records from our source table and loading the dimension tables.
As per the standard Type 2 approach, we find the match between the record in the dimension table and the stream data, and handle the insert/update appropriately.
The slight difficulty arises because many records coming from the source table have the same logical key within the same run. They have to be end-dated appropriately, before or after the lookup against the dimension table, so that at the end we have only one open-ended record per logical key in the dimension table.

Example:

Source Table -

BusKey|ChangeKey|InputTS

K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12
K2|Data2|2010-07-30 12:12:12
K2|Data3|2010-07-30 13:12:12

Dimension Table before -

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS

51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|9999-12-31 00:00:00

Dimension Table after -

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS

51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|2010-07-30 12:12:12
55|52|K2|Data2|2010-07-30 12:12:12|2010-07-30 13:12:12
56|52|K2|Data3|2010-07-30 13:12:12|9999-12-31 00:00:00

You will notice the 'OrigSK' field. It is there because the business key coming from the source can change, while the need for history remains.
I tried the SCD stage but am not getting the desired results.
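To make the desired transformation concrete, here is a minimal Python sketch of the versioning rule applied to the sample data above (purely illustrative and outside DataStage; the hard-coded data and names are mine):

Code: Select all

from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

# The currently open (EndTS = high date) dimension rows, keyed by BusKey.
dim = {"K1": {"SK": 51, "OrigSK": 51}, "K2": {"SK": 52, "OrigSK": 52}}

# Source rows: (BusKey, ChangeKey, InputTS)
src = [
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K2", "Data2", "2010-07-30 12:12:12"),
    ("K2", "Data3", "2010-07-30 13:12:12"),
]

next_sk = 53
updates, inserts = [], []
for bus_key, grp in groupby(sorted(src, key=lambda r: (r[0], r[2])), key=lambda r: r[0]):
    rows = list(grp)
    # Close out the existing open dimension row at the first new InputTS.
    updates.append((dim[bus_key]["SK"], rows[0][2]))
    # Chain the new versions: each EndTS is the next row's InputTS; the
    # newest row per key stays open with the high date.
    for (_, change_key, start), nxt in zip(rows, rows[1:] + [None]):
        end = nxt[2] if nxt else HIGH_DATE
        inserts.append((next_sk, dim[bus_key]["OrigSK"], bus_key, change_key, start, end))
        next_sk += 1

print(updates)  # [(51, '2010-07-30 09:12:12'), (52, '2010-07-30 12:12:12')]
print(inserts)  # the new rows 53..56 from the "Dimension Table after" example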

Regards,

Novak
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

What you are looking into doing is "historization" of data. Depending upon your rules it can be simple or quite complex. I have just implemented program code for doing exactly this - the specifications were over 50 pages long and peppered with terms such as "temporal irreducibility" and "backward level changes" and other such concepts.

In our case, because of the complex rules, I ended up writing a program, called from the job, which computes the historized records. We also have both effective_from/to and valid_from/to timestamp fields to complicate things.

Your example shows that you can have multiple updates for the same key during one run, which certainly complicates matters. Since the logic implies that you need to read all "current" non-terminated dimension values for each source record, and you might have to update/insert more than one record in the dimension table for each input record, there is no way I can think of to solve it with a single stage in PX. Perhaps my view is biased, but I can't think of an alternative to writing a routine.

IBM has written such a historization operator, but the price tag is probably rather high; if you have a budget for external software you can contact them to see whether they would sell it.

The code I've written here is, unfortunately for you, proprietary to the customer and I don't know if they would be willing to part with it; but that is another alternative if you don't feel up to writing routines in either UV/BASIC or C++.
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

Thanks for looking into this ArndW.

I kind of expected it would take more than one stage. As long as the number is not ridiculous and performance does not suffer, it should be fine?
Unfortunately, BASIC code has to be avoided, as the client's standards dictate maximum transparency.

Regards,

Novak
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Novak wrote:the client's standards dictate the maximum transparency
The BASIC language is easily understood - the "B" in "BASIC" stands for "Beginners'" after all.

Resist stupid requirements!

Standards are a fine thing, when used intelligently. Try not to work for Vogons in your next engagement.
Novak wrote:the client's standards dictate the maximum transparency
Does this mean that they insist on using hammers made from glass? :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Vogons? Yikes, please tell me you won't start spouting their poetry next? :shock:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

One day, Ray, I will be able to choose a client as well :D but right now I am stuck here.
So, is there nothing you guys can suggest that would use only the canvas stages?

Regards,

Novak
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

This can be achieved using a Transformer stage with the help of stage variables. One thing I am not sure about in your process is where exactly your SKs are getting created: before or after the lookup against the dimension table.

Assuming you are using the BusKey to look up against the dimension table, hash partition the data coming from the source table on BusKey and sort (only sort, do not re-partition) on InputTS in descending order. In the Transformer you need to define a few stage variables and check whether the key value changes between two consecutive records. If the key value changes, you populate the EndTS with 9999-12-31 00:00:00; if not, you use the InputTS (the new StartTS) of the previous record as the EndTS, which should be stored in a separate stage variable.

The Transformer stage (on the output link of the Lookup stage) should have two output links in your design: one flow to update/end-date the records coming from the source table, and another flow to end-date the records from the dimension table. From your example, your output will have two records for K1 from the source, but with additional columns from the dimension table.

I think the information provided here should get you started. I know what you have to do, but it's just so late in the day... if you can't get it working, we are always here.
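A rough Python equivalent of that stage-variable logic, purely to illustrate the previous-row comparison (the row data and variable names are mine; this is not DataStage syntax):

Code: Select all

HIGH_DATE = "9999-12-31 00:00:00"

# K1's source rows, hash-partitioned on BusKey and sorted by InputTS descending.
rows = [
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data2", "2010-07-30 09:12:12"),
]

prev_key, prev_ts = None, None  # the "stage variables" carried across rows
for bus_key, change_key, input_ts in rows:
    # The first (newest) row of each key stays open; later rows are
    # end-dated with the previously seen, i.e. next-newer, InputTS.
    end_ts = HIGH_DATE if bus_key != prev_key else prev_ts
    print(bus_key, change_key, input_ts, end_ts)
    prev_key, prev_ts = bus_key, input_ts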
Kris

Where's the "Any" key?-Homer Simpson
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

A simple solution would be to sort on the key and date in descending order, use a stage variable to put the date of the previous record as the end date of the current record if the key matches (you can do change detection in stage variables too), and send only one record per key value, the one with the minimum date, to SCD/upsert/Change Capture (I prefer Change Capture), as sketched below. Then load all the other records to the database after that.

Don't forget to use hash partitioning on the key.
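As a rough illustration of that split in Python (made-up rows; only the retention logic matters):

Code: Select all

from itertools import groupby

# End-dated rows, hash-partitioned on the key and sorted by date descending.
rows = [
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K2", "Data3", "2010-07-30 13:12:12"),
    ("K2", "Data2", "2010-07-30 12:12:12"),
]

# The minimum-date record per key goes to Change Capture, since it is the one
# that closes out the currently open dimension row; the rest insert directly.
to_change_capture, straight_inserts = [], []
for _, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)
    oldest = min(grp, key=lambda r: r[2])
    to_change_capture.append(oldest)
    straight_inserts += [r for r in grp if r is not oldest]

print(to_change_capture)  # the two Data2 rows, one per key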
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Because you can get multiple updates in one run, a Transformer stage with stage variables isn't sufficient. One input can generate two outputs, and the next input might have an overlapping period with those two records, thus involving two updates on the existing records (to terminate them) and two inserts for the new ranges.

I mentioned BASIC because it is simple; in your customer's case you will probably need to write a C++ program.
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

ArndW wrote:Because you can get multiple updates in one run, a transform stage with stage variables isn't sufficient.
A Transformer stage with Stage Variables along with a Remove Duplicates stage should be sufficient as long as we tell the Stage Variables what to do.
One input can generate 2 outputs
Based upon the OP's example, one input record can generate only one output record, with additional columns from the reference link (the dimension table), as long as you are reading data from the dimension table where EndTS = 9999-12-31 00:00:00 (this is my assumption, and most likely what the OP is doing). Taking the OP's example, this is how the output of the Lookup stage would look for BusKey = K1 on the source table --

Code: Select all

BusKey|ChangeKey|InputTS|SK|OrigSK|Dimension.BusKey|Dimension.ChangeKey|Dimension.StartTS|Dimension.EndTS 
K1|Data2|2010-07-30 09:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00 
K1|Data3|2010-07-30 11:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00 
Once we feed the output link of the Lookup stage to a Transformer stage, we can define some stage variables. The important thing here is to hash partition and sort the data on SourceTable.BusKey, and additionally sort (only sort) on SourceTable.InputTS in descending order.

Code: Select all

svEndTS     = If SourceTable.BusKey <> svOldBusKey Then '9999-12-31 00:00:00' Else svOldEndTS
svOldEndTS  = SourceTable.InputTS
svOldBusKey = SourceTable.BusKey

Note that stage variables are evaluated top to bottom for each row, so svEndTS must be derived while svOldBusKey and svOldEndTS still hold the previous row's values, before they are overwritten with the current row's values.
At this point, I would define two output links for the Transformer stage. Since the format of the output data is

Code: Select all

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS 
All we need to do here is map the columns from the source table to the output link and set EndTS to the stage variable svEndTS. This will produce the output records

Code: Select all

54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12 
On the second output link of the Transformer, map the columns from the Dimension Table to the output link and map SourceTable.InputTS to the EndTS on the output link. This will produce output records

Code: Select all

51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 11:12:12
You just need to feed the second output link to a Remove Duplicates stage and retain the required record based on how you sort the data on the EndTS field, as sketched below.
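A tiny Python illustration of that retention rule (again illustrative, not DataStage):

Code: Select all

# Close-out candidates for SK 51 from the second link (one per source row).
candidates = [
    (51, 51, "K1", "Data1", "2010-07-19 09:12:12", "2010-07-30 09:12:12"),
    (51, 51, "K1", "Data1", "2010-07-19 09:12:12", "2010-07-30 11:12:12"),
]

# Keep the earliest EndTS per SK -- the dimension row must be closed at the
# first new StartTS. This is what Remove Duplicates retains after an
# ascending sort on EndTS.
best = {}
for row in candidates:
    sk, end_ts = row[0], row[5]
    if sk not in best or end_ts < best[sk][5]:
        best[sk] = row

print(list(best.values()))  # keeps only the 2010-07-30 09:12:12 close-out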

Now, finally, you just funnel together the first output link from the Transformer stage and the output link from the Remove Duplicates stage, and you will have the required result.

Hope that helps.
Kris

Where's the "Any" key?-Homer Simpson
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Kris - that is a good analysis of the problem and makes sense for the example given, which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the logic gets complex beyond the scope of stage variables.
HariK
Participant
Posts: 68
Joined: Thu May 17, 2007 1:50 am

Post by HariK »

I am considering BusKey K1 (SK 51) as an example.

BusKey|ChangeKey|StartTS
K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12

Union this with the existing dimension data:

BusKey|ChangeKey|StartTS
K1|Data1|2010-07-19 09:12:12

If required, add a Remove Duplicates stage here, based on data analysis.

Then sort on BusKey (asc) and StartTS (desc); the output should look like:
BusKey|ChangeKey|StartTS
K1|Data3|2010-07-30 11:12:12
K1|Data2|2010-07-30 09:12:12
K1|Data1|2010-07-19 09:12:12

The next step would be to identify OrigSK, which can be achieved by a lookup (match only on BusKey, do not reject any data); the output should look like:
OrigSK|BusKey|ChangeKey|StartTS
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12
51|K1|Data1|2010-07-19 09:12:12

The EndTS can be calculated in a Transformer, using stage variables to carry the previous row's values over to the current row:

|51|K1|Data3|2010-07-30 11:12:12|highdate
|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12


The next step would be to identify SK, which can be achieved by a lookup (match on BusKey and StartTS, reject if no match); the output should look like:

Output to target
SK|OrigSK|BusKey|ChangeKey|StartTS
51|51|K1|Data1|2010-07-19 09:12:12
Reject
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12

The rejects should go through a key generator on the combination of BusKey and StartTS, then be inserted into the target.

All the records in the dimension table would be processed every day, which may be a concern if data volumes are high.
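For what it's worth, the whole flow condenses into a small Python sketch (illustrative only; the dictionaries stand in for the two lookups, and the new SK numbering is arbitrary, as it would really come from the key generator):

Code: Select all

from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

# Union of the source rows and the existing dimension row for K1, as above.
unioned = [
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data1", "2010-07-19 09:12:12"),  # from the dimension table
]

orig_sk = {"K1": 51}                               # lookup on BusKey only
existing_sk = {("K1", "2010-07-19 09:12:12"): 51}  # lookup on BusKey + StartTS

next_sk = 53
ordered = sorted(unioned, key=lambda r: (r[0], r[2]))[::-1]  # StartTS descending
for bus_key, grp in groupby(ordered, key=lambda r: r[0]):
    end_ts = HIGH_DATE                 # the stage-variable carry-over, per key
    for _, change_key, start_ts in grp:
        sk = existing_sk.get((bus_key, start_ts))
        if sk is None:                 # reject path: generate a new surrogate key
            sk, next_sk = next_sk, next_sk + 1
        print(sk, orig_sk[bus_key], bus_key, change_key, start_ts, end_ts)
        end_ts = start_ts              # this StartTS end-dates the next-older row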
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

ArndW wrote:which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the log ...
Even if it's not there, you can take only the active version, or the one with a null EndTS. I think I see what you are trying to say, but it seems a bit different from the example given (correct me otherwise). Hence I will wait for the OP to clarify whether the solution provided is sufficient or not; otherwise we will have to go another way.

It does assume that the target data is already in the correct format the output is supposed to have, with this logic then feeding data into it, or that you are building it from scratch.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

Hi guys,

I appreciate the sensible inputs. Unfortunately, I will not get the chance to try them until Monday. I will let you know how I go then.

Regards,

Novak