Discussion about Late Arriving Dimensions

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I have always used -1 and 0 and preloaded them into my dimension tables. That way, when the lookup fails, you switch to -1 or 0 depending on the design. Sometimes 0 is unknown and -1 is late arriving. Specific industries have more of these issues than others, like healthcare insurance. PPOs are the worst because they do not load members until they have a claim on them. So until a child has a doctor's visit, he is not in the system. Sometimes they delay loading the child until all business rules are validated, like age and relationship verified manually through a benefits admin. That complicates the loading of the fact table. A claim may get paid and later denied because the child is not eligible.
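For illustration, the preload looks something like this (the table and column names here are just made up for the example, not any real schema):

Code:

-- Preload the default rows once, before any fact load runs
INSERT INTO member_dim (member_key, member_nk, member_name)
VALUES (0, 'UNKNOWN', 'Unknown member');
INSERT INTO member_dim (member_key, member_nk, member_name)
VALUES (-1, 'LATE', 'Late arriving member');

-- In the fact load, a failed lookup falls back to the default key
SELECT f.claim_nbr,
       COALESCE(d.member_key, -1) AS member_key
FROM   stg_claims f
LEFT JOIN member_dim d
       ON d.member_nk = f.member_nbr;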

Sometimes there are multiple systems involved; for example, membership is a different system than claims. Claims are paid and become facts. So the claims system has a member number that is different from the membership member id. Basically there are two dimensions for members, one for each system. At some point the member is unknown and/or late arriving. Usually there is a clean-up step to identify these after the load. That is where it gets messy.
Mamu Kim
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I never want to call a stored procedure during ETL. I have lost control; I am now reliant on an outside process to load data. It is also database specific, and part of the purpose of ETL tools is to be database independent.

I always preload -1 and 0 just for performance.

Please give an example of your late arriving dimensions. I always want to know the business reason for something out of the norm.
Mamu Kim
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

But how would you decide if there is more than one missing value? Do you keep decrementing the negative number? Once the true dimension values do arrive, how do you differentiate that -1 is "Parsley" and -2 is "Tomatoes"?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

My current (non-DataStage) project is taking a "pre-load" approach as well. After all of the dimensions have been updated from their sources, the incoming daily fact data is thrown up against each dimension and we create "stub" records for anything late arriving. They get a real surrogate key and as much data as can be gleaned from the fact data, but they are marked in an attribute column as stubs so we know their source. Hopefully the "real" data will arrive later (all dimensions are Type 2), but in the meantime we have the RI we need.
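Roughly like this, as a sketch only (the table, sequence and column names are invented, not our actual ones):

Code:

-- Stub any natural keys in today's fact feed that have no dimension row yet
INSERT INTO member_dim (member_key, member_nk, load_date, is_stub)
SELECT member_seq.NEXTVAL,   -- a real surrogate key
       s.member_nbr,         -- natural key carried on the incoming fact
       CURRENT_DATE,
       'Y'                   -- attribute column marking it as a stub
FROM  (SELECT DISTINCT member_nbr FROM stg_fact_daily) s
WHERE NOT EXISTS (SELECT 1
                  FROM   member_dim d
                  WHERE  d.member_nk = s.member_nbr);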

To address your concern about checking 'all' of the fact data, we're just checking against the new fact data that arrived that day. Even when that volume is 'large' the amount of time it takes to pull distinct values from the incremental source data is minimal.

We also have a "-1" record in all appropriate dimensions for the unknown or "NA" joins against the fact where the source column in the fact is "empty".
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Maybe I did not take my smart pills today, but even if you mark these records as stub records and assign a true surrogate key without a natural key (the dimension code), how are you later going to identify the right stub record and assign it its correct dimensional value?
Sorry OP for hijacking your thread, I am just interested in how this process works.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

We have the natural key from the fact data. In many cases that's all we have, making the stub dimension records extremely... stubby. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

DSguru2B wrote:Maybe I did not take my smart pills today, but even if you mark these records as stub records and assign a true surrogate key without a natural key (the dimension code), how are you later going to identify the right stub record and assign it its correct dimensional value?
Sorry OP for hijacking your thread, I am just interested in how this process works.
I have opened this thread to learn about and discuss various approaches, so its purpose is educational rather than just to solve a certain issue. So you are welcome to "hijack" it and share your thoughts. I hope this is OK on this forum :D.

As for your questions, in the case of a "generic" -1 dimension record (where you have only one dummy -1 for all the missing "Pumpkin", "Gorgonzola" and "Walnuts"), you have to store the natural key in the fact table if you want to update the fact with the surrogate key of the dimension row that has (eventually) arrived. The updates of the fact become somewhat convoluted as well, especially in the case of Type 2 facts.
Also, in my opinion, it misses the point of having dimensions and facts altogether, as you store the natural key in both of them.
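For example, the fix-up of the fact would look something like this (hypothetical names; it assumes the fact row carries the natural key product_nk):

Code:

-- Repoint fact rows parked on the generic -1 key once the real
-- dimension row arrives, using the natural key kept on the fact
UPDATE sales_fact f
SET    product_key = (SELECT d.product_key
                      FROM   product_dim d
                      WHERE  d.product_nk   = f.product_nk
                      AND    d.current_flag = 'Y')
WHERE  f.product_key = -1
AND    EXISTS (SELECT 1
               FROM   product_dim d
               WHERE  d.product_nk = f.product_nk);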
chulett wrote:To address your concern about checking 'all' of the fact data, we're just checking against the new fact data that arrived that day. Even when that volume is 'large' the amount of time it takes to pull distinct values from the incremental source data is minimal.
I agree. Unfortunately the 600 million rows I mentioned before are the new fact data - it's a sales forecast for a month. That's why I am trying to find a solution that would avoid pre-scanning this data. But in the end the pre-load solution may turn out to be the most efficient after all.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

DSguru2B

Usually there is a natural key in the fact to relate rows in the target back to the source. So in my example you might have a claim number. You have to go back to the source and check whether these unknowns have arrived, so basically you have to reprocess those rows: you get the claim numbers from your fact table, probably load them into a temp table on the source, and use your normal job to process these rows, except the SQL now joins the temp table to the claims table.

A lot of sites put the natural keys in shadow tables, but the concept is the same; you just have to join the fact table to the fact shadow table. Usually the key for a fact or dimension table is the same surrogate as for its shadow table. This makes your target tables a little narrower. I find the end users are very comfortable seeing their natural keys in the warehouse tables. The trouble is they use them in where clauses too much, which kills performance unless you put indexes on these columns. Multiple sources complicate this a lot.
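Roughly, as a sketch (the claim and temp table names are placeholders):

Code:

-- 1. Pull the natural keys of the unresolved fact rows
--    (these get loaded into a temp table on the source side)
INSERT INTO tmp_late_claims (claim_nbr)
SELECT claim_nbr
FROM   claim_fact
WHERE  member_key = -1;

-- 2. The normal extract SQL, narrowed to just those claims
SELECT c.*
FROM   claims c
JOIN   tmp_late_claims t
       ON t.claim_nbr = c.claim_nbr;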

You are correct: there is no way to fix this with only the information in the data warehouse. Maybe Craig's stub idea is the same as what I described; I am not sure.
Mamu Kim
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

My normal method is to set the dimension key to -1; however, it depends on how you want to go back and fix the mess later when the real data comes through. If you have a natural key and no other attributes, then the benefit of creating a new dimension row with empty values is that you can fix up that dimension later with a standard slowly changing dimension job, the same one that keeps that dimension up to date. You do not have to go back and fix any fact rows. If you set the dimension to -1, then you need a slowly changing fact job - a job that goes back and corrects old fact records when new dimension rows arrive and repoints those old fact rows to the new dimension rows with the corrected surrogate key values.

On very large fact tables it is best to avoid having to go back and modify old fact records. On fact tables that are supposed to be modified, such as accumulating snapshot fact tables, it may be okay to use the -1 dimension approach.
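As a sketch of that dimension fix-up (a simple Type 1 overwrite shown for brevity; the names are hypothetical):

Code:

-- The stub row already owns the surrogate key the facts point at,
-- so the arriving source record just completes it in place
UPDATE member_dim d
SET    (member_name, birth_date, is_stub) =
       (SELECT s.member_name, s.birth_date, 'N'
        FROM   stg_members s
        WHERE  s.member_nk = d.member_nk)
WHERE  d.is_stub = 'Y'
AND    EXISTS (SELECT 1
               FROM   stg_members s
               WHERE  s.member_nk = d.member_nk);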
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Right, we had zero desire for any need to go back and modify old fact records for late arriving dimensions. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

kduke wrote:Usually there is a natural key in the fact to relate rows in the target back to the source.
That's reprocessing. That makes more sense now. The world makes sense again :)
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

Thank you all for your contributions.