Duplicates of logical key loaded into dimension table
Moderators: chulett, rschirm, roy
Hi All,
Could I have some help on the best approach for the following requirement, please?
We are getting transactional records from our source table and loading the dimension tables.
As per the standard Type 2 approach, we find the match between the record in the dimension table and the stream data, and deal with the insert/update appropriately.
The slight difficulty is that many records coming from the source table share the same logical key within the same run. They have to be end-dated appropriately, before or after the lookup against the dimension table, so that at the end we have only one open-dated record per logical key in the dimension table.
Example:
Source Table -
BusKey|ChangeKey|InputTS
K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12
K2|Data2|2010-07-30 12:12:12
K2|Data3|2010-07-30 13:12:12
Dimension Table before -
SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS
51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|9999-12-31 00:00:00
Dimension Table after -
SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|2010-07-30 12:12:12
55|52|K2|Data2|2010-07-30 12:12:12|2010-07-30 13:12:12
56|52|K2|Data3|2010-07-30 13:12:12|9999-12-31 00:00:00
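For clarity, here is the intended before/after transformation expressed as a small Python sketch (illustration only, not our job design; the names and the next-SK handling are my own, and timestamps are kept as plain strings):

```python
from dataclasses import dataclass
from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

@dataclass
class DimRow:
    sk: int
    orig_sk: int
    bus_key: str
    change_key: str
    start_ts: str
    end_ts: str = HIGH_DATE

def historize(dim_rows, source_rows, next_sk):
    """Apply all source changes for one run so that each BusKey keeps a
    contiguous Type 2 history and exactly one open-dated row remains.
    Mutates the existing open rows in place to end-date them."""
    # index of the single open ("current") row per business key
    open_rows = {r.bus_key: r for r in dim_rows if r.end_ts == HIGH_DATE}
    out = list(dim_rows)
    # process each key's changes in timestamp order
    for bus_key, grp in groupby(sorted(source_rows, key=lambda r: (r[0], r[2])),
                                key=lambda r: r[0]):
        for _, change_key, input_ts in grp:
            prev = open_rows.get(bus_key)
            if prev is not None:
                prev.end_ts = input_ts        # close the previous version
                orig_sk = prev.orig_sk        # history keeps the original SK
            else:
                orig_sk = next_sk             # brand-new business key
            new = DimRow(next_sk, orig_sk, bus_key, change_key, input_ts)
            out.append(new)
            open_rows[bus_key] = new
            next_sk += 1
    return out
```

Running this with the two "before" dimension rows and the four source rows (starting new SKs at 53) reproduces the "after" table above.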
You will notice the 'OrigSK' field. It is there because the business key coming from the source can change, while the need for history remains.
I tried the SCD stage, but I am not getting the desired results.
Regards,
Novak
What you are looking into doing is "historization" of data. Depending upon your rules it can be simple or quite complex. I have just implemented program code for doing exactly this - the specifications were over 50 pages long and peppered with terms such as "temporal irreducibility" and "backward level changes" and other such concepts.
In our case, because of the complex rules, I ended up writing a program called from the job which computes the historised records. We also have both effective_from/to and valid_from/to timestamp fields to complicate things.
Your example shows that you can have multiple updates for the same key during one run, which certainly complicates matters. Since the logic implies that you need to read all "current" non-terminated dimension values for each source record, and you might have to update or insert more than one record in the dimension table for each input record, there is no way I can think of to solve it with a single stage in PX. Perhaps my view is biased, but I can't think of an alternative to writing a routine.
IBM has written such a historization operator, but the price tag is probably rather high; if you have a budget for external software you can contact them to see if they would sell it.
The code I've written here is, unfortunately for you, proprietary to the customer, and I don't know if they would be willing to part with it; but that is another alternative if you don't feel up to writing routines in either UV/BASIC or C++.
Thanks for looking into this, ArndW.
I kind of expected it would take more than one stage. As long as the number is not ridiculous and performance does not suffer, that should be fine?
Unfortunately, BASIC code has to be avoided, as the client's standards dictate maximum transparency.
Regards,
Novak
Novak wrote: the client's standards dictate the maximum transparency

The BASIC language is easily understood - the "B" in "BASIC" stands for "Beginners'", after all.
Resist stupid requirements!
Standards are a fine thing, when used intelligently. Try not to work for Vogons in your next engagement.

Novak wrote: the client's standards dictate the maximum transparency

Does this mean that they insist on using hammers made from glass?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
This can be achieved using a Transformer stage with the help of stage variables. One thing I am not sure about in your process is where exactly your SKs are created: before or after the lookup against the dimension table.

Assuming you are using BusKey to look up against the dimension table: hash-partition the data coming from the source table on BusKey and sort (only sort, do not re-partition) on InputTS in descending order. In the Transformer you need to define a few stage variables and check whether the key value changes between two consecutive records. If the key value changes, you populate EndTS with 9999-12-31 00:00:00; if not, you use the StartTS of the previous record as the EndTS (which should be stored in a separate stage variable).

The Transformer stage (on the output link of the Lookup stage) should have two output links in your design: one flow to update/end-date the records coming from the source table, and another flow to end-date the records from the dimension table. From your example, your output will have two records for K1 from the source, but with additional columns from the dimension table.

I think the information provided here should get you started. I know what you have to do, but it is just so late in the day... If you can't get it working, we are always here.
Kris
Where's the "Any" key?-Homer Simpson
A simple solution would be to sort on key and date in descending order; use a stage variable to put the date of the previous record as the end date of the current record when the key matches (you can do change detection in stage variables too); then send only one record, the one with the minimum date per key value, to SCD/upsert/Change Capture (I prefer Change Capture), and load all the other records to the database after that.
Don't forget to use hash partitioning on the key.
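A rough Python sketch of that sort-plus-previous-record logic (illustration only; in DataStage the stage variables would do the equivalent row by row, and the flag marks the minimum-date record to send to Change Capture):

```python
from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

def end_date_run(source_rows):
    """source_rows: (BusKey, ChangeKey, StartTS) tuples.
    Returns (BusKey, ChangeKey, StartTS, EndTS, is_earliest) per row:
    rows are processed per key, newest first, so each row's EndTS is
    the StartTS of the previous (newer) row; the newest row stays open."""
    out = []
    rows = sorted(source_rows, key=lambda r: r[0])           # key ascending
    for bus_key, grp in groupby(rows, key=lambda r: r[0]):
        grp = sorted(grp, key=lambda r: r[2], reverse=True)  # timestamp descending
        prev_start = None                                    # stage-variable analogue
        for i, (_, change_key, start_ts) in enumerate(grp):
            end_ts = HIGH_DATE if prev_start is None else prev_start
            is_earliest = (i == len(grp) - 1)                # min-date row per key
            out.append((bus_key, change_key, start_ts, end_ts, is_earliest))
            prev_start = start_ts
    return out
```

Only the rows flagged `is_earliest` would go through SCD/Change Capture; the rest are straight inserts.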
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Because you can get multiple updates in one run, a Transformer stage with stage variables isn't sufficient. One input can generate two outputs; the next input might have a period overlapping those two records, and thus involve two updates to the existing records (to terminate them) and two inserts for the new ranges.
I mentioned BASIC because it is simple; in your customer's case you will probably need to write a C++ program.
ArndW wrote: Because you can get multiple updates in one run, a transform stage with stage variables isn't sufficient.

A Transformer stage with stage variables, along with a Remove Duplicates stage, should be sufficient as long as we tell the stage variables what to do.

ArndW wrote: One input can generate 2 outputs

Based upon the OP's example, one input record will and can generate only one output, with additional columns from the reference link (dimension table), as long as you are reading from the dimension table only the rows where EndTS = 9999-12-31 00:00:00 (this is my assumption, and most likely what the OP is doing). Taking the OP's example, this is how the output of the Lookup stage would look for BusKey = K1 from the source table:
Code:

BusKey|ChangeKey|InputTS|SK|OrigSK|Dimension.BusKey|Dimension.ChangeKey|Dimension.StartTS|Dimension.EndTS
K1|Data2|2010-07-30 09:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00
K1|Data3|2010-07-30 11:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00

The stage variables (evaluated top to bottom for each record; note that the comparison must come before svOldBusKey is refreshed) would be:

Code:

svNewBusKey = SourceTable.BusKey
svEndTS = If svNewBusKey <> svOldBusKey Then '9999-12-31 00:00:00' Else svOldEndTS
svOldBusKey = svNewBusKey
svOldEndTS = SourceTable.InputTS

With the columns

Code:

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS

the first output link (the new dimension records built from the source) would carry:

Code:

54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12

and the second output link (end-date candidates for the existing dimension record), before the Remove Duplicates stage, would carry:

Code:

51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 11:12:12
Finally, you just funnel the first output link from the Transformer stage together with the output link of the Remove Duplicates stage, and you will have the required result.
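A rough Python sketch of the two output links described above (illustration only; the input tuples mimic the Lookup stage output, the running variables mimic the stage variables, and the dict mimics Remove Duplicates keeping the earliest candidate EndTS per SK):

```python
HIGH_DATE = "9999-12-31 00:00:00"

def split_links(lookup_rows):
    """lookup_rows: (BusKey, ChangeKey, InputTS, SK, OrigSK, DimStartTS),
    already hash-partitioned on BusKey and sorted on InputTS DESCENDING.
    Returns (new_versions, dim_updates) mirroring the two output links."""
    new_versions, dim_updates = [], []
    sv_old_key, sv_old_ts = None, None                 # stage-variable analogues
    for bus_key, change_key, input_ts, sk, orig_sk, dim_start in lookup_rows:
        # key change => newest record of this key => leave it open-dated
        end_ts = HIGH_DATE if bus_key != sv_old_key else sv_old_ts
        new_versions.append((orig_sk, bus_key, change_key, input_ts, end_ts))
        # candidate EndTS for the existing dimension record
        dim_updates.append((sk, orig_sk, bus_key, input_ts))
        sv_old_key, sv_old_ts = bus_key, input_ts
    # Remove Duplicates analogue: keep the smallest candidate EndTS per SK
    dedup = {}
    for sk, orig_sk, bus_key, end_ts in dim_updates:
        if sk not in dedup or end_ts < dedup[sk][3]:
            dedup[sk] = (sk, orig_sk, bus_key, end_ts)
    return new_versions, list(dedup.values())
```

Funnelling `new_versions` (after SK generation) with the deduplicated updates gives the required result.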
Hope that helps.
Kris
Where's the "Any" key?-Homer Simpson
Kris - that is a good analysis of the problem and makes sense for the example given, which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the logic gets complex beyond the scope of stage variables.
I am considering BusKey K1 (SK 51) as an example. Source data:
BusKey|ChangeKey|StartTS
K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12
union with existing dimension data
BusKey|ChangeKey|StartTS
K1|Data1|2010-07-19 09:12:12
If required, add a Remove Duplicates stage here based on data analysis.
Then sort on BusKey (asc) and StartTS (desc); the output should be like:
BusKey|ChangeKey|StartTS
K1|Data3|2010-07-30 11:12:12
K1|Data2|2010-07-30 09:12:12
K1|Data1|2010-07-19 09:12:12
The next step would be to identify OrigSK, which can be achieved by a lookup (match only on BusKey; do not reject any data). The output should be like:
OrigSK|BusKey|ChangeKey|StartTS
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12
51|K1|Data1|2010-07-19 09:12:12
The EndTS can be calculated in a Transformer, using a stage variable to carry the previous row's values into the current row:
|51|K1|Data3|2010-07-30 11:12:12|highdate
|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
The next step would be to identify SK, which can be achieved by a lookup (match on BusKey and StartTS; reject if no match). The output should be like:
Output to target
SK|OrigSK|BusKey|ChangeKey|StartTS
51|51|K1|Data1|2010-07-19 09:12:12
Reject
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12
The rejects should go through a surrogate key generator on the combination of BusKey and StartTS, then be inserted to the target.
All the records in the dimension table would be processed every day, which may be a concern if data volumes are high.
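A Python sketch of this whole flow (illustration only; names are mine, dicts stand in for the two lookups, and the new SKs may be assigned in a different order than in the example, which does not affect the result):

```python
from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

def rebuild_dimension(dim_rows, source_rows, next_sk):
    """dim_rows: dicts with SK, OrigSK, BusKey, ChangeKey, StartTS.
    source_rows: (BusKey, ChangeKey, StartTS) tuples.
    Re-derives EndTS for every version of every key, assigning new SKs
    to rows the dimension lacks (duplicates between source and dimension
    are assumed already removed)."""
    # 1. union of existing dimension rows and source rows
    union = [(r["BusKey"], r["ChangeKey"], r["StartTS"]) for r in dim_rows]
    union += list(source_rows)
    # lookup analogues on the existing dimension
    orig_by_key = {r["BusKey"]: r["OrigSK"] for r in dim_rows}
    sk_by_key_start = {(r["BusKey"], r["StartTS"]): r["SK"] for r in dim_rows}
    # 2. sort BusKey asc / StartTS desc, then end-date via previous-row carry
    union.sort(key=lambda r: r[0])
    out = []
    for bus_key, grp in groupby(union, key=lambda r: r[0]):
        grp = sorted(grp, key=lambda r: r[2], reverse=True)
        prev_start = None
        for _, change_key, start_ts in grp:
            end_ts = HIGH_DATE if prev_start is None else prev_start
            sk = sk_by_key_start.get((bus_key, start_ts))  # lookup reject ...
            if sk is None:
                sk, next_sk = next_sk, next_sk + 1         # ... gets a new SK
            # brand-new keys adopt the first SK seen for that key as OrigSK
            orig_sk = orig_by_key.setdefault(bus_key, sk)
            out.append(dict(SK=sk, OrigSK=orig_sk, BusKey=bus_key,
                            ChangeKey=change_key, StartTS=start_ts,
                            EndTS=end_ts))
            prev_start = start_ts
    return out
```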
ArndW wrote: which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the logic gets complex beyond the scope of stage variables.

Even if it is not, you can take only the active version, or the one with a null EndTS. I think I see what you want to say, but it seems a bit different from the example given (correct me if I'm wrong). Hence I will wait for the OP to clarify whether the solution provided is sufficient; otherwise we will have to go another way.
It does assume the target data is already in the correct format the output is supposed to have, whether you use this logic to feed it or build it from scratch.
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.