Duplicates of logical key loaded into dimension table

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Duplicates of logical key loaded into dimension table

Post by Novak »

Hi All,

I'd appreciate some help with the best approach for the following requirement, please.

We are getting transactional records from our source table and loading the dimension tables.
As per the standard Type 2 approach, we find the match between the record in the dimension table and the stream data, and handle the insert/update appropriately.
The slight difficulty arises because many records coming from the source table have the same logical key within the same run. They have to be end-dated appropriately, before or after the lookup against the dimension table, so that at the end we have only one open-ended record per logical key in the dimension table.

Example:

Source Table -

BusKey|ChangeKey|InputTS

K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12
K2|Data2|2010-07-30 12:12:12
K2|Data3|2010-07-30 13:12:12

Dimension Table before -

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS

51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|9999-12-31 00:00:00

Dimension Table after -

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS

51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
52|52|K2|Data1|2010-07-23 09:12:12|2010-07-30 12:12:12
55|52|K2|Data2|2010-07-30 12:12:12|2010-07-30 13:12:12
56|52|K2|Data3|2010-07-30 13:12:12|9999-12-31 00:00:00

You will notice the 'OrigSK' field. It is there because the business key coming from the source can change, while the need for history remains.
I tried the SCD stage but am not getting the desired results.
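To make the desired transformation concrete, here is a minimal Python sketch of the versioning rule applied to the sample data above (purely illustrative and outside DataStage; the hard-coded data and names are mine):

Code: Select all

from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

# The currently open (EndTS = high date) dimension rows, keyed by BusKey.
dim = {"K1": {"SK": 51, "OrigSK": 51}, "K2": {"SK": 52, "OrigSK": 52}}

# Source rows: (BusKey, ChangeKey, InputTS)
src = [
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K2", "Data2", "2010-07-30 12:12:12"),
    ("K2", "Data3", "2010-07-30 13:12:12"),
]

next_sk = 53
updates, inserts = [], []
for bus_key, grp in groupby(sorted(src, key=lambda r: (r[0], r[2])), key=lambda r: r[0]):
    rows = list(grp)
    # Close out the existing open dimension row at the first new InputTS.
    updates.append((dim[bus_key]["SK"], rows[0][2]))
    # Chain the new versions: each EndTS is the next row's InputTS; the
    # newest row per key stays open with the high date.
    for (_, change_key, start), nxt in zip(rows, rows[1:] + [None]):
        end = nxt[2] if nxt else HIGH_DATE
        inserts.append((next_sk, dim[bus_key]["OrigSK"], bus_key, change_key, start, end))
        next_sk += 1

print(updates)  # [(51, '2010-07-30 09:12:12'), (52, '2010-07-30 12:12:12')]
print(inserts)  # the new rows 53..56 from the "Dimension Table after" example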

Regards,

Novak
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

What you are looking into doing is "historization" of data. Depending upon your rules it can be simple or quite complex. I have just implemented program code for doing exactly this - the specifications were over 50 pages long and peppered with terms such as "temporal irreducibility" and "backward level changes" and other such concepts.

In our case, because of the complex rules, I ended up writing a program, called from the job, which computes the historized records. We also have both effective_from/to and valid_from/to timestamp fields to complicate things.

Your example shows that you can have multiple updates for the same key during one run, which certainly complicates matters. Since the logic implies that you need to read all "current" non-terminated dimension values for each source record, and you might have to update/insert more than one record in the dimension table for each input record, there is no way I can think of to solve it with a single stage in PX. Perhaps my view is biased, but I can't think of an alternative to writing a routine.

IBM has written such a historization operator, but the price tag is probably rather high; if you have a budget for external software you can contact them to see whether they would sell it.

The code I've written here is, unfortunately for you, proprietary to the customer and I don't know if they would be willing to part with it; but that is another alternative if you don't feel up to writing routines in either UV/BASIC or C++.
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

Thanks for looking into this ArndW.

I kind of expected it would take more than one stage. As long as the number is not ridiculous and performance does not suffer, it should be fine?
Unfortunately, BASIC code has to be avoided, as the client's standards dictate maximum transparency.

Regards,

Novak
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Novak wrote:the client's standards dictate the maximum transparency
The BASIC language is easily understood - the "B" in "BASIC" stands for "Beginners'" after all.

Resist stupid requirements!

Standards are a fine thing, when used intelligently. Try not to work for Vogons in your next engagement.
Novak wrote:the client's standards dictate the maximum transparency
Does this mean that they insist on using hammers made from glass? :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Vogons? Yikes, please tell me you won't start spouting their poetry next? :shock:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

One day, Ray, I will be able to choose a client as well :D but right now I am stuck here.
So, is there nothing you guys can suggest that would use only the canvas stages?

Regards,

Novak
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

This can be achieved using a Transformer stage with the help of stage variables. One thing I am not sure about in your process is where exactly your SKs are getting created: before or after the lookup against the dimension table.

Assuming you are using the BusKey to look up against the dimension table, hash partition the data coming from the source table on BusKey and sort (only sort, do not re-partition) on InputTS in descending order. In the Transformer you need to define a few stage variables and check whether the key value changes between two consecutive records. If the key value changes, you populate the EndTS with 9999-12-31 00:00:00; if not, you use the InputTS (the new StartTS) of the previous record as the EndTS, which should be stored in a separate stage variable.

The Transformer stage (on the output link of the Lookup stage) should have two output links in your design: one flow to update/end-date the records coming from the source table, and another flow to end-date the records from the dimension table. From your example, your output will have two records for K1 from the source, but with additional columns from the dimension table.

I think the information provided here should get you started. I know what you have to do, but it's just so late in the day... if you can't get it working, we are always here.
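A rough Python equivalent of that stage-variable logic, purely to illustrate the previous-row comparison (the row data and variable names are mine; this is not DataStage syntax):

Code: Select all

HIGH_DATE = "9999-12-31 00:00:00"

# K1's source rows, hash-partitioned on BusKey and sorted by InputTS descending.
rows = [
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data2", "2010-07-30 09:12:12"),
]

prev_key, prev_ts = None, None  # the "stage variables" carried across rows
for bus_key, change_key, input_ts in rows:
    # The first (newest) row of each key stays open; later rows are
    # end-dated with the previously seen, i.e. next-newer, InputTS.
    end_ts = HIGH_DATE if bus_key != prev_key else prev_ts
    print(bus_key, change_key, input_ts, end_ts)
    prev_key, prev_ts = bus_key, input_ts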
Kris

Where's the "Any" key?-Homer Simpson
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

A simple solution would be to sort on the key and date in descending order, use a stage variable to put the date of the previous record as the end date of the current record if the key matches (you can do change detection in stage variables too), and send only one record per key value, the one with the minimum date, to SCD/upsert/Change Capture (I prefer Change Capture), as sketched below. Then load all the other records to the database after that.

Don't forget to use hash partitioning on the key.
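As a rough illustration of that split in Python (made-up rows; only the retention logic matters):

Code: Select all

from itertools import groupby

# End-dated rows, hash-partitioned on the key and sorted by date descending.
rows = [
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K2", "Data3", "2010-07-30 13:12:12"),
    ("K2", "Data2", "2010-07-30 12:12:12"),
]

# The minimum-date record per key goes to Change Capture, since it is the one
# that closes out the currently open dimension row; the rest insert directly.
to_change_capture, straight_inserts = [], []
for _, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)
    oldest = min(grp, key=lambda r: r[2])
    to_change_capture.append(oldest)
    straight_inserts += [r for r in grp if r is not oldest]

print(to_change_capture)  # the two Data2 rows, one per key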
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Because you can get multiple updates in one run, a Transformer stage with stage variables isn't sufficient. One input can generate two outputs, and the next input might have an overlapping period with those two records, thus involving two updates on the existing records (to terminate them) and two inserts for the new ranges.

I mentioned BASIC because it is simple; in your customer's case you will probably need to write a C++ program.
kris007
Charter Member
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

ArndW wrote:Because you can get multiple updates in one run, a transform stage with stage variables isn't sufficient.
A Transformer stage with Stage Variables along with a Remove Duplicates stage should be sufficient as long as we tell the Stage Variables what to do.
One input can generate 2 outputs
Based upon the OP's example, one input record can generate only one output record, with additional columns from the reference link (the dimension table), as long as you are reading data from the dimension table where EndTS = 9999-12-31 00:00:00 (this is my assumption, and most likely what the OP is doing). Taking the OP's example, this is how the output of the Lookup stage would look for BusKey = K1 on the source table --

Code: Select all

BusKey|ChangeKey|InputTS|SK|OrigSK|Dimension.BusKey|Dimension.ChangeKey|Dimension.StartTS|Dimension.EndTS 
K1|Data2|2010-07-30 09:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00 
K1|Data3|2010-07-30 11:12:12|51|51|K1|Data1|2010-07-19 09:12:12|9999-12-31 00:00:00 
Once we feed the output link of the Lookup stage to a Transformer stage, we can define some stage variables. The important thing here is to hash partition and sort the data on SourceTable.BusKey, and additionally sort (only sort) on SourceTable.InputTS in descending order.

Code: Select all

svEndTS     = If SourceTable.BusKey <> svOldBusKey Then '9999-12-31 00:00:00' Else svOldEndTS
svOldEndTS  = SourceTable.InputTS
svOldBusKey = SourceTable.BusKey

Note that stage variables are evaluated top to bottom for each row, so svEndTS must be derived while svOldBusKey and svOldEndTS still hold the previous row's values, before they are overwritten with the current row's values.
At this point, I would define two output links for the Transformer stage. Since the format of the output data is

Code: Select all

SK|OrigSK|BusKey|ChangeKey|StartTS|EndTS 
All we need to do here is map the columns from the source table to the output link and set EndTS to the stage variable svEndTS. This will produce the output records

Code: Select all

54|51|K1|Data3|2010-07-30 11:12:12|9999-12-31 00:00:00
53|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12 
On the second output link of the Transformer, map the columns from the Dimension Table to the output link and map SourceTable.InputTS to the EndTS on the output link. This will produce output records

Code: Select all

51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 11:12:12
You just need to feed the second output link to a Remove Duplicates stage and retain the required record based on how you sort the data on the EndTS field, as sketched below.
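A tiny Python illustration of that retention rule (again illustrative, not DataStage):

Code: Select all

# Close-out candidates for SK 51 from the second link (one per source row).
candidates = [
    (51, 51, "K1", "Data1", "2010-07-19 09:12:12", "2010-07-30 09:12:12"),
    (51, 51, "K1", "Data1", "2010-07-19 09:12:12", "2010-07-30 11:12:12"),
]

# Keep the earliest EndTS per SK -- the dimension row must be closed at the
# first new StartTS. This is what Remove Duplicates retains after an
# ascending sort on EndTS.
best = {}
for row in candidates:
    sk, end_ts = row[0], row[5]
    if sk not in best or end_ts < best[sk][5]:
        best[sk] = row

print(list(best.values()))  # keeps only the 2010-07-30 09:12:12 close-out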

Now, finally, you just funnel together the first output link from the Transformer stage and the output link from the Remove Duplicates stage, and you will have the required result.

Hope that helps.
Kris

Where's the "Any" key?-Homer Simpson
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Kris - that is a good analysis of the problem and makes sense for the example given, which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the logic gets complex beyond the scope of stage variables.
HariK
Participant
Posts: 68
Joined: Thu May 17, 2007 1:50 am

Post by HariK »

I am considering BusKey K1 (SK 51) as an example.

BusKey|ChangeKey|StartTS
K1|Data2|2010-07-30 09:12:12
K1|Data3|2010-07-30 11:12:12

Union this with the existing dimension data:

BusKey|ChangeKey|StartTS
K1|Data1|2010-07-19 09:12:12

If required, add a Remove Duplicates stage here, based on data analysis.

Then sort on BusKey (asc) and StartTS (desc); the output should look like:
BusKey|ChangeKey|StartTS
K1|Data3|2010-07-30 11:12:12
K1|Data2|2010-07-30 09:12:12
K1|Data1|2010-07-19 09:12:12

The next step would be to identify OrigSK, which can be achieved by a lookup (match only on BusKey, do not reject any data); the output should look like:
OrigSK|BusKey|ChangeKey|StartTS
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12
51|K1|Data1|2010-07-19 09:12:12

The EndTS can be calculated in a Transformer, using stage variables to carry the previous row's values over to the current row:

|51|K1|Data3|2010-07-30 11:12:12|highdate
|51|K1|Data2|2010-07-30 09:12:12|2010-07-30 11:12:12
51|51|K1|Data1|2010-07-19 09:12:12|2010-07-30 09:12:12


The next step would be to identify SK, which can be achieved by a lookup (match on BusKey and StartTS, reject if no match); the output should look like:

Output to target
SK|OrigSK|BusKey|ChangeKey|StartTS
51|51|K1|Data1|2010-07-19 09:12:12
Reject
51|K1|Data3|2010-07-30 11:12:12
51|K1|Data2|2010-07-30 09:12:12

The rejects should go through a key generator on the combination of BusKey and StartTS, then be inserted into the target.

All the records in the dimension table would be processed every day, which may be a concern if data volumes are high.
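For what it's worth, the whole flow condenses into a small Python sketch (illustrative only; the dictionaries stand in for the two lookups, and the new SK numbering is arbitrary, as it would really come from the key generator):

Code: Select all

from itertools import groupby

HIGH_DATE = "9999-12-31 00:00:00"

# Union of the source rows and the existing dimension row for K1, as above.
unioned = [
    ("K1", "Data2", "2010-07-30 09:12:12"),
    ("K1", "Data3", "2010-07-30 11:12:12"),
    ("K1", "Data1", "2010-07-19 09:12:12"),  # from the dimension table
]

orig_sk = {"K1": 51}                               # lookup on BusKey only
existing_sk = {("K1", "2010-07-19 09:12:12"): 51}  # lookup on BusKey + StartTS

next_sk = 53
ordered = sorted(unioned, key=lambda r: (r[0], r[2]))[::-1]  # StartTS descending
for bus_key, grp in groupby(ordered, key=lambda r: r[0]):
    end_ts = HIGH_DATE                 # the stage-variable carry-over, per key
    for _, change_key, start_ts in grp:
        sk = existing_sk.get((bus_key, start_ts))
        if sk is None:                 # reject path: generate a new surrogate key
            sk, next_sk = next_sk, next_sk + 1
        print(sk, orig_sk[bus_key], bus_key, change_key, start_ts, end_ts)
        end_ts = start_ts              # this StartTS end-dates the next-older row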
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

ArndW wrote:which presupposes that every new record has an effective "EndTS" of HIGH-DATE. If that is not the case then the log ...
Even if it's not there, you can take only the active version, or the one with a null EndTS. I think I see what you are trying to say, but it seems a bit different from the example given (correct me otherwise). Hence I will wait for the OP to clarify whether the solution provided is sufficient or not; otherwise we will have to go another way.

It does assume that the target data is already in the correct format the output is supposed to have, with this logic then feeding data into it, or that you are building it from scratch.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Novak
Participant
Posts: 97
Joined: Mon May 21, 2007 10:08 pm
Location: Australia

Post by Novak »

Hi guys,

I appreciate the sensible inputs. Unfortunately, I will not get the chance to try them until Monday. I will let you know how I go then.

Regards,

Novak