Populating Fact and Dimension Tables

Post questions here related to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Vipin
Participant
Posts: 15
Joined: Thu Oct 16, 2003 4:05 am
Location: India

Populating Fact and Dimension Tables

Post by Vipin »

Hi,

I have a question, just for the sake of curiosity and knowledge.

Suppose I have populated my fact table and dimension tables using a set of datastage jobs.

What should I do to refresh my DW in the future?

I mean, if I want to load only the data that is new and was not in the source the last time I ran my job.

Can the same job do this? If yes, how would that job identify the data that has already been extracted, transformed and loaded into the DW?

Because I think that if I try doing it with the same job I created for the first-time load, the job would abort.

Any suggestions..??

Manoj.
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

You should look for posts regarding slowly changing dimensions, such as roy's answer in:
viewtopic.php?t=177

I think you'll get a very detailed answer there.


HTH,
Amos
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Basically you've identified the issue: the initial load and subsequent incremental loads are slightly different. Ignoring issues of how you want to manage slowly changing dimensions for the moment, the main difference is detecting the changed source records (including those that might be being changed by the ETL process).

There are several strategies for doing this, most of which are covered either in the Ascential training classes (particularly Best Practices) or in posts to these fora.

Change detection can (should?) be built into the initial load job, even though it should detect no changes in that circumstance. It means that you can use the same job for both initial and incremental loads. This is one of the many advantages of comprehensive planning and of using the experience represented in the classes and on these lists. (And, dare I add, of hiring experienced consultants?!)
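
For illustration, the change-detection logic might look something like the query below (table and column names are made up; in a Server job you would more typically do this with a hashed file lookup in a Transformer, but the SQL shows the idea). Against an empty warehouse every source row falls out as "new", so the same logic serves the initial load as well.

    SELECT s.customer_id,
           s.customer_name,
           s.last_update_date
    FROM   src_customer s
    LEFT OUTER JOIN dw_customer_dim d
        ON d.customer_id = s.customer_id
    WHERE  d.customer_id IS NULL                      -- not in the warehouse yet
        OR s.last_update_date > d.last_update_date;   -- changed since last load
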
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
girishoak
Participant
Posts: 65
Joined: Wed Oct 29, 2003 3:54 am

Post by girishoak »

Hi Vipin,

To achieve your requirement, what I think is that you should add a column in the source table holding the value 'Y' or 'N' to indicate whether the row has been extracted or not. Or you can put in a timestamp column that holds the extraction date for each row.

Also, to use such a solution, you need to modify the extraction query, for example: extracted_rec (this is the proposed column name, varchar2(1)) = 'N', or extracted_date > the date from which you want to extract.

Then you need to run this job to extract the desired data. Don't forget to update the column value afterwards; otherwise you won't be able to identify the new records next time.
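
A rough sketch of what that might look like (names are only illustrative, and the literal date stands for the date from which you want to extract):

    -- extraction query: pick up rows not yet extracted, or changed after a given date
    SELECT *
    FROM   source_table
    WHERE  extracted_rec = 'N'
        OR extracted_date > TO_DATE('2004-01-31', 'YYYY-MM-DD');

    -- after the extracted rows have loaded successfully, flag them
    UPDATE source_table
    SET    extracted_rec  = 'Y',
           extracted_date = SYSDATE
    WHERE  extracted_rec = 'N';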

The bottom line is you should have some field in your source system that will indicate that the record is new.

I hope this will help you. Let me know if you have any other suggestions or ideas.

Thanks

Girish Oak
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I think you need to read Kimball's The Data Warehouse Lifecycle Toolkit for starters. You also should visit the following sites:

Highly recommended:
http://www.dwinfocenter.org/

Visit after a strong cup of coffee:
http://www.ralphkimball.com/
http://www.inmoncif.com/

Visit when you convince management to pay to send you to a conference:
http://www.dw-institute.com/


First, a onetime load of all data into your warehouse has different design considerations than a cyclical incremental update. Second, a daily incremental update requires design considerations to determine the "changed data". Third, you have to have an enterprise architecture for your data warehouse.

The last item means that the ETL tool alone is not going to get you where you need to go. You may have to define storage structures in the data warehouse that relate almost purely to the infrastructure and have no tangible benefit to the end user. This translates into $$$ for the disk space to do things like setting up a persistent staging database. From my experience, most companies are not willing to put changes into a legacy system to facilitate incremental data extraction. But I'm not typing for points here; Kimball discusses this at length in his book, and with much better English than mine.

As for making a one-time load jobstream work for cyclical processing, this almost never works. One-time loads are usually insert-only processes, whereas cyclical loads involve lookups against the target to see if the row is already there, surrogate key assignment, foreign surrogate key substitution, slowly changing dimensions, incremental aggregate updates, etc. You have different volume considerations, as well as all of the details of changed data detection.
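
Just to illustrate the difference (with made-up names, and ignoring slowly changing dimensions), an incremental dimension load ends up looking more like an upsert than a straight insert:

    MERGE INTO dw_customer_dim d
    USING stg_customer s
       ON (d.customer_id = s.customer_id)
    WHEN MATCHED THEN
        UPDATE SET d.customer_name    = s.customer_name,
                   d.last_update_date = s.last_update_date
    WHEN NOT MATCHED THEN
        INSERT (customer_key, customer_id, customer_name, last_update_date)
        VALUES (customer_dim_seq.NEXTVAL, s.customer_id,
                s.customer_name, s.last_update_date);

That sketch is a Type 1 overwrite; a Type 2 slowly changing dimension would also have to expire the old row and insert a new version, which is exactly why the one-time load jobstream doesn't carry over.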

If a source table is of insignificant volume, it can be expedient to simply grab the whole table and retransform it, throwing away any rows during transformation that are already in the warehouse and have no material difference. If the source table is of significant volume, then you have to look for means to slice it by time. A last_update_date column can be helpful, but if your source table is something like a G/L table, you can extend your ETL to track the last time it ran and for what business date. Your selection criteria can include the last successful business date as a parameter into the WHERE clause for a transaction/posting date column.
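
A rough sketch of that approach (control table and column names are made up):

    -- what business date did the last successful run cover?
    SELECT last_business_date
    FROM   etl_run_control
    WHERE  job_name = 'load_gl_fact';

    -- extraction query, with that value passed in as a parameter
    SELECT *
    FROM   gl_transactions
    WHERE  posting_date > :last_business_date;

    -- after a successful load, advance the watermark
    UPDATE etl_run_control
    SET    last_business_date = :current_business_date
    WHERE  job_name = 'load_gl_fact';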

Lastly, if the source table holds dimensional-type information with no time element (unlike a fact table), or is a fact table that is not insert-only, you have to look at one of the following: an ordered full table extract to file, followed by a bitwise diff against the previous run's extract to pop out new or changed rows; a trigger on insert/update that captures the primary key and change date to an audit table for the ETL process to use in a SELECT; or a transaction log sniffer (CDC) to assist in tracking changes to the table.
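
The trigger/audit-table option, sketched with made-up names, looks roughly like this:

    -- audit table holding just the key and the change date
    CREATE TABLE customer_change_audit (
        customer_id  NUMBER,
        change_date  DATE
    );

    -- trigger captures every insert/update on the source table
    CREATE OR REPLACE TRIGGER trg_customer_audit
    AFTER INSERT OR UPDATE ON customer
    FOR EACH ROW
    BEGIN
        INSERT INTO customer_change_audit (customer_id, change_date)
        VALUES (:NEW.customer_id, SYSDATE);
    END;
    /

    -- the ETL extract then only touches rows the audit table points at
    SELECT c.*
    FROM   customer c
    WHERE  c.customer_id IN (SELECT a.customer_id
                             FROM   customer_change_audit a
                             WHERE  a.change_date > :last_run_date);
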
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle