Summing up Financial Year to Date data 1 April to Mar 31

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

DaMouse
Participant
Posts: 4
Joined: Mon Mar 28, 2005 10:19 pm
Location: Australia

Summing up Financial Year to Date data 1 April to Mar 31

Post by DaMouse »

Hi there!

Has anyone had to deal with summing up data by a financial year where the start date is in one year and the end date is in the next?
For example, the financial year begins on 1 April each year and ends on 31 March the following year, e.g. 01/04/2005 - 31/03/2006.

Currently where I am working the financial year is changing from a very simple Jan 1 to Dec 31 to the financial year I have mentioned above, 1 April to Mar 31.

This means our YTD figures need to be calculated based on this time period.

I'm not sure how you would go about it. The problem is how to sum once you get to the months of January, February and March.

Has anyone encountered this scenario before and, if so, how do you manage it?

Any help is very much appreciated!
thanks,
DaMouse
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your issue is a simple one, something your architect should have solved quite easily using a Time dimension. Read Kimball's The Data Warehouse Toolkit; he covers it extensively.

A table that contains every date from some arbitrary point in the past, such as 1700-01-01, through the near future, say 2099-12-31, is quite useful. You can have many column attributes that describe a day. The columns also carry indicators for which dates are business days, which are holidays, and which are leap days.

The useful trick is assigning quarters to dates. The fun part is that there are actually three "standard" calendars. There's the one we're all used to, which starts in January and ends in December. Then there's the "fiscal" calendar, where the 2006 year begins October 1st, 2005. The FUN one is the "360/365" calendar, where there are no months, just 12 x 30-day "periods", with 5 days that don't exist between Christmas and New Year's.

The point is that the Time dimension allows you to report any number using any reporting calendar you wish, simply by joining dates to the Time dimension and grouping/filtering by attributes found there.
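
As a rough sketch only (every table and column name below is invented for illustration, not a prescribed layout), the idea looks something like this in SQL:

-- A minimal Time dimension sketch: one row per calendar day, with attributes
-- describing that day (column names are illustrative only).
CREATE TABLE date_dim (
    calendar_date      DATE      NOT NULL PRIMARY KEY,
    calendar_year      NUMBER(4) NOT NULL,
    fiscal_year        NUMBER(4) NOT NULL,   -- Apr-Mar year: 2005 covers 01-Apr-2005 to 31-Mar-2006
    fiscal_quarter     NUMBER(1) NOT NULL,
    business_day_flag  CHAR(1)   NOT NULL,   -- 'Y'/'N'
    holiday_flag       CHAR(1)   NOT NULL
);

-- Fiscal YTD then becomes a join plus a filter on the dimension's attributes
-- (sales_fact, txn_date and amount are assumed names).
SELECT d.fiscal_year,
       SUM(f.amount) AS fin_ytd_amount
FROM   sales_fact f
JOIN   date_dim   d ON d.calendar_date = f.txn_date
WHERE  d.fiscal_year = 2005
AND    f.txn_date   <= DATE '2005-09-30'    -- the "to date" cut-off
GROUP  BY d.fiscal_year;

Swap the attributes in the WHERE clause and you can report the same numbers against any of the calendars above.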
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
DaMouse
Participant
Posts: 4
Joined: Mon Mar 28, 2005 10:19 pm
Location: Australia

Post by DaMouse »

Yes, I have just finished updating the Time Dimension to cater for the new financial year. That was the easy part.

What I should have explained is this.

We have two databases. One we class as the Data Warehouse; this is where we load the data from the source systems. We then have another database, which is our data mart, and this contains the Time Dimension (date_dim).

We query our data warehouse to sum up YTD numbers, which are then pushed into the datamart (note these are not FIN YTD figures).

Where I am working, they now want new measures created that provide Financial YTD figures.

Our data warehouse does not contain a date_dim table. So we can't use this to work out the FIN YTD results.

So, thinking out loud here: in an ETL job, would the best solution be to query what we need from the data warehouse database, include a hash table that is basically the date_dim, join the data warehouse query results to that date_dim hash table, and then sum over the hash table's dates to get the YTD figures for the fiscal year?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Why not simply put the Time dimension in the warehouse? It's a relatively small, static table; what's the harm? Putting it in a hash table, do you mean hash file? If you're using PX, it would be a dataset.

The PX and Server aggregators are not your preferred solution for complex aggregates. SQL is quite powerful and likely easier to write than the configuration in an Aggregator stage. You should consider using SQL, since most of your mapping documents would be written in pseudo-SQL anyway.
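
For instance, a rollup like the one below could be pushed into the source SQL of your extract rather than configured in an Aggregator stage (the fact table and column names are assumptions for the sake of the sketch):

SELECT d.fiscal_year,
       f.product_key,
       SUM(f.amount) AS fin_ytd_amount
FROM   sales_fact f
JOIN   date_dim   d ON d.calendar_date = f.txn_date   -- date_dim copied into or reachable from the warehouse
WHERE  d.fiscal_year = 2005
GROUP  BY d.fiscal_year, f.product_key;

The database does the heavy lifting and DataStage only sees the already-summed rows.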
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
DaMouse
Participant
Posts: 4
Joined: Mon Mar 28, 2005 10:19 pm
Location: Australia

Post by DaMouse »

No need for a TIME DIM in the data warehouse database.

The following code works out the correct financial year dates without using a time dim.

-- Generate the last day of each of the 12 months ending 31-Mar-2005,
-- then derive the financial-year start date (1 April) for each month-end.
SELECT CASE
         -- Jan, Feb, Mar belong to the financial year that began the previous April
         WHEN TO_CHAR(end_date, 'MM') < '04'
           THEN TO_DATE(TO_CHAR(ADD_MONTHS(end_date, -12), 'YYYY') || '-04-01', 'YYYY-MM-DD')
         -- Apr to Dec belong to the financial year that began in the same calendar year
         ELSE TO_DATE(TO_CHAR(end_date, 'YYYY') || '-04-01', 'YYYY-MM-DD')
       END AS start_date,
       end_date
FROM  (SELECT LAST_DAY(ADD_MONTHS(TO_DATE('2005-03-31', 'YYYY-MM-DD'), -ROWNUM + 1)) AS end_date
       FROM   ALL_OBJECTS
       WHERE  ROWNUM <= 12)
ORDER  BY 2

A Time Dim should be used for the business view of the information, not when transforming data and loading it into a data mart.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Repeating this query continuously can't be efficient, nor can it be indexed or optimized in any fashion. The join to ALL_OBJECTS is not a good idea: the query parser will expand this logic when it is joined with other tables, and you won't get the benefit of any kind of optimized join. Forget about even getting a star join.

Since this data is static and never changes, why not save it in a table and then index and optimize it? Building static data every single time you run doesn't make sense. Your blanket statement "No need for a TIME DIM in the data warehouse database" is patently wrong. Even a DBA would tell you to materialize this data in a table so that it can be tuned and indexed. Especially with high-volume data builds, you need to remove any "virtual" tables such as this. In addition, creating a view out of this query is equally bad; you gain nothing.
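
For example (the table and index names here are just placeholders), your own query could populate a real table once, which can then be indexed and joined like any other:

-- Materialize the static financial-year periods once instead of deriving them on every run.
CREATE TABLE fin_year_period AS
SELECT CASE
         WHEN TO_CHAR(end_date, 'MM') < '04'
           THEN TO_DATE(TO_CHAR(ADD_MONTHS(end_date, -12), 'YYYY') || '-04-01', 'YYYY-MM-DD')
         ELSE TO_DATE(TO_CHAR(end_date, 'YYYY') || '-04-01', 'YYYY-MM-DD')
       END AS start_date,
       end_date
FROM  (SELECT LAST_DAY(ADD_MONTHS(TO_DATE('2005-03-31', 'YYYY-MM-DD'), -ROWNUM + 1)) AS end_date
       FROM   ALL_OBJECTS
       WHERE  ROWNUM <= 12);

-- Now the optimizer has something to work with.
CREATE INDEX fin_year_period_ix ON fin_year_period (end_date);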
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
DaMouse
Participant
Posts: 4
Joined: Mon Mar 28, 2005 10:19 pm
Location: Australia

Post by DaMouse »

Get off your high horse.
You clearly didn't understand what I said.

We have two databases here.

The 1st database is what we class as the data warehouse. This contains all the source systems' data.

We then transform and load this data into the 2nd database, which is a datamart and contains the 'business view' of the data. In this database, we have all the dimension tables, date_dim included.

Our ETL jobs don't process that much information, so using ALL_OBJECTS is fine for what we are trying to do here. It may not be 'best practice', but we are only processing a very low volume of information with this query.

The data warehouse database was not designed by me, and the DBAs where I am working have not mentioned anything about putting a TIME dimension in the data warehouse database. In a datamart, for sure, it is definitely required.

Anyway, I am not going to argue with you about 'best practice' for data warehouses. Every business has its individual needs, and a data warehouse should be tailored to the organisation whilst following the general principles for designing a data warehouse model.
jenkinsrob
Participant
Posts: 31
Joined: Mon Dec 01, 2003 6:24 am
Location: London

Post by jenkinsrob »

DaMouse,

There are certain members of DSXchange who take a lot of time to share their considerable knowledge with other users of this site. Kenneth Bland is one such member.

Perhaps you should take his advice less personally, because then you might learn something...
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The complex SQL approach is not as robust or maintainable as having a lookup table such as a time dim. You have duplication of code rather than code sharing; one spelling mistake by a novice programmer and the job will still work and produce results, but all the aggregation will be wrong, e.g. typing '-01-04' instead of '-04-01'.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

It seems any answer I will give will sound like it's coming from my "high horse". I would point out that the foreign key substitution of dates for their time dimension surrogate key is done at the EDW/CDW level, per Kimball, Inmon, etc. By all definitions that I know, surrogate key assignment and substitution takes place here because of the integration, collation, and other activities critical to the EDW. Datamarts usually are simple rollups or aggregations with collapsed, shrunken, or pre-joined dimensions.

You've posted that you're using PX, which, for a gambler, would be a pretty sure bet that you're dealing with high volumes. Playing those odds, I tried to point out the issues with using your posted query as reference SQL. Maybe you get no benefit from the answer, but the 200+ other people who will read this post over the next 12 months might find it interesting.

Whilst thinking out loud, you mentioned using a hash table (hash file?) of the same dimension during transformation. Since your post is PX, I suggested a dataset would be better considering the technology. Instead of copying the Time dimension into the warehouse instance, you were thinking of just using it within the ETL, which was fine. I just suggested using a dataset instead of a hash file so that your job stayed within the PX framework (because, again, I was thinking about performance). However, knowing the Server and PX Aggregator stages well (we call them Aggravators), I recommended SQL-based rollups rather than those stages. This is because streaming a lot of data out of a database just to roll it up externally can be less efficient than rolling it up within the database. If you're rolling up 100M rows to 50K rows, the network spool time for 50K rows is vastly less than spooling 100M rows. DBs like Oracle and UDB are really good at this, but I guessed you weren't using Redbrick, otherwise you would have been forced into using a Time dimension and your aggregations would be internalized within that product.

So then you switched gears and posted a query. In the PX framework, direct database lookups are anathema to performance. If you had 16 nodes simultaneously hitting the database, the bottleneck would be database saturation. If you used this query in a join, you would have nasty SQL. If you created a view of this query and then used the view in your queries, I questioned whether your queries would scale (again thinking about performance and high volumes) because of the repetitious nature of the SQL and the static nature of the data. For the first time, I actually tried to steer you back to your original ideas. A Time dimension table in the EDW would make SQL-based datamart generation easier and more scalable, or you could at least go back to using a statically generated dimension as in the hash file or dataset solution.

It's kind of funny, your first post:
Has anyone encountered this scenario before and, if so, how do you manage it?

Any help is very much appreciated!
I posted: use a time dimension. You responded that you already had, and it worked great.

You then posed an idea using a hash table. I responded that's okay, but consider a dataset since you're on PX. However, I recommended rollups within the database (if your rollup reduction is considerable; I left that part out :oops: ) and thus you need the time dimension there.

I actually thought we were having a great discussion; clearly I was suffering from post-Las Vegas traumatic syndrome. As for not understanding what you said, I rather thought I did, and I was speaking to a larger audience as well as to yourself. I'm sorry your first posting experience here was bad. I'll keep my advice to myself for the rest of this topic and let the others chime in.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle