Roll up monthly data to SCD

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Vinodanand
Premium Member
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Roll up monthly data to SCD

Post by Vinodanand »

Hi,

I have a requirement where in the monthly data needs to be rolled up based on a set of key columns into an SCD format . Per the below example :

Database : Oracle
Key column : Name,State

Code: Select all

Current data : 
-------------------

Name : xyz
State   : new york
Start Date :01-JAN-2009
End Date  :31-JAN-2009

Name : xyz
State   : new york
Start Date :01-FEB2009
End Date  :28-FEB-2009

Name : xyz
State  : Mass
Start Date :01-MAR-2009
End Date  :31-MAR-2009

Changed data : 
---------------------

Name : xyz
State   : new york
Start Date :01-JAN-2009
End Date  :28-FEB-2009

Name : xyz
State  : Mass
Start Date :01-MAR-2009
End Date  :31-MAR-2009
This can be achieved by using a transformer . Is there any other way that we can get the desired result. The volume of records will be over 7 million . Any suggestion is appreciated .

Thanks,
Vinod
changming
Participant
Posts: 68
Joined: Wed Oct 13, 2004 3:35 am

Re: Roll up monthly data to SCD

Post by changming »

use oracle analytic function (Lag or lead)to get current state and next state, current start date and next end date partition by name order by date.
if current is equal to next, use current start date and next end date.
there will be a problem on formance.
any better idea?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

SELECT NAME, STATE, MIN(START_DATE), MAX(END_DATE)
FROM tablename
GROUP BY NAME, STATE

Did I miss something?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Vinodanand
Premium Member
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Hi Ray,
I missed the last set of records in the above example i.e.

Code: Select all

-------------------
Current data : 
------------------- 

Name : xyz 
State   : new york 
Start Date :01-JAN-2009 
End Date  :31-JAN-2009 

Name : xyz 
State   : new york 
Start Date :01-FEB2009 
End Date  :28-FEB-2009 

Name : xyz 
State  : Mass 
Start Date :01-MAR-2009 
End Date  :31-MAR-2009 

Name : xyz 
State   : new york 
Start Date :01-APR-2009 
End Date  :31-APR-2009 

-------------------------
Changed data : 
--------------------- 

Name : xyz 
State   : new york 
Start Date :01-JAN-2009 
End Date  :28-FEB-2009 

Name : xyz 
State  : Mass 
Start Date :01-MAR-2009 
End Date  :31-MAR-2009 

Name : xyz 
State   : new york 
Start Date :01-APR-2009 
End Date  :31-APR-2009 


In this case the above query will not yield the desired results .
I think the lag/lead option by suggested by Changming would work,but need to explore the same.

One way is to process the records using stage variables in a transformer after they are sorted . Are there any other options to do it in data stage ?

Second approach was to sort the data based on the name,start date,end date fields and split into two streams using a copy stage to pass through two remove duplicate stages so as to retain the first and the last record,thereby picking the min and max dates. The only problem being in is I will not be able to join back the two streams .

Thanks again and appreciate your inputs on the same.

Regards,
Vinod
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Funnel stage will bring them back together.

But I think LAG is the optimum solution - what if there is more than one gap in the dates?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Vinodanand
Premium Member
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

ray.wurlod wrote:Funnel stage will bring them back together.

But I think LAG is the optimum solution - what if there is more than one gap in the dates?
Hi,
Ray, I dont think there will be gaps in the dates as the data has already been processed as monthly and we have all the months .

I think it's a trade off between using the Transformer stage in DataStage or using the Oracle Analytical functions and going for a stored procedure.

Has anyone done this before with volume over 7 million... Would DataStage be a better option compared to Oracle ?

Appreciate your suggestions.

Thanks,
Vinod
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If there are no gaps, then my MIN and MAX solution is good. And faster. Especially if these columns are indexed with B-tree indexes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Vinodanand
Premium Member
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

ray.wurlod wrote:If there are no gaps, then my MIN and MAX solution is good. And faster. Especially if these columns are indexed with B-tree indexes.
Thanks Ray. I used the lag and lead Oracle analytic functions to derive the desired results . I need to test it for huge volume and am happy to share if any one needs help on the same.

Regards,
Vinod
Post Reply