SQL query to ETL

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

bobby
Participant
Posts: 86
Joined: Mon Jul 19, 2004 8:31 pm

SQL query to ETL

Post by bobby »

Hi,
The client gave us the requirement below and wants it done with ETL. I have a few questions, if you can help.
They are using the three source tables in the FROM clause; my question is how I can build an ETL job to reproduce this query.
Thanks
Bobby


INSERT INTO
dwadmin.FCT_REVENUE_SUMMARY_JAN_2004
(SELECT MAX(F.REVENUE_KEY),
DECODE(F.REVENUE_TYPE_KEY, 10000003, TRUNC(F.BILL_END_DT, 'MONTH'), TRUNC(POSTED_DT, 'MONTH')),
F.SUBSCRIBER_KEY,
F.SUBSCRIBER_DIM_KEY,
TRUNC(F.CYCLE_DT, 'MONTH'),
TRUNC(F.BILL_START_DT, 'MONTH'),
TRUNC(F.BILL_END_DT, 'MONTH'),
TRUNC(F.TRAN_DT, 'MONTH'),
TRUNC(F.POSTED_DT, 'MONTH'),
DECODE(SIGN(F.PPV_KEY), 1, TRUNC(F.POSTED_DT, 'MONTH'), TRUNC(F.BILL_END_DT, 'MONTH')),
DECODE(SIGN(F.PPV_KEY), 1, TRUNC(F.TRAN_DT, 'MONTH'), NULL),
F.PROGRAM_PACKAGE_DIM_KEY,
F.PROGRAM_PACKAGE_KEY,
F.SERVICE_DIM_KEY,
F.SERVICE_KEY,
DECODE(SIGN(F.PROGRAM_PACKAGE_KEY), 1, 'P',
DECODE(SIGN(F.PPV_KEY), 1, 'V',
DECODE(SIGN(F.SERVICE_KEY), 1, 'S', 'X'))),
DECODE(SIGN(F.PROGRAM_PACKAGE_KEY), 1, 'P' || F.PROGRAM_PACKAGE_KEY,
DECODE(SIGN(F.PPV_KEY), 1, 'V' || F.PPV_KEY,
DECODE(SIGN(F.SERVICE_KEY), 1, 'S' || F.SERVICE_KEY, NULL))),
F.PPV_KEY,
F.REVENUE_TYPE_KEY,
F.PARTIAL_BILLING_FLG,
SUM(F.BILLING_AMT),
F.SOURCE_OF_DATA,
S.REGION_KEY,
S.SUBSCRIBER_TYPE_KEY,
S.SUBSCRIBER_LANGUAGE_KEY,
S.VIP_KEY,
S.SUBSCRIBER_STATUS_KEY,
S.RURAL_URBAN_FLG,
S.PROMOTION_CAMP_KEY,
TRUNC(S.CREATE_DATE, 'MONTH'),
TRUNC(S.DISCONNECT_DT, 'MONTH'),
S.DISCONNECT_REASON_CD,
S.NO_OF_MTH_BILL_PERIOD,
S.DISCOUNT_CD,
TRUNC(SB.BILL_END_DT, 'MONTH'),
SB.SERVICE_DIM_KEY,
SB.SERVICE_KEY,
SB.A_LA_CARTE_LEVEL,
SUM(SB.BILLING_AMT),
SB.PROGRAM_PACKAGE_KEY,
SB.PROGRAM_PACKAGE_DIM_KEY,
SB.SUBS_PRODUCT,
0,
TRUNC(S.CONNECT_DT, 'MONTH') CONNECT_DT
FROM DWADMIN.FCT_ALL_REVENUE F,
DWADMIN.DIM_SUBSCRIBER S,
DWADMIN.DIM_SUBSCRIBER_BASIC SB
WHERE F.SUBSCRIBER_DIM_KEY = S.SUBSCRIBER_DIM_KEY AND
F.REVENUE_KEY = SB.REVENUE_KEY(+) AND
TO_CHAR(TRUNC(DECODE(F.REVENUE_TYPE_KEY, 10000003, F.BILL_END_DT, POSTED_DT), 'MONTH'),'YYYY-MM-DD HH24:MI:SS') = '2004-01-01 00:00:00'
GROUP BY DECODE(F.REVENUE_TYPE_KEY, 10000003, TRUNC(F.BILL_END_DT, 'MONTH'), TRUNC(POSTED_DT, 'MONTH')),
F.SUBSCRIBER_KEY,
F.SUBSCRIBER_DIM_KEY,
TRUNC(F.CYCLE_DT, 'MONTH'),
TRUNC(F.BILL_START_DT, 'MONTH'),
TRUNC(F.BILL_END_DT, 'MONTH'),
TRUNC(F.TRAN_DT, 'MONTH'),
TRUNC(F.POSTED_DT, 'MONTH'),
DECODE(SIGN(F.PPV_KEY), 1, TRUNC(F.POSTED_DT, 'MONTH'), TRUNC(F.BILL_END_DT, 'MONTH')),
DECODE(SIGN(F.PPV_KEY), 1, TRUNC(F.TRAN_DT, 'MONTH'), NULL),
F.PROGRAM_PACKAGE_DIM_KEY,
F.PROGRAM_PACKAGE_KEY,
F.SERVICE_DIM_KEY,
F.SERVICE_KEY,
DECODE(SIGN(F.PROGRAM_PACKAGE_KEY), 1, 'P',
DECODE(SIGN(F.PPV_KEY), 1, 'V',
DECODE(SIGN(F.SERVICE_KEY), 1, 'S', 'X'))),
DECODE(SIGN(F.PROGRAM_PACKAGE_KEY), 1, 'P' || F.PROGRAM_PACKAGE_KEY,
DECODE(SIGN(F.PPV_KEY), 1, 'V' || F.PPV_KEY,
DECODE(SIGN(F.SERVICE_KEY), 1, 'S' || F.SERVICE_KEY, NULL))),
F.PPV_KEY,
F.REVENUE_TYPE_KEY,
F.PARTIAL_BILLING_FLG,
F.SOURCE_OF_DATA,
S.REGION_KEY,
S.SUBSCRIBER_TYPE_KEY,
S.SUBSCRIBER_LANGUAGE_KEY,
S.VIP_KEY,
S.SUBSCRIBER_STATUS_KEY,
S.RURAL_URBAN_FLG,
S.PROMOTION_CAMP_KEY,
TRUNC(S.CREATE_DATE, 'MONTH'),
TRUNC(S.DISCONNECT_DT, 'MONTH'),
S.DISCONNECT_REASON_CD,
S.NO_OF_MTH_BILL_PERIOD,
S.DISCOUNT_CD,
TRUNC(SB.BILL_END_DT, 'MONTH'),
SB.SERVICE_DIM_KEY,
SB.SERVICE_KEY,
SB.A_LA_CARTE_LEVEL,
SB.PROGRAM_PACKAGE_KEY,
SB.PROGRAM_PACKAGE_DIM_KEY,
SB.SUBS_PRODUCT,
0,
TRUNC(S.CONNECT_DT, 'MONTH'));
commit;

update FCT_REVENUE_SUMMARY_JAN_2004
set subscriber_ct = 1
where revenue_key
in (
select max(revenue_key)
from FCT_REVENUE_SUMMARY_JAN_2004
group by
subscriber_key);
commit;
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

It is certainly do-able, but it will take a while to code and verify that you haven't overlooked something.

First of all I'd ask if they have any documentation on the original requirements from when this query was made. Technically you can just look through the query, but it would help to have an understanding of the original requirement here.

Also, if all they need is to get this automated within a DataStage-driven load, you can paste the query into user-defined SQL and make sure the stage has the same number of columns defined as the query outputs.

If you plan to break this up in DataStage there are many approaches; looking at it briefly, I would recommend this:

1) Save the functions (TRUNC, SIGN, DECODE, etc.) for the Transformer stage.

2) It looks like you could select from DWADMIN.DIM_SUBSCRIBER S and
DWADMIN.DIM_SUBSCRIBER_BASIC SB individually and write them to hashed files. Then preload them to memory and use them as lookups in a Transformer that selects from DWADMIN.FCT_ALL_REVENUE F (see the sketch after this list). This should help performance, although that depends on the database server, DS server and network, and it provides some traceability for debugging.

3) The GROUP BY could be done in an Aggregator stage or left in the SQL; this tends to vary by developer preference, and in my experience grouping is sometimes faster in the query and sometimes faster in the Aggregator stage.

4) I'd create a separate job for the INSERT and for the UPDATE statement. It's a pain when you end up with so many jobs, but breaking things out still seems to be the best practice.
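
As a rough sketch of point 2 (column lists lifted from the original query, with the key columns an assumption read off its join conditions), the two dimension extracts feeding the hashed files might look like this; the TRUNC calls are left out so the month truncation can be done in the Transformer, per point 1:

-- extract for the DIM_SUBSCRIBER hashed file, keyed on SUBSCRIBER_DIM_KEY
SELECT S.SUBSCRIBER_DIM_KEY,
S.REGION_KEY,
S.SUBSCRIBER_TYPE_KEY,
S.SUBSCRIBER_LANGUAGE_KEY,
S.VIP_KEY,
S.SUBSCRIBER_STATUS_KEY,
S.RURAL_URBAN_FLG,
S.PROMOTION_CAMP_KEY,
S.CREATE_DATE,
S.DISCONNECT_DT,
S.DISCONNECT_REASON_CD,
S.NO_OF_MTH_BILL_PERIOD,
S.DISCOUNT_CD,
S.CONNECT_DT
FROM DWADMIN.DIM_SUBSCRIBER S;

-- extract for the DIM_SUBSCRIBER_BASIC hashed file, keyed on REVENUE_KEY
SELECT SB.REVENUE_KEY,
SB.BILL_END_DT,
SB.SERVICE_DIM_KEY,
SB.SERVICE_KEY,
SB.A_LA_CARTE_LEVEL,
SB.BILLING_AMT,
SB.PROGRAM_PACKAGE_KEY,
SB.PROGRAM_PACKAGE_DIM_KEY,
SB.SUBS_PRODUCT
FROM DWADMIN.DIM_SUBSCRIBER_BASIC SB;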


hope this helps as a start...
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
bobby
Participant
Posts: 86
Joined: Mon Jul 19, 2004 8:31 pm

Post by bobby »

Hi there,
They have no requirements document; all they gave is the SQL, and they want it done in DataStage.
I have a couple of questions:
1) As you said, user-defined SQL: I think that would be the quick fix. Do you mean import everything into the plug-in metadata and then run the user-defined query?

2) Making the two hashed files: I think I would first have to look at more of the data to identify the business keys, i.e. build some kind of model before I can proceed.


So which do you suggest? Would the user-defined query help more?
Thanks,
Bobby
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Put the SELECT statement in one stage, run the results through a Transformer stage (even if you do nothing to the data) and have the final stage perform the INSERT (or bulk load).

Create another job for the UPDATE statement; similar design: SELECT in one stage, Transformer, UPDATE in the final stage.

Run the jobs from a job sequence, possibly (if appropriate) making the second dependent upon successful completion of the first.
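
For the second job, a minimal sketch of how the UPDATE might be split: the SELECT drives the stream of keys, and the final stage applies the update through user-defined SQL (the :1 bind placeholder is only illustrative, one parameter per input column):

-- source stage: the keys that should be flagged
SELECT MAX(REVENUE_KEY) AS REVENUE_KEY
FROM DWADMIN.FCT_REVENUE_SUMMARY_JAN_2004
GROUP BY SUBSCRIBER_KEY;

-- final stage: user-defined update, executed once per incoming row
UPDATE DWADMIN.FCT_REVENUE_SUMMARY_JAN_2004
SET SUBSCRIBER_CT = 1
WHERE REVENUE_KEY = :1;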
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

The quickest approach is to turn the SELECT clause of the insert statement into a database view. Then you can import the metadata definition of the view straight into DataStage and retrieve rows from it.
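As a hedged sketch (the view name here is made up, and the column list is trimmed; the real view body would simply be the full SELECT from the INSERT above):

CREATE OR REPLACE VIEW DWADMIN.VW_REVENUE_SUMMARY_JAN_2004 AS
SELECT MAX(F.REVENUE_KEY) AS REVENUE_KEY,
F.SUBSCRIBER_KEY,
F.SUBSCRIBER_DIM_KEY,
S.REGION_KEY,
SUM(F.BILLING_AMT) AS BILLING_AMT
FROM DWADMIN.FCT_ALL_REVENUE F,
DWADMIN.DIM_SUBSCRIBER S,
DWADMIN.DIM_SUBSCRIBER_BASIC SB
WHERE F.SUBSCRIBER_DIM_KEY = S.SUBSCRIBER_DIM_KEY AND
F.REVENUE_KEY = SB.REVENUE_KEY(+) AND
TO_CHAR(TRUNC(DECODE(F.REVENUE_TYPE_KEY, 10000003, F.BILL_END_DT, F.POSTED_DT), 'MONTH'),'YYYY-MM-DD HH24:MI:SS') = '2004-01-01 00:00:00'
GROUP BY F.SUBSCRIBER_KEY,
F.SUBSCRIBER_DIM_KEY,
S.REGION_KEY;

The DataStage job then just selects from the view and writes the rows to the target table.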

The long approach is to rewrite it using DataStage functions. Transformation code such as DECODE and TRUNC would be removed from the SELECT statement and replaced by derivation code in a DataStage Transformer. This makes it easier to maintain and report on.
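
For example, with the functions stripped out, the fact extract might reduce to something like this (the DECODE/SIGN/TRUNC logic becomes derivations in the Transformer; the January filter from the original WHERE clause is probably still worth keeping in the source SQL so only one month is extracted):

SELECT F.REVENUE_KEY,
F.REVENUE_TYPE_KEY,
F.SUBSCRIBER_KEY,
F.SUBSCRIBER_DIM_KEY,
F.CYCLE_DT,
F.BILL_START_DT,
F.BILL_END_DT,
F.TRAN_DT,
F.POSTED_DT,
F.PROGRAM_PACKAGE_DIM_KEY,
F.PROGRAM_PACKAGE_KEY,
F.SERVICE_DIM_KEY,
F.SERVICE_KEY,
F.PPV_KEY,
F.PARTIAL_BILLING_FLG,
F.BILLING_AMT,
F.SOURCE_OF_DATA
FROM DWADMIN.FCT_ALL_REVENUE F;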

You can also remove the aggregation from the SELECT clause and put it into a DataStage Aggregator stage. You can remove your left outer join and make that table a lookup in a Transformer stage. In your current SQL the left outer join gives you a lot of NULL fields; doing this join in a transformer lets you add more logic around the handling of missing fields.

With the quick approach you need someone proficient in both Oracle and DataStage to make changes to the job; with the second approach you only need to be proficient in DataStage to maintain it.
bobby
Participant
Posts: 86
Joined: Mon Jul 19, 2004 8:31 pm

Post by bobby »

Hi,
I imported the table into the plug-in metadata, added all the columns to my rows,
and tried to view the data; it is taking forever.
What I am thinking now is to make two hashed files from DIM_SUBSCRIBER and DIM_SUBSCRIBER_BASIC
and then match those hashed files against FCT_ALL_REVENUE.


Could you please guide me:
looking at the query, which business keys can I use to join the dimension and fact tables?
Thanks,
Bobby
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

The query will take forever; it contains a lot of DECODE statements, left outer joins and aggregation. It is placing the load for the transformation and matching on the RDBMS and doing it all in a single request. Moving these functions to DataStage moves the load onto the ETL server and performs them a row at a time.

In end-to-end processing the RDBMS approach may end up being faster; it all depends on the SQL optimisation and efficiency.

I like the approach of a single input stream with the bulk of the data: feed this to a transformer with two hashed file lookups, and then write it out. The aggregation may need to be performed on the input stream and on one or both of the hashed file lookups. Once you remove the joins you may find the aggregation is easier to leave in the RDBMS rather than in a DataStage Aggregator stage.
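
As a rough sketch, with the grouping left in the RDBMS the pre-aggregated input stream could look something like this (grouping columns trimmed here to a few of those in the original query):

SELECT F.SUBSCRIBER_DIM_KEY,
F.REVENUE_TYPE_KEY,
TRUNC(F.BILL_END_DT, 'MONTH') AS BILL_END_MONTH,
TRUNC(F.POSTED_DT, 'MONTH') AS POSTED_MONTH,
MAX(F.REVENUE_KEY) AS REVENUE_KEY,
SUM(F.BILLING_AMT) AS BILLING_AMT
FROM DWADMIN.FCT_ALL_REVENUE F
GROUP BY F.SUBSCRIBER_DIM_KEY,
F.REVENUE_TYPE_KEY,
TRUNC(F.BILL_END_DT, 'MONTH'),
TRUNC(F.POSTED_DT, 'MONTH');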
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Bobby

It depends on the number of records you have in your tables; it won't happen immediately when you start running the job. Check how long the job takes to run, and have your DBA with you to see whether anything can be done to make the query run faster, for example by adding an index. The aggregation will also take time, but it is better done in the database than in the DataStage Aggregator stage. If you are dealing with a big volume of records then go for the database aggregation.

Thanks
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: SQL query to ETL

Post by peternolan9 »

Bobby,
Let me guess: the client has written their ETL in PL/SQL plus Oracle views, etc.? If this is the case, I've put my comment on this on my FAQ page.

http://www.peternolan.com/faqs.htm#A004


My advice to customers who have done this and want to 'upgrade' what they have done with DS?

1. Take out a blank piece of paper.
2. Start designing your ETL properly to take advantage of your ETL tool.

I have advised lots of people not to try to write their ETL in PL/SQL. And it's a good thing they all ignore me, because it creates a lot of work for DS/INFA people to replace all that PL/SQL. ;-)

bobby wrote: Hi,
The client gave us the requirement below and wants it done with ETL. I have a few questions, if you can help.
They are using the three source tables in the FROM clause; my question is how I can build an ETL job to reproduce this query.
Thanks
Bobby
Best Regards
Peter Nolan
www.peternolan.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I too give the "blank piece of paper" advice, but alas it seems that most of the folks are deaf. Or just want to play with their shiny new DS toy!

Anyway, I can make a living doing it right the second time! :?

Folks, it's GREAT advice!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bobby
Participant
Posts: 86
Joined: Mon Jul 19, 2004 8:31 pm

Post by bobby »

vmcburney wrote: The quickest approach is to turn the SELECT clause of the insert statement into a database view. Then you can import the metadata definition of the view straight into DataStage and retrieve rows from it.

I tried this in TOAD and in the DataStage SELECT statement:

DECODE(F.REVENUE_TYPE_KEY, 10000003, TRUNC(F.BILL_END_DT, 'MONTH'), TRUNC(POSTED_DT, 'MONTH')),

TOAD gives me one result;
DataStage gives me something else.
Can you please advise?
Thanks,
Bobby
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Look carefully on the SQL tab to see whether there is any difference at all between the SQL in DataStage and that you're using in TOAD.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bobby
Participant
Posts: 86
Joined: Mon Jul 19, 2004 8:31 pm

Post by bobby »

Hi,
TRUNC(F.CYCLE_DT, 'MONTH')
gives me something like 01/02/04 in TOAD.

In DataStage the Timestamp data type does not help; I have to make the column VarChar, and then it gives
me a date like 01-feb-04. (Timestamp gives the error "attempt to convert string value unsuccessful", but the value shows as 01/02/04 in the source.)
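
One thing I am thinking of trying (just a guess on my part, assuming the difference comes from implicit date-to-string conversion, since TOAD and the DataStage connection may be using different NLS date settings) is to make the format explicit in the SELECT, for example:

-- return the truncated date as a fixed-format string instead of relying on the session date format
SELECT TO_CHAR(TRUNC(F.CYCLE_DT, 'MONTH'), 'YYYY-MM-DD HH24:MI:SS') AS CYCLE_MONTH
FROM DWADMIN.FCT_ALL_REVENUE F;

Would something like that be the right direction?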
Please advise.
Regards
Bobby