Using flat-file to stage data

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Using flat-file to stage data

Post by gateleys »

I have three options for designing a simple job. For each circumstance, say, when the volume of data is small, medium, large, or very large, can you please suggest which of the designs would be good, and why?

Option 1 (direct load to target)
--------
relational source(s) ==> processing ==> relational target(s)

Option 2 (load to flat file, then to target)
--------
relational source(s) ==> processing ==> flat file ==> relational target(s)

Option 3 (load to flat file in one job, then to target in the next job)
--------
job 1:
relational source(s) ==> processing ==> flat file
job 2:
flat file ==> relational target(s)
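
To make option 3 concrete, here is a rough sketch of the two jobs as independently runnable steps. This is Python standing in for DataStage jobs, and the table, column, and file names are invented:

import csv
import sqlite3  # stand-in for the real relational source and target

STAGE_FILE = "customers_stage.csv"  # hypothetical staging file

def job1_extract(src_conn):
    # Job 1: relational source(s) ==> processing ==> flat file
    rows = src_conn.execute("SELECT id, name, balance FROM customers")
    with open(STAGE_FILE, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            # the "processing" step (cleansing, derivations) goes here
            writer.writerow(row)

def job2_load(tgt_conn):
    # Job 2: flat file ==> relational target(s); can be re-run alone
    with open(STAGE_FILE, newline="") as f:
        for row in csv.reader(f):
            tgt_conn.execute(
                "INSERT INTO dw_customers (id, name, balance) "
                "VALUES (?, ?, ?)",
                row,
            )
    tgt_conn.commit()

if __name__ == "__main__":
    job1_extract(sqlite3.connect("source.db"))
    job2_load(sqlite3.connect("target.db"))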

And are there any better options?

Thanks,
gateleys
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Re: Using flat-file to stage data

Post by shawn_ramsey »

gateleys,

Option 3 is what I would do, since it decouples the source system from the destination. In addition, it gives you the ability to recover by re-running only the job that failed.
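
As a rough illustration of that recovery property, here is a minimal checkpoint-style driver (Python pseudocode; the job script names and marker file are hypothetical - in DataStage itself, a Job Sequence with checkpointing plays this role):

import os
import subprocess

CHECKPOINT = "job1.done"  # hypothetical marker file

def run(job_script):
    # run one job; check=True aborts the driver if the job fails
    subprocess.run(["python", job_script], check=True)

if not os.path.exists(CHECKPOINT):
    run("job1_extract.py")           # source ==> flat file
    open(CHECKPOINT, "w").close()    # record that job 1 succeeded
run("job2_load.py")                  # flat file ==> target
os.remove(CHECKPOINT)                # reset for the next cycle

If job 2 fails, a re-run sees the marker and skips straight to the load, without touching the source system again.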
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: Using flat-file to stage data

Post by gateleys »

shawn_ramsey wrote: Option 3 is what I would do since it decouples the source system from the destination.
Apart from the fact that Option 3 provides flexibility in recovering from the failure of either job, what other merits does decoupling the source and target have? Also, how does each of the options fare in terms of performance? The third option certainly looks worse than the first two in that respect, unless it provides other 'worthwhile' advantages. Please suggest.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Flat files are very fast.

Extraction to the staging area can be run in a time window convenient for the source. Decoupling means that you don't have to do the load phase at the same time; it can be run at a time convenient for the target database.

Some folks, with complex transformation/cleansing requirements, even use two staging areas - one after extraction and one before load - so that the transformation/cleansing can occur even if neither source nor target is available.
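
For example, the middle transformation step with two staging areas might look something like this (a Python sketch; the file paths and cleansing rule are invented). It reads the post-extraction file and writes the pre-load file, so it needs neither the source nor the target to be available:

import csv

with open("stage_extracted/orders.csv", newline="") as src, \
     open("stage_to_load/orders.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for order_id, amount, currency in csv.reader(src):
        # example cleansing rule: normalise currency codes
        writer.writerow([order_id, amount, currency.strip().upper()])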

Since the main KPI for ETL is the ability to meet execution windows, I believe that "performance" can sometimes be better achieved with such a decoupled approach.

If the requirement is that the target be populated as quickly as possible after the operational system is updated, then other approaches are warranted. These might include "trickle feed" approaches - much smaller batches executed more frequently - or they might even involve using the real-time interface (RTI) technology available for DataStage. DataStage TX is also a candidate for this type of near-real-time processing.
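
A minimal sketch of the trickle-feed idea, assuming the source has an update-timestamp column to use as a watermark (Python stand-in; the table, column, and file names are hypothetical):

import pathlib
import sqlite3  # stand-in for the real source and target databases

WATERMARK = pathlib.Path("last_run.txt")  # hypothetical watermark store

def trickle(src, tgt):
    # pick up only the rows changed since the previous run
    since = WATERMARK.read_text() if WATERMARK.exists() else "1970-01-01"
    rows = src.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (since,),
    ).fetchall()
    for row in rows:
        tgt.execute(
            "INSERT OR REPLACE INTO dw_customers VALUES (?, ?, ?)", row
        )
    tgt.commit()
    if rows:
        # ISO-format timestamps compare correctly as strings
        WATERMARK.write_text(max(r[2] for r in rows))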
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Ray and Shawn,
Thanks for the response. Will keep your advice in mind while designing jobs.