
Using flat-file to stage data

Posted: Fri Jan 13, 2006 10:21 am
by gateleys
I have three options for designing a simple job. For each circumstance (say, when the data volume is small, medium, large, or very large), can you please suggest which of the designs would be good, and why?

Option 1 (direct load to target)
--------
relational source(s) ==> processing ==> relational target(s)

Option 2 (load to flat-file, then to target)
--------
relational source(s) ==> processing ==> flat-file ==> relational target(s)

Option 3 (load to flat-file in one job, then to target in the next job)
--------
job1:
relational source(s) ==> processing ==> flat-file
job2:
flat-file ==> relational target(s)
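
To make Option 3 concrete, here is a minimal Python sketch (sqlite3, plain CSV, and the table/column/file names are purely illustrative stand-ins for the real source, target, and staging layout):

import csv
import sqlite3

STAGE_FILE = "stage_orders.csv"   # hypothetical flat-file staging area

def job1_extract():
    """Job 1: extract from the relational source into a flat file."""
    con = sqlite3.connect("source.db")          # hypothetical source connection
    try:
        rows = con.execute("SELECT order_id, amount FROM orders")  # hypothetical query
        with open(STAGE_FILE, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["order_id", "amount"])  # header row
            w.writerows(rows)
    finally:
        con.close()

def job2_load():
    """Job 2: load the flat file into the relational target."""
    con = sqlite3.connect("target.db")          # hypothetical target connection
    try:
        with open(STAGE_FILE, newline="") as f:
            r = csv.reader(f)
            next(r)                             # skip the header row
            con.executemany(
                "INSERT INTO orders_fact (order_id, amount) VALUES (?, ?)", r)
        con.commit()
    finally:
        con.close()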

And are there any better options?

Thanks,
gateleys

Re: Using flat-file to stage data

Posted: Fri Jan 13, 2006 11:22 am
by shawn_ramsey
gateleys,

Option 3 is what I would do, since it decouples the source system from the destination. In addition, it gives you the ability to recover by re-running only the job that failed.
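
To illustrate that recovery property, a runner along these lines (the job functions are placeholders for whatever Job 1 and Job 2 actually do) would skip completed steps on restart, so only the failed job re-runs:

import os

def job1_extract():   # placeholder for the real extract job
    pass

def job2_load():      # placeholder for the real load job
    pass

def run_step(name, fn):
    """Run a step unless its success marker already exists."""
    marker = name + ".done"
    if os.path.exists(marker):
        print("skipping", name, "- already completed")
        return
    fn()
    open(marker, "w").close()   # record success so a restart skips this step

run_step("job1_extract", job1_extract)
run_step("job2_load", job2_load)

If job2_load fails, the next run skips job1_extract (its marker exists) and retries only the load.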

Re: Using flat-file to stage data

Posted: Fri Jan 13, 2006 1:33 pm
by gateleys
shawn_ramsey wrote: Option 3 is what I would do, since it decouples the source system from the destination.
Apart from the fact that Option 3 provides flexibility in recovering from a job failure, what merit(s) does decoupling the source and target have? Also, how does each option fare in terms of performance? The third option certainly looks worse than the first two on that score, unless it provides other 'worthwhile' advantages. Please suggest.

Posted: Fri Jan 13, 2006 3:25 pm
by ray.wurlod
Flat files are very fast; sequential file I/O carries none of the database overhead of logging, locking or index maintenance.

Extraction to the staging area can be run in a time window convenient for the source system. Decoupling means that you don't have to run the load phase at the same time; it can be run at a time convenient for the target database.

Some folks, with complex transformation/cleansing requirements, even use two staging areas - one after extraction and one before load - so that the transformation/cleansing can occur even if neither source nor target is available.
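
As a sketch of that double-staging idea (the file names and the cleansing rule are just examples), the transformation step touches only the two staging files, so it needs no database connection at all:

import csv

def transform(extract_file="stage_extract.csv", load_file="stage_load.csv"):
    """Cleanse the extract-side staging file into the load-side staging file."""
    with open(extract_file, newline="") as fin, \
         open(load_file, "w", newline="") as fout:
        r = csv.reader(fin)
        w = csv.writer(fout)
        w.writerow(next(r))                             # copy the header row
        for row in r:
            w.writerow([col.strip() for col in row])    # example cleanse: trim whitespace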

Since the main KPI for ETL is the ability to meet execution windows, I believe that "performance" can sometimes be better met with such a decoupled approach.

If the requirement is that the target be populated as quickly as possible after the operational system is updated, then other approaches are warranted. These might include "trickle feed" approaches (much smaller batches executed more frequently), or might even involve the real-time interface (RTI) technology available for DataStage. DataStage TX is also a candidate for this type of near-real-time processing.
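
Leaving RTI and DataStage TX aside, the generic shape of a trickle feed is a watermark-driven micro-batch loop; a rough sketch (the connections, table and column names are all made up, and a real job would also handle shutdown and errors):

import sqlite3
import time

def trickle_feed(poll_seconds=60):
    """Micro-batch loop: load only rows newer than the last watermark."""
    src = sqlite3.connect("source.db")   # hypothetical source connection
    tgt = sqlite3.connect("target.db")   # hypothetical target connection
    watermark = 0                        # highest order_id already loaded
    while True:
        rows = src.execute(
            "SELECT order_id, amount FROM orders WHERE order_id > ? ORDER BY order_id",
            (watermark,)).fetchall()
        if rows:
            tgt.executemany(
                "INSERT INTO orders_fact (order_id, amount) VALUES (?, ?)", rows)
            tgt.commit()
            watermark = rows[-1][0]      # advance the watermark
        time.sleep(poll_seconds)         # small batches, executed frequently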

Posted: Fri Jan 13, 2006 3:30 pm
by gateleys
Ray and Shawn,
Thanks for the response. Will keep your advice in mind while designing jobs.