I have three options for designing a simple job. For each circumstance — say, when the size of the data is small, medium, large, or very large — can you please suggest which of the designs would be good, and why?
option1 (direct loading to target)
--------
relational source(s) ==> processing ==> relational target(s)
option2 (load to flat-file and then to target)
--------
relational source(s) ==> processing ==> flat-file ==> relational target(s)
option3 (load to flat-file in one job and, in the next job, to target)
--------
job1:
relational source(s) ==> processing ==> flat-file
job2:
flat-file ==> relational target(s)
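To make option 3 concrete, here is a minimal sketch of the two-job flow in Python, using SQLite and CSV purely as stand-ins for the relational systems and the flat file (this is not DataStage code; the table names, columns, and the trivial "processing" step are all assumptions for illustration):

```python
import csv
import sqlite3

def job1_extract(source_db: str, staging_file: str) -> None:
    """Job 1: relational source(s) ==> processing ==> flat-file."""
    con = sqlite3.connect(source_db)
    rows = con.execute("SELECT id, name FROM src_orders")  # hypothetical source table
    with open(staging_file, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "name"])                         # header row
        for rid, name in rows:
            w.writerow([rid, name.strip().upper()])        # stand-in "processing"
    con.close()

def job2_load(staging_file: str, target_db: str) -> None:
    """Job 2: flat-file ==> relational target(s). Rerunnable on its own."""
    con = sqlite3.connect(target_db)
    con.execute("CREATE TABLE IF NOT EXISTS tgt_orders (id INTEGER, name TEXT)")
    with open(staging_file, newline="") as f:
        r = csv.reader(f)
        next(r)                                            # skip header
        con.executemany("INSERT INTO tgt_orders VALUES (?, ?)", r)
    con.commit()
    con.close()
```

Because job2 reads only the staging file, it can be rerun after a target-side failure without touching the source system at all.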
And are there any better options?
Thanks,
gateleys
Using flat-file to stage data
Re: Using flat-file to stage data
gateleys,
Option 3 is what I would do since it decouples the source system from the destination. In addition it gives you the ability to recover by only running the job that failed.
Shawn Ramsey
"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
Re: Using flat-file to stage data
shawn_ramsey wrote: Option 3 is what I would do since it decouples the source system from the destination.
Apart from the fact that Option 3 provides flexibility in recovering from the failure of an individual job, what merit(s) does decoupling the source and target have? Also, how does each of the options fare in terms of performance? The third option certainly looks worse than the first two in terms of performance, unless it provides other 'worthwhile' advantages. Please suggest.
Flat files are very fast.
Extract to staging area can be run in a convenient time window for extraction. Decoupling means that you don't have to do the load phase at the same time; it can be run at a convenient time for the target database.
Some folks, with complex transformation/cleansing requirements, even use two staging areas - one after extraction and one before load - so that the transformation/cleansing can occur even if neither source nor target is available.
Since the main KPI for ETL is the ability to meet execution windows, I believe that "performance" can sometimes be better met with such a decoupled approach.
If the requirement is that the target be populated as quickly as possible after the operational system is updated, then other approaches are warranted. This might include "trickle feed" approaches - much smaller batches executed more frequently, or it might even involve using the real time interface (RTI) technology available for DataStage. DataStage TX is also a candidate for this type of processing in near real time.
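The recovery-and-scheduling benefit described above — each phase runs in its own window, and only the failed phase is rerun — can be sketched with a simple checkpointing driver. This is a hypothetical illustration, not DataStage sequencer functionality; the job names and checkpoint format are assumptions:

```python
import json
import os

def run_pipeline(jobs, checkpoint="pipeline.ckpt"):
    """Run (name, callable) jobs in order, skipping any already checkpointed.

    If a job fails, earlier successes stay recorded in the checkpoint file,
    so a rerun restarts at the failed job only -- the recovery benefit of
    decoupling the extract and load phases into separate jobs.
    """
    done = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = set(json.load(f))
    for name, fn in jobs:
        if name in done:
            continue                           # already succeeded on a prior run
        fn()                                   # may raise; checkpoint survives
        done.add(name)
        with open(checkpoint, "w") as f:
            json.dump(sorted(done), f)
    os.remove(checkpoint)                      # full success: clear checkpoint
```

On the first run after a target-side failure, the extract job is skipped and only the load job is retried — the source system is not touched again.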
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.