Hi,
In our project we extract data from ZIP files, validate it, and then need to write it into a Netezza database. In the jobs already developed, however, one job writes the data to a sequential file and a second job loads it from the sequential file into Netezza. A Sequencer sits between these two jobs, i.e. they are run from a separate job sequence.
When I asked my colleague, she said that you should not do extraction and loading in the same job. Is this correct? Common sense tells me otherwise because, as I understand it, writing to a text file is also a form of loading.
Can the gurus give me an explanation? Please also tell me where I can find guidelines for designing jobs.
Thanks
Kavuri
extraction and loading
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.
It means, among other things, that you only have to perform extraction once (important for "point in time" extraction); that you can have separate time windows in which extraction and loading can occur; that you have restartability and repeatability in your design and (with suitable archiving of the staging area) the ability to re-run as at some particular point in time.
It is arguable that your ZIP files already constitute the post-extraction staging area.
Some sites even use two staging areas, one immediately post extraction and one of load-ready data. This liberates the transformation phase from any time dependency upon the extraction or load phases.
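As a rough illustration of the staging idea above, here is a minimal Python sketch rather than a DataStage design. The function names, the CSV layout and the `load_rows` callable (a stand-in for whatever actually writes to the target database) are all assumptions made for the example. The point is only that extraction writes to a staging directory once, and loading reads from that directory and can be repeated without going back to the source.

```python
import csv
import zipfile
from pathlib import Path
from typing import Callable, List


def extract_to_staging(zip_path: Path, staging_dir: Path) -> List[Path]:
    """Copy every CSV member of the ZIP into the staging area (done once)."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                target = staging_dir / Path(name).name
                target.write_bytes(archive.read(name))
                staged.append(target)
    return staged


def load_from_staging(
    staging_dir: Path,
    load_rows: Callable[[List[List[str]]], None],
) -> int:
    """Load all staged files; re-runnable without touching the source ZIP.

    load_rows is a hypothetical callback standing in for the real
    database load step.
    """
    total = 0
    for staged_file in sorted(staging_dir.glob("*.csv")):
        with staged_file.open(newline="") as handle:
            rows = list(csv.reader(handle))
        load_rows(rows)
        total += len(rows)
    return total
```

Because the two steps share nothing but the staging directory, they can run in different time windows, and a failed load can be restarted from the staged files alone.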
Don't forget also that extraction includes extracting data from the target system, to populate lookup tables expected by the transformation phase.
There is no good reason - ever - to have a Sequencer with a single input and a single output in a job sequence.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The vendor used to have a training class called DataStage Best Practices but I fear it is no longer extant. It was for server edition only (enterprise edition not having been released at the time).
Apart from that, and anything developed in-house at various sites, the answer is no.
-
- Participant
- Posts: 145
- Joined: Fri May 02, 2003 9:59 am
- Location: Seattle, Washington. USA
ray.wurlod wrote: Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.
It means, among other things, that you only have to per ...
The only thing I would add is a suggestion to use a parallel dataset instead of a sequential file to stage the data. Because the dataset is native to EE, you can read and write it in parallel, which you cannot do with a sequential file.
Shawn Ramsey
"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams