extraction and loading

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kavuri
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm


Post by kavuri »

Hi,
In our project we extract data from zip files, validate it, and then need to write it to a Netezza database. In the jobs already developed, however, the data is first written to a sequential file, and a second job then loads it from the sequential file into Netezza. A Sequencer is used between these two jobs, i.e. they are run from a separate job sequence.

When I asked my colleague, she told me that you should not do extraction and loading in the same job. Is this correct? Common sense tells me otherwise, because, as I understand it, writing to a text file is also a kind of loading.

Can the gurus give me an explanation? Please also tell me where I can find guidelines for designing jobs.


Thanks
Kavuri
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.

It means, among other things, that you only have to perform extraction once (important for "point in time" extraction); that you can have separate time windows in which extraction and loading can occur; that you have restartability and repeatability in your design and (with suitable archiving of the staging area) the ability to re-run as at some particular point in time.
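The restartability point can be illustrated outside DataStage with a minimal Python sketch. It assumes a CSV staging file and uses an in-memory list as a stand-in for the target table; the function names are hypothetical and nothing here is a DataStage or Netezza API. Extraction writes the staging file once, and the load step can be rerun any number of times from the same snapshot.

```python
import csv
import tempfile
from pathlib import Path


def extract_to_staging(source_rows, staging_path):
    """Write rows to the staging file once; do nothing if it is already staged."""
    staging_path = Path(staging_path)
    if staging_path.exists():
        return False  # a rerun reuses the existing point-in-time snapshot
    tmp_path = staging_path.with_suffix(".tmp")
    with tmp_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(source_rows)
    tmp_path.replace(staging_path)  # publish only after the file is complete
    return True


def load_from_staging(staging_path, target):
    """Reload the target from the staging file; safe to repeat after a failure."""
    with Path(staging_path).open(newline="") as handle:
        rows = list(csv.reader(handle))
    target.clear()  # stand-in for truncate-and-reload on a real database
    target.extend(rows)
    return len(rows)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        staging = Path(workdir) / "stage.csv"
        first = extract_to_staging([["1", "a"], ["2", "b"]], staging)
        second = extract_to_staging([["9", "z"]], staging)  # rerun keeps the snapshot
        target = []
        load_from_staging(staging, target)
        load_from_staging(staging, target)  # repeating the load gives the same rows
        return first, second, target
```

Because the load reads only the staging file, it can run in a different time window from the extraction and never touches the source again.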

It is arguable that your ZIP files already constitute the post-extraction staging area.

Some sites even use two staging areas, one immediately post extraction and one of load-ready data. This liberates the transformation phase from any time dependency upon the extraction or load phases.

Don't forget also that extraction includes extracting data from the target system, to populate lookup tables expected by the transformation phase.

There is no good reason - ever - to have a Sequencer with a single input and a single output in a job sequence.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kavuri
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm

Post by kavuri »

Hi,
Is there a best practices guide?

Thanks
Kavuri
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The vendor used to have a training class called DataStage Best Practices but I fear it is no longer extant. It was for server edition only (enterprise edition not having been released at the time).

Apart from that, and anything developed in-house at various sites, the answer is no.
kavuri
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm

Post by kavuri »

Thanks Ray. What should I do now? Do I need to mark it as resolved?

Thanks
Kavuri
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you think it is, then yes. And maybe start your own best practices book. Tip: use your Favorites folder here.
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Post by shawn_ramsey »

ray.wurlod wrote: Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.

It means, among other things, that you only have to per ...
The only thing I would add is that I suggest using a parallel dataset instead of a sequential file to stage the data. Since the dataset is native to EE, you can read and write its files in parallel, which you cannot do with a sequential file.
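As a rough analogy only, the difference can be sketched in Python: staging into one file per partition lets several workers write and read at the same time, whereas a single flat file is one stream. This is not how a DataStage dataset is implemented; the `part_N.csv` naming, the round-robin split, and the function names are all assumptions made for the illustration.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partition(path, rows):
    """Write one partition's rows to its own file."""
    with Path(path).open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)


def read_partition(path):
    """Read one partition file back as a list of rows."""
    with Path(path).open(newline="") as handle:
        return list(csv.reader(handle))


def stage_partitioned(rows, directory, partitions=4):
    """Split rows round-robin and write the partition files concurrently."""
    buckets = [rows[i::partitions] for i in range(partitions)]
    paths = [Path(directory) / f"part_{i}.csv" for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        list(pool.map(write_partition, paths, buckets))
    return paths


def read_partitioned(paths):
    """Read all partition files concurrently and concatenate their rows."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        parts = list(pool.map(read_partition, paths))
    return [row for part in parts for row in part]


def demo():
    rows = [[str(i)] for i in range(10)]
    with tempfile.TemporaryDirectory() as workdir:
        paths = stage_partitioned(rows, workdir, partitions=3)
        staged = read_partitioned(paths)
    # Row order changes with partitioning, so compare after sorting.
    return len(paths), sorted(staged, key=lambda row: int(row[0])) == rows
```

Note that the rows come back grouped by partition rather than in their original order, which is the usual trade-off of partitioned staging.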
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams