extraction and loading

kavuri · Post by **kavuri** » Thu Dec 20, 2007 6:56 pm

Hi,
In our project we extract data from zip files, validate it, and then need to write it to a Netezza database. In the jobs already developed, the data is first written to a sequential file, and a second job then loads it from the sequential file into Netezza. A Sequencer is used between these two jobs, i.e. they are run from a separate job sequence.

When I asked my colleague, she told me that you should not do extraction and loading in the same job. Is this correct? Common sense makes it hard for me to agree, because, as I understand it, writing to a text file is also a kind of loading.

Can the gurus give me an explanation? Please tell me where I can find guidelines for designing jobs.

Thanks
Kavuri

ray.wurlod · Post by **ray.wurlod** » Thu Dec 20, 2007 7:44 pm

Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.

It means, among other things, that you only have to perform extraction once (important for "point in time" extraction); that you can have separate time windows in which extraction and loading can occur; that you have restartability and repeatability in your design and (with suitable archiving of the staging area) the ability to re-run as at some particular point in time.
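To make the "extract once, re-run the load" point concrete, here is a minimal sketch in Python rather than DataStage. Everything in it is an assumption for illustration: the function names `extract_to_staging` and `load_from_staging`, the CSV staging file, and the in-memory list standing in for the target table. It shows only the pattern, not how any DataStage stage works.

```python
import csv
import os
import tempfile


def extract_to_staging(rows, staging_path):
    """Write extracted rows to the staging file once (hypothetical helper).

    If the staging file already exists, extraction is skipped, so a restart
    reuses the original point-in-time snapshot. Returns True if it wrote.
    """
    if os.path.exists(staging_path):
        return False
    tmp_path = staging_path + ".tmp"
    with open(tmp_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    # Rename only after a complete write, so a crash never leaves a
    # half-written staging file under the final name.
    os.replace(tmp_path, staging_path)
    return True


def load_from_staging(staging_path, target):
    """Replace the target's contents with the staged rows (hypothetical helper).

    The list `target` stands in for the database table. Because the load
    reads only the staging file, it can be repeated with the same result.
    """
    with open(staging_path, newline="") as f:
        rows = list(csv.reader(f))
    target.clear()
    target.extend(rows)
    return len(rows)


def demo_staging():
    staging_path = os.path.join(tempfile.mkdtemp(), "staged_extract.csv")
    source = [["1", "alpha"], ["2", "beta"]]
    target = []

    first = extract_to_staging(source, staging_path)
    source.append(["3", "gamma"])  # the source changes after extraction
    second = extract_to_staging(source, staging_path)  # skipped on re-run

    load_from_staging(staging_path, target)
    load_from_staging(staging_path, target)  # repeating the load is safe
    return first, second, target
```

Calling `demo_staging()` runs the extraction twice and the load twice; the second extraction is skipped, and the target ends up holding only the rows captured at the first extraction.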

It is arguable that your ZIP files already constitute the post-extraction staging area.

Some sites even use two staging areas, one immediately post extraction and one of load-ready data. This liberates the transformation phase from any time dependency upon the extraction or load phases.

Don't forget also that extraction includes extracting data from the target system, to populate lookup tables expected by the transformation phase.

There is no good reason - ever - to have a Sequencer with a single input and a single output in a job sequence.

kavuri · Post by **kavuri** » Thu Dec 20, 2007 8:02 pm

Hi,
Is there a best practices guide?

Thanks
Kavuri

ray.wurlod · Post by **ray.wurlod** » Thu Dec 20, 2007 11:41 pm

The vendor used to have a training class called DataStage Best Practices but I fear it is no longer extant. It was for server edition only (enterprise edition not having been released at the time).

Apart from that, and anything developed in-house at various sites, the answer is no.

kavuri · Post by **kavuri** » Fri Dec 21, 2007 3:24 am

Thanks Ray. Now what should I do? Do I need to mark it as resolved?

Thanks
Kavuri

ray.wurlod · Post by **ray.wurlod** » Fri Dec 21, 2007 3:42 am

If you think it is, then yes. And maybe start your own best practices book. Tip: use your Favorites folder here.

shawn_ramsey · Post by **shawn_ramsey** » Wed Jan 30, 2008 9:57 am

ray.wurlod wrote: Best practice is, indeed, to stage your data at least once between extraction and loading. It is not necessary to do so, merely wise.

It means, among other things, that you only have to per ...

The only thing I would add is to suggest using a parallel dataset instead of a sequential file to stage the data. Since the dataset is native to EE, you can read and write its files in parallel, whereas with a sequential file you cannot.
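As a rough analogy only, and not a description of how EE stores a dataset internally, the idea of staging into several partition files that are written and read concurrently can be sketched in Python. The names `stage_partitioned` and `read_partitioned`, the round-robin split, and the use of threads are all assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_partition(path, rows):
    """Write one partition file, one row per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(row + "\n")
    return path


def read_partition(path):
    """Read one partition file back into a list of rows."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


def stage_partitioned(rows, directory, partitions=4):
    """Split rows round-robin and write each partition concurrently."""
    chunks = [rows[i::partitions] for i in range(partitions)]
    paths = [os.path.join(directory, f"part_{i}.txt") for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        list(pool.map(write_partition, paths, chunks))
    return paths


def read_partitioned(paths):
    """Read all partitions concurrently; row order across partitions is not preserved."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        parts = list(pool.map(read_partition, paths))
    return [row for part in parts for row in part]


def demo_partitioned():
    rows = [f"row{i}" for i in range(10)]
    paths = stage_partitioned(rows, tempfile.mkdtemp(), partitions=4)
    same_rows = sorted(read_partitioned(paths)) == sorted(rows)
    return len(paths), same_rows
```

A single sequential file, by contrast, gives one writer and one reader a single stream to work through, which is the limitation described above.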