Best Practices: Jobs as SOA

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Best Practices: Jobs as SOA

Post by dougcl »

Hi folks, first post here.

We are brand new to DS, and are attempting to get our first job running as a pilot to motivate best practices before taking on more.

I see many references to best practices here on the forum, which is great, but I wonder whether they have been gathered together and documented anywhere, here or elsewhere.

Specifically, I am interested in the use of datasets in jobs. It seems to me that building each interface as three jobs (extract, transform, and load) with datasets passed between them makes a lot of sense from a lot of angles, not least because it encourages documentation and reuse of processed data among our DataStage developers. The obvious con is the use of disk space. I am inclined to spend disk space to achieve all of the benefits, and I believe I can argue this successfully with our server folks, but I would like to hear opinions from the experts here.
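
To make the shape of that concrete, here is a rough sketch of the pattern I have in mind, using plain Python files as a stand-in for DataStage jobs and datasets (the names, columns, and paths are invented for illustration only):

Code:

# Illustrative only: extract -> transform -> load as three separate steps,
# with a persisted file standing in for the dataset between them.
import csv
import pathlib

pathlib.Path("stage").mkdir(exist_ok=True)

def extract(source_rows, landed="stage/orders_landed.csv"):
    # Land the source data as-is; everything downstream reads only this file.
    with open(landed, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(source_rows)
    return landed

def transform(landed, staged="stage/orders_staged.csv"):
    # Works only from the landed dataset, never from the source itself.
    with open(landed, newline="") as src, open(staged, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["order_id", "amount_usd"])
        writer.writeheader()
        for row in csv.DictReader(src):
            writer.writerow({"order_id": row["order_id"],
                             "amount_usd": "%.2f" % float(row["amount"])})
    return staged

def load(staged, target_table):
    # The load step reads only the staged dataset.
    with open(staged, newline="") as f:
        target_table.extend(csv.DictReader(f))

target = []
load(transform(extract([{"order_id": 1, "amount": "19.5"}])), target)
print(target)   # [{'order_id': '1', 'amount_usd': '19.50'}]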

Please forgive me if this has been answered ad nauseum. I did a cursory search here first, but these are general search criteria.

Thanks,
Doug
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hi Doug....tell us a bit more about the patterns you are looking at.....your title is Jobs as SOA (which often implies a request/response pattern that rarely has a persistent store), but your initial question is about DataSets.... all good questions and thoughts --- just want to get an idea of what you are looking for. There are various documents floating around (among other resources) here in the forum, in redbooks, in the formal documentation, etc.....tell us more and we can help point you in the right direction or to the right material.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Hi Ernie,

At this point I would like to consider design principles that span all jobs, no matter the job details, insofar as such considerations are possible.

For example:
If I propose that all inbound data first be landed in a dataset before processing, and that the landed datasets be richly documented, then developers can browse the available datasets rather than consulting the source data. The implication is that all downstream uses of this data share a common upstream job into which common code can be deployed. Downstream developers need not concern themselves with the internals of the upstream job, and vice versa, provided each developer works to the dataset metadata. The jobs are loosely coupled around a rigorous interface-based architecture (based on datasets).

In this situation, some developers can work on the upstream, some on the downstream, and each can become an expert in their own domain. Without this in place, each developer may be inclined to retrieve data from the source tables redundantly and without consistent treatment. I am guessing that this approach should also be applied at the output, right before the data is written to the target tables. I envision developers who are familiar with the source working the source side, others who are familiar with the target working the target side, and some working the middle, mapping datasets to datasets (with no direct source or target access). The resulting jobs are less complicated, and the whole thing is held together with a couple (or more?) layers of sequencers.
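
As a toy illustration of what I mean by "works to the dataset metadata" (again just Python as pseudocode, with invented field names, not anything DataStage-specific):

Code:

# The documented dataset metadata acts as the contract between upstream
# and downstream developers. Field names and types here are invented.
LANDED_ORDERS_SCHEMA = {
    "order_id":   int,
    "customer":   str,
    "amount_usd": float,
}

def conforms(row, schema=LANDED_ORDERS_SCHEMA):
    # A row conforms if it has exactly the agreed columns with the agreed types.
    return set(row) == set(schema) and all(
        isinstance(row[col], col_type) for col, col_type in schema.items()
    )

# The upstream developer only has to produce conforming rows; the downstream
# developer only has to consume them. Neither needs the other's internals.
upstream_output = [{"order_id": 42, "customer": "ACME", "amount_usd": 99.0}]
assert all(conforms(row) for row in upstream_output)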

There are many other benefits that result from this approach, I think, and it seems to correlate with some long-standing ETL design practices. However, it means I will probably need a few TB of disk space, and I want to base that request on more than a hunch. There is also some question here about whether loading from local datasets will be slower than reading from the source database; it seems to me it should be much faster.
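
For what it's worth, the "few TB" figure is only back-of-envelope arithmetic along these lines; every number below is a placeholder, not our real volume:

Code:

# Rough sizing for the staged datasets; every figure here is an assumption.
rows_per_day     = 50_000_000   # assumed daily extract volume
bytes_per_record = 400          # assumed average record width in the dataset
staged_copies    = 2            # landed copy plus pre-load copy per interface
retention_days   = 7            # how long staged datasets are kept on disk

bytes_needed = rows_per_day * bytes_per_record * staged_copies * retention_days
print("%.1f TB" % (bytes_needed / 1e12))   # about 0.3 TB with these numbers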

So the idea is this... if you could do it all over again in a completely green field system, what would you do? We have at this point one big source and one big target. Start with the broadest strokes possible. This is the kind of opportunity I think a lot of people dream about. I wish I had already been through a few implementations to incorporate lessons learned, but I haven't. That's why I am asking here.

Thanks,
Doug
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Ok.... this is certainly a scenario of best practice that is way larger than DataStage alone, and I'm sure we'll have lots of opinions here.

At the same time, it is very broad, and not related to something like SOA. Best probably to recognize that there are hundreds (thousands?) of different approaches and patterns, each with their own justifications.

If we stick primarily to your area described above, I would say that "yes", I see a lot of sites doing such things.....the "intermediate" location may not necessarily be a DataSet, nor might it even be a persistent store. I see and hear the same concept often applied to services, where a central aggregate model is created that every transaction, legacy system, partner application, etc. must conform to.....and it might just be an in-memory transitional model that lives for several milliseconds.....but the concept is generally the same......allow downstream processes to be isolated from new, upstream customized additions.
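
A very rough way to picture that idea, with a made-up model and field names (nothing DataStage-specific, just a sketch):

Code:

# Every source conforms to one central model; downstream code sees only that
# model, so new upstream additions don't ripple through. The model may be
# purely in-memory and live only for the life of a single request.
from dataclasses import dataclass

@dataclass
class CanonicalOrder:            # the central aggregate model
    order_id: str
    amount_usd: float

def from_legacy(rec):            # one small mapper per source system
    # The legacy system can add fields freely; only this mapper changes.
    return CanonicalOrder(order_id=str(rec["ORD_NO"]),
                          amount_usd=rec["AMT_CENTS"] / 100)

def to_target(order: CanonicalOrder):
    # Downstream is written against CanonicalOrder, never against the source.
    return {"id": order.order_id, "amount": order.amount_usd}

print(to_target(from_legacy({"ORD_NO": 7, "AMT_CENTS": 1999, "NEW_FIELD": "ignored"})))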

I'm sure many here in the forum will have suggestions and ideas that are generic and/or applied directly to DataStage for your specific use case.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Thanks Ernie, not to quibble, but I see this as very much aligned with the principles of SOA. But perhaps the term is now so loaded with Web Services connotations that I shouldn't use it. I admit to authoring a provocative subject line. :)


Thanks for your feedback, and yes I think a dataset is an arbitrary basis for an interface, but it is one that presents itself in the DataStage environment. I would love to hear from folks who are attached to alternatives. Accomplishing this aim without requiring disk space would be attractive. However, it seems datasets (more generally, persistent objects) are uniquely positioned to decouple processes temporally.

Regarding a standard to which all things conform, we aren't there. I can see how to get there, I think, but for now the standard is set by the source table and target table metadata. The datasets merely match these schemas, since my proposal is to use them immediately after extracting, and just prior to loading. It seems to me that in the long run the most likely candidate for a standard is one that represents business entities (e.g., data marts), worked out with end-user input, rather than something tied to either source or target schemas. Source and target implementations need to be able to float relative to the standard.

More generally though, now that we have an example of what I am considering, is there a document out there, like a FAQ, in which this topic (among others) is discussed in the context of "best practices?" In other words can I take this offline and do some reading first?

Thanks,
Doug