Schema files and RCP via Hamlet: To use or not to use?

FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Schema files and RCP via Hamlet: To use or not to use?

Post by FranklinE »

I'm building a new application that is, on the surface, relatively simple. I have a list of data sources that are mostly mainframe datasets, with a shorter list of copybooks involved, meaning that some record formats apply to more than one dataset.

My goal is to parameterize the application as much as is practical.

I'm looking for general experience and/or advice on using schema files, and the difficulties they introduce in exchange for not having to explicitly define columns for every DS table definition. I want to avoid RCP as much as possible, because from the production support view it makes finding and diagnosing problems more difficult unless a DS person is brought in for every problem. We have a centralized support group that has no plans (for now) to have their own DS technical resource.

The basic job design I'm using as a template is also simple: read the source (using FTP Enterprise as much as possible), run it through a filter if necessary to remove data not needed for that feed, use a Transformer to map to a common format for the load on the other side, and write to a sequential file (or FTP to a staging area) in the format the load process requires. The final destination is an Oracle database created for a vendor product that the internal clients will use. The vendor is providing a proprietary load utility, hence the common load format.

I know how to use RCP and schema files. I'm looking for general advice that will help me decide at the design level if I should use schema files.
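(For readers less familiar with them, a schema file for one of these feeds might look something like the sketch below. The column names and types are invented, and a real mainframe layout would also need record-level properties such as EBCDIC, binary and packed-decimal settings to match the copybook. Pointing a stage's schema file property at it, with RCP enabled on the output link, lets the job pick up the layout at runtime instead of hard-coding columns in every stage.)

    record
    (
      ACCT_ID: string[10];
      TXN_DATE: string[8];
      TXN_AMT: decimal[11,2];
      STATUS_CD: nullable string[1];
    )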

Thanks,
Franklin
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I'll start at the end: you mentioned that production support with RCP is more difficult than without it. I have found that production staff never use a Designer client to look at DataStage jobs in order to determine what went wrong. I'm not equivocating with terms like "rarely" or "infrequently" - I stick by "never", and I've been doing DataStage for a fair number of years. Thus, from a production support point of view, the use of RCP is a moot point: the staff will have a playbook with some information on error messages and restart methods, but if an error occurs within a job that isn't due to something obvious outside of DataStage, it will always be passed on to second-level support, which, as often as not, is the development team.

The current project has numerous flat files coming from various host systems. The COBOL copybooks have all been put into database tables along with additional metadata (which columns are to be dropped, reformatted, or are nullable or keys), and this is processed dynamically at runtime to build schemas, so the load process for hundreds of files runs through a single DataStage job. This makes changes relatively simple to make and doesn't clog the repository with one job per file (though it does put a lot of instance names of the load job in the Director log).
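For what it's worth, a rough sketch of that kind of generator is below. The metadata table and column names (copybook_columns, drop_flag and so on) are hypothetical stand-ins for however a site actually stores its copybook information; the point is only that the schema files driving the single RCP-enabled load job are built from the metadata at runtime:

    # Rough sketch only: builds one DataStage schema file per source dataset
    # from a hypothetical copybook metadata table.
    import sqlite3  # stand-in for whatever database holds the copybook metadata

    def build_schema(conn, dataset_name, schema_path):
        """Write a schema file for one dataset, skipping dropped columns."""
        rows = conn.execute(
            "SELECT column_name, ds_type, nullable FROM copybook_columns "
            "WHERE dataset = ? AND drop_flag = 0 ORDER BY seq_no",
            (dataset_name,),
        ).fetchall()
        with open(schema_path, "w") as f:
            f.write("record\n(\n")
            for name, ds_type, nullable in rows:
                prefix = "nullable " if nullable else ""
                f.write(f"  {name}: {prefix}{ds_type};\n")
            f.write(")\n")

    # Each generated file is then passed as a job parameter to one
    # multi-instance load job, one invocation per source dataset.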

This approach works because the flat file data is not transformed on its way to the staging area. Once you need to manipulate many columns, RCP isn't much help, since columns have to be declared explicitly in order to change them.

All in all, it is a matter of taste and preference. A clean RCP implementation is a lot of work up front and brings its share of development headaches; not using RCP can get prototypes up and running much faster but can make modifications more time-consuming. Imagine a big job with 50 stages where CUST.ID has to become a nullable integer instead of the original string format: with RCP you might only need to change your source schema or source stage, instead of having to edit each and every stage where CUST.ID is used.
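To make that concrete: with a schema-file approach, the CUST.ID change might be a one-line edit in the source schema (written here as CUST_ID, with illustrative types), and RCP carries the new definition through the downstream stages.

    Original definition in the source schema:
        CUST_ID: string[9];
    Edited definition, the only change needed:
        CUST_ID: nullable int32;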

The development team - if it's more than just yourself - needs to understand RCP as well, and that takes time and energy. Anyone joining the development or support team will need to learn the concepts and functionality behind RCP, which generally means a longer learning or training period before they are productive.
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

Your points are well taken. I'll just clarify that our host production environment is mostly several million lines of COBOL/JCL across thousands of jobs that run cyclically. When we established centralized production support (I was there for the "bad" old days, when the development team provided all production support), it quite reasonably emphasized the appropriate skills. Dedicated Unix, MVS (now z/OS) and database (DB2 and Oracle) support resources came later.

My design will accommodate the fact that 10% or less of the issues will be DataStage-specific. I must also note that our user authorization scheme prevents support center users from having access to Director. Part of my design will be extracting Director log messages to log files to get around that (sketched below). We do pretty well, for all that we tie one hand behind our backs at times. :wink:
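For anyone in the same boat, the sketch below is one way that extraction could work, assuming the dsjob command-line client is available on the engine tier; the project, job and file names are placeholders:

    # Sketch only: append a Director log summary for one job run to a
    # flat file that the support center can read without Director access.
    import subprocess

    def dump_job_log(project, job, out_path):
        result = subprocess.run(
            ["dsjob", "-logsum", project, job],
            capture_output=True, text=True, check=False,
        )
        with open(out_path, "a") as f:
            f.write(f"=== {project}/{job} ===\n")
            f.write(result.stdout)
            if result.returncode != 0:
                f.write(result.stderr)

    dump_job_log("MYPROJ", "LoadCustomerFeed", "/var/support/ds_logs/latest.log")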

Anyway, the application I'm building now might be small enough to make dynamic formatting and RCP practical. I'm just a bit gun-shy, having spent too many hours tracing production executables back to Designer code just to figure out what some error messages really meant.