Project structure

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Project structure

Post by RodBarnes »

I'm at a point where I am considering a change in our project structures. I'm curious what others may be doing -- what "best practices" have proven helpful. Maybe someone has a document somewhere they'd be willing to share on the ins and outs, pros and cons of particular methods?

Initially, we took the approach of a single DS project containing all sequences and jobs, separated into categories/folders for each of the particular ETL streams. This worked for a while and may still be a good way to proceed, but I am finding promotion tiresome, as I have to pick the modified jobs out of a huge list.

It has got me thinking about altering the structure so that each stream is in its own project with the project named to include the category. We have a development, test, and production area which each have their own projects to which the jobs/seqs are promoted using DS Mgr or Version Control.

For example:

We have "Aardvark" as a project, and within the Jobs folder are categories -- "Alpha", "Beta", "Delta" -- within which are the individual jobs and sequences for each of those streams. When it comes time to promote to the test project using import/export in DS Manager, the combined list of jobs and sequences for all three streams gets very long, and it is annoying to search it for the particular two jobs in the Alpha stream that were modified and need to be promoted.

Using this example, I would alter this to have three projects -- "Aardvark-Alpha", "Aardvark-Beta", "Aardvark-Delta" -- each with its jobs and sequences directly below the Jobs folder. Then, when it is time to promote, I will only see the list of jobs/seqs in Aardvark-Alpha rather than those for all three streams.

Thoughts? Feel free to weigh in. All opinions welcome. :-)
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Sound idea. Seems to me a better import/export Wizard is in order. :cry: Something with "Tagged by developer" or "Searching for changed since" capabilities, etc.
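A "changed since" filter of the kind Ken wishes for could at least be approximated outside the tool. A minimal sketch, assuming you can obtain job names with their last-modified dates from some repository listing (the job list below is hypothetical example data; DataStage itself does not provide this wizard):

```python
from datetime import date

# Hypothetical listing of (job_name, last_modified) pairs, e.g. gathered
# from a repository report before building the export list.
jobs = [
    ("SrcAs400CustStg", date(2006, 1, 10)),
    ("Stg050MergCust",  date(2006, 3, 2)),
    ("Cust_030_L",      date(2006, 2, 20)),
]

def changed_since(jobs, cutoff):
    """Return job names modified on or after the cutoff date, sorted."""
    return sorted(name for name, modified in jobs if modified >= cutoff)

print(changed_since(jobs, date(2006, 2, 1)))
# -> ['Cust_030_L', 'Stg050MergCust']
```

Only the jobs touched since the cutoff would then need to be selected for export, instead of scanning the whole list by eye.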
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Major changes in the next release ("Hawk"). There will no longer be any fixed categories. You can store any mix of components in any folder.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

My personal preference is to save a copy of a job in \Save\OriginalCategory as JobNameYYYYMMDD. I prefer to have all jobs grouped by target table, so the category is the target table name, like CustomerDim for CUSTOMER_DIM. I like PascalCase names. Job names might start with Src, Tgt or Stg for Source, Target and Staging. The next part of the job name is either the source system, the target table, or just descriptive. You can use numbers in the name to suggest the order of processing, but you need to leave room for new jobs.

SrcAs400CustStg where AS400 is a DSN
Src010As400Cust where AS400 is a DSN
Src020WebCust where Web is a DSN
Stg050MergCust
Stg060Cust2Qs where customer records are cleansed in QualityStage

I think this naming convention also works.

Cust_010_E_AS400 where customer records are extracted from DSN AS400.
Cust_020_T where customer records are transformed.
Cust_030_L where customer records are loaded.
Cust_999_Seq_010 first customer sequence
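Conventions like these are only useful if they are followed, and they can be checked mechanically. A minimal sketch for the second (underscore) convention above, with a hypothetical regular expression -- adjust the pattern to whatever convention your site actually documents:

```python
import re

# Hypothetical pattern for the Subject_NNN_Phase[_Qualifier] convention:
# a PascalCase subject, a three-digit sequence number, a phase code
# (E = extract, T = transform, L = load, Seq = sequence), and an
# optional qualifier such as a DSN or a further sequence number.
JOB_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*_\d{3}_(E|T|L|Seq)(_[A-Za-z0-9]+)?$")

def is_valid(name):
    """True if the job name matches the documented convention."""
    return bool(JOB_NAME.match(name))

for name in ["Cust_010_E_AS400", "Cust_020_T", "Cust_999_Seq_010", "custload"]:
    print(name, is_valid(name))
```

A check like this could be run over an exported job list before promotion, so badly named jobs are caught in development rather than in production.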

Category names like

Production\CustomerDim
Production\InvoiceFact
Saved\CustomerDim
Saved\InvoiceFact

I do not like categories like:

Production\As400
Production\Web
Production\Sequence

I do not like jobs separated by source system. Who cares where it came from? It is more important where it is going. I also do not like sequences in different categories. It is very difficult to understand ETL when every job is in a different category. The categories isolate jobs by subject area, and the most important subject area to me is the target table. Treat aggregate tables as a new subject area. The categories are there to help walk you through the jobs, and the job names should do the same. Adding Src or Tgt or Load or Extract to a job name is very informative. Numbers are not as informative. The source system as part of a job name is also very informative. Who cares if the job is ODBC or OCI, or what stage types are in it?

Needing to add a column to a target table is usually the starting point for a change to the ETL. If you are new to a project, how do you find the job which writes to this table? Next, how do you find the job right before it, so you can add the column there as well? Repeat this process all the way back to the source table and column. Categories and job names are critical to speeding up this process. The difference between good implementations of DataStage and poor ones is in the details.
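That impact-analysis walk can be rough-cut with a plain text search over exported job definitions, since a .dsx export is just text. A minimal sketch, assuming your jobs have been exported one-per-file to some directory (the file layout and names here are hypothetical):

```python
import os

def jobs_referencing(table_name, export_dir):
    """Return .dsx export files whose text mentions the given table name.

    A crude where-used search: any job whose SQL or table properties
    name e.g. CUSTOMER_DIM will match. False positives are possible
    (comments, similarly named tables), so treat the result as a
    starting list to inspect, not a definitive lineage report.
    """
    hits = []
    for fname in os.listdir(export_dir):
        if not fname.lower().endswith(".dsx"):
            continue
        path = os.path.join(export_dir, fname)
        with open(path, errors="replace") as fh:
            if table_name in fh.read():
                hits.append(fname)
    return sorted(hits)
```

Running it once per table as you walk backwards from the target gives a quick answer to "which job writes to this table, and which job feeds it?" -- exactly the trail good categories and job names are supposed to make obvious.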

Are your job names useful?
Are your routine names useful?
Are your categories useful?
Are your database connections parameterized?
Are your path names for sequential files and hashed files parameterized?
Are your jobs documented?
Are your naming conventions documented?
Are your source jobs consistent?
Are your load jobs consistent, especially the Type 1 loads?
Are your Type 2 loads consistent?
Are your fact table loads consistent?
Are your aggregate jobs consistent where possible?
Have you maximized the transaction size and array size for each job?
(These vary by row width, so each job may be different.)
Do you have row counts?
How is your metadata managed?
Do you know which job uses which table?
Where used reports?

This is a good checklist for any ETL project. Grade yourself and/or your existing projects. How well is your project implemented? Ascential/IBM gives out awards to their favorite customers at the moment. I would love to see how many of these checklist items those customers have implemented well or poorly. Wouldn't you love to know too? Who is doing it right, or better?

I hope that as part of being a premium member you would have access to naming convention documents created by Ray or Ken. Get a white paper on each of these checklist items -- here is how Craig does it. How cool would that be? I would also like to see an example of each stage type used in a production job by some user -- not necessarily Ray or Arnd, just a good example. I would also love to see us put together documents to help you get certified. Vincent has already started to help in that area. Clear definitions of things like what dssearch is -- I had never heard of it until Vincent said it is a test question. Why would this command be a test question? Certification should separate those who can implement good DataStage solutions from those who cannot.
Mamu Kim
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Also, it would be cleaner to have a single project with multiple (controllable) categories rather than a separate project for each category. The logical separation into projects can be DEV, TEST and PRD.
There can be multiple projects for the same class, for different versions.

IMHT
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
newtier
Premium Member
Posts: 27
Joined: Mon Dec 13, 2004 5:50 pm
Location: St. Louis, MO

Post by newtier »

The answer for you really depends on factors at your company, such as how many jobs, how many different applications (business projects), and how you need to handle security.

Just keep in mind other factors, such as DataStage upgrades, patches, etc., where you must do things for "every project". The more projects, the longer these things can take and the bigger the pain in the rear.

For promotions, we use the Version Control tool, which provides many options, such as creating "batches".

(Note: with Hawk, the VC product will go away in the initial release. Hopefully they will tie it to an open API, or at least to the ClearCase product.)