DS: all in one project, or separate projects?

Post questions here relating to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

victorbos
Participant
Posts: 24
Joined: Tue Jul 15, 2003 2:05 am

DS: all in one project, or separate projects?

Post by victorbos »

Hi all,

I am an experienced DWH developer/architect (in the business since 1991), but new to DataStage (only basic training in Server and Parallel Extender).
I am now working on the startup of a large data warehouse: it will be the central DWH for an insurance company with 5,500 employees. At the moment we are facing many questions whose answers will determine the way we work for the coming years.
The first thing to ask the experts here is:
How do we structure the jobs, table definitions, etc. in such a way that each data warehouse subsystem is separately maintainable and customizable? Do we put everything together into one project and maintain a neat hierarchy inside that project, or do we create separate projects for each subsystem so that the separation between them is maximal?

tia for all help,

Victor

---
Victor Bos - Datastage developer, the Netherlands
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I believe that the determination is up to you. Either method will work.

You can go with a minimum of three projects for the entire endeavour - development, test and production (well, four really, as Version Control should be high on your list of Things To Use and it requires a separate project [:D]). Within the dev project, you can use Categories (for jobs) or Folders (for other components) and an intelligent naming convention to segregate your various 'subsystems', and they will flow upstream.
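
For illustration only, with made-up subsystem and job names, a single development project organised that way might look like:

DWH_DEV
|--> Claims
...........|-> ClaimsExtractPolicy
...........|-> ClaimsLoadPolicyFact
|--> Finance
...........|-> FinExtractLedger
...........|-> FinLoadLedgerFact

The subsystem prefix on the job names means the grouping survives even in listings or tools that don't show categories.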

Or you can have three projects per subsystem. You might consider that for security reasons in the future... I believe the next version of DataStage will allow access restrictions per project. It also helps manageability, depending on the number of jobs you think you might end up with. We did this with our two separate 'initiatives', as DataStage did not support categories when we first set it up, but they could easily be folded into a single 'set' of projects nowadays.

Others will chime in here, but I don't believe there are many items that are only customizable at the project level, so I'm not sure you'd need to worry about segregating things by project for that reason, other than possibly security as mentioned above.

Hope this helps,

-craig
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Having a single project for each environment makes it easier to share routines and table definitions. You may find that jobs from different subsystems are dependent on each other and that you need a sequence job that spans both.

The only problem I've had with a really big project is the version control tool: the one I used didn't support folders, so it just had a long list of jobs and routines.

You also need to plan a hierarchy for the input and output files that are stored in the file system, such as sequential files and hash files. If you have development and testing on the same machine, or multiple copies of development, you will need a separate set of folders for each project.
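
As an illustration only - the base path and folder names here are made up - a separate set of folders per project might look like:

/data/ds/DWH_DEV/seq (sequential files written and read by the jobs)
/data/ds/DWH_DEV/hash (hash files used for lookups)
/data/ds/DWH_DEV/archive (processed source files kept for reruns)
/data/ds/DWH_TEST/seq
/data/ds/DWH_TEST/hash
/data/ds/DWH_TEST/archive

If each job takes the base directory as a job parameter, the same job can run unchanged against either set of folders.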

Vincent McBurney
Data Integration Services
www.intramatix.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I prefer to follow a lifecycle framework when deploying iterations to the data warehouse. As a warehouse matures, you find out where you made short-sighted decisions in the initial framework and methodology. You're asking the right questions right off the bat. I look at the long-term effects of building 500+ jobs to support the load of an enterprise data warehouse. You're going to have iterative releases where more subject areas and tables are deployed, as well as bug fixes and corrections to existing tables and processes. You're going to want to set up job design standards and shared/common libraries of functions, routines, etc. You're going to need to do things like set up a consistent parameter framework where all of your jobs are location/host independent.
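
To sketch what I mean by a parameter framework - the file names, parameter names and job name below are made up, and dsjob's switches can differ between releases, so check the documentation - every job takes its paths and connection details as parameters, and a small wrapper feeds in the values for the current host and release:

# edw_1_0.env - one settings file per release and host (names made up)
BASE_DIR=/var/opt/EDW/edw_1_0
DB_DSN=EDWDEV
DB_USER=etl_batch

# run_job.sh - wrapper that turns those settings into job parameters
. /var/opt/EDW/edw_1_0/edw_1_0.env
dsjob -run -jobstatus -param BaseDir=$BASE_DIR -param DBDsn=$DB_DSN -param DBUser=$DB_USER EDW_1_0 LoadCustomerDim

Moving the whole application to a different host or release then means changing the settings file rather than the jobs.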

I recommend that you consider the merits of a release control system, whereby Iteration #1 of the warehouse is assigned a release name and you create the project under that release name - for example, EDW_1_0. As you build and work on the next release, EDW_1_1, in a project of the same name, EDW_1_0 remains available for bug fixes to the current production release. EDW_1_1 would be a full code set of EDW_1_0, plus the enhancements and revisions. Since you're going to have to coordinate database changes with the implementation of the code changes, this approach allows you to create the EDW_1_1 project in the production environment ahead of time, deploy a completed release into it, and have it compiled and in place awaiting your implementation day. On implementation day, you have the database changes enacted for that release and update whatever schedule pointers are necessary to reflect that EDW_1_1 is now the current released project.

You have a challenge as the ETL Architect/Warehouse Architect in coordinating code changes with database changes. You didn't mention anything along the lines of the presentation layer (Business Blobjects, MicroTragedy, etc.), but if you are dealing with a mature warehouse you now have to toss in the OLAP reporting changes that go along with the database changes. It's a fully integrated approach and you have to consider the full lifecycle, crossing from back room to front office. By versioning your full EDW application at the project level, you gain the ability to work on maintenance issues in the current release (EDW_1_0) while developing the third-generation release (EDW_1_2), while the second generation (EDW_1_1) is undergoing user acceptance testing. This follows the iterative approach espoused by Inmon and the lifecycle of Kimball.

As for downstream data marts, they too should be developed in versioned projects, as each datamart could be running on its own independent lifecycle. You may have a datamart enhancement that is running on a delayed reaction because the EDW release changes aren't in place. Trying to bring a datamart enhancement online at the same time as an EDW enhancement can be VERY difficult. If you have a high volume issue with a backfill, this could take the datamart offline during the backfill process, which may be undesirable. If you're running independent teams working on the EDW and the marts, this coordination effort is like the Russians and the US working on the international space station: sometimes things go wrong in the translation and the parts don't mate perfectly. (Bad example, the US does all of the work, then loans the money to the Russians, who then turn around and pay us for the work they were supposed to do. Wait a minute, that might be a good example.)

So lastly, and with my last gasp of breath: you're going to have a common/shared library of routines and functions that you're going to want to keep synchronized across all of the projects. You're going to have shell scripts, SQL scripts, control files - lots of pieces and parts outside of the ETL tool. You're going to want a good version control product that can manage everything, from your data model to your shell scripts to your DS objects; I recommend something robust like PVCS. From that standpoint, you can tag and manage your common/shared objects across the board and package a complete release.

On that note, you're going to want to set up release-versioned directory structures within your system environment so that each version can have a discrete working environment. Your scripts, control files, etc. have to follow the release, as well as DataStage's working directories for sequential files, hash files, and other sundry pieces. For example, /var/opt/EDW/edw_1_0 could be the base directory for assorted subdirectories containing runtime files, as well as the script, control file, and log folders for your release. This approach allows you to seamlessly migrate your application across the multiple hosts required for development, system test, user acceptance/QA testing, and production.
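
Something like this, say - the subdirectory names are just what I would pick, not a standard:

/var/opt/EDW/edw_1_0/bin (shell scripts for this release)
/var/opt/EDW/edw_1_0/sql (SQL and DDL scripts)
/var/opt/EDW/edw_1_0/ctl (control files)
/var/opt/EDW/edw_1_0/seq (sequential files written by the jobs)
/var/opt/EDW/edw_1_0/hash (hash files)
/var/opt/EDW/edw_1_0/log (run logs)
/var/opt/EDW/edw_1_1/... (the next release gets its own parallel tree)

Because the release name in the path matches the project name, a job only needs a single base-directory parameter to find everything belonging to its release on whichever host it is running.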

Good luck!



Kenneth Bland
victorbos
Participant
Posts: 24
Joined: Tue Jul 15, 2003 2:05 am

Post by victorbos »

Thanks a lot, everybody, for the wise words.
Especially Kenneth for his free copy of "Data Warehousing for Dummies" [;)]. It really helps a lot! I am going to be a frequent visitor to this forum.

However, I am still in doubt about the project setup.........

best greets,

Victor


---
Victor Bos - Datastage developer, the Netherlands
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Based on the KISS principle, I would advise that you use separate projects for each subject area. You may also require a project for things such as conformed dimensions that are common to more than one subject area.
And, where I have specified "project" here, it really means three projects, one for development, one for testing and QA, and a production project.
The real secret is to plan it out first, then document the plan so that anyone joining your team can know the strategy quickly.

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
theheat
Participant
Posts: 1
Joined: Mon Jun 30, 2003 2:09 pm

Post by theheat »

Hi all, I'm in a very similar position to Victor: I have the industry experience but lack the tool experience. [:I]

We have three layers in our version 1 warehouse environment, which is being structured to support multiple subject areas but initially supports only a single Financial Data Mart subject area (SA).

Layers:
1. Persistent Storage Area (PSA) Layer
- Where we are simply sourcing data from transactional production systems into replicated tables within the warehouse environment.

2. EDW Layer (initially conformed dimensions, moving to true EDM)
- Where we have our "master tables", i.e. where the majority of the ETL work is required to produce conformed, normalized data stores.

3. Dependent Data Mart Layer
- Subject specific star schemas sourced from EDW layer.

We have created DS jobs that load each environment separately and successfully, i.e.
Source->PSA->EDW->DDM. Fantastic. Now I want to apply best practices with the future
use of the various layers in mind. Is it best to:

A) Have DEV, TEST, & PROD projects for each layer:
- e.g. PSA_DEV, PSA_TEST, PSA_PROD, EDW_DEV, EDW_TEST, etc. where:

PSA_DEV
|--> Finance
...........|-> Job 1
...........|-> Job 2
|--> Subject Area 2
...........|-> Job 4
...........|-> Job 5
|--> Subject Area 3
...........|-> Job 6
...........|-> Job 7

OR

B) Have three projects and sub-categorize each layer within:

DEV
|--> PSA
....|--> Financial
...........|-> Job 1
...........|-> Job 2
....|--> Subject Area 2
...........|-> Job 3
...........|-> Job 4
....|--> Subject Area 3
|-->EDW
....|--> Financial
...........|-> Job 5
...........|-> Job 6
....|--> Subject Area 2
...........|-> Job 7
...........|-> Job 8
....|--> Subject Area 3
|-->DDM
....|--> Financial
...........|-> Job 9
...........|-> Job 10
....|--> Subject Area 2
...........|-> Job 11
...........|-> Job 12
....|--> Subject Area 3

And then additional projects with the same structure for TEST and PROD.

Option A) seems a tighter design, as there is better separation of your layers, but I see a problem with using the DS sequencer: correct me if I'm wrong, but you cannot sequence jobs across projects? For example, I could not have a single sequence that controls the loading of my PSA, then EDW, then DDM layers sequentially. To do that I would instead have to use a scheduler that runs each job separately. Not ideal.
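
For example, I'm assuming the scheduler entry would end up being a small script like the one below, calling the dsjob command-line client once per project (the sequence job names are made up, and the exact dsjob options and exit codes vary by release, so treat this as a rough sketch only):

#!/bin/sh
# Load the three layers in order, one dsjob call per project.
# -jobstatus makes dsjob wait for the job and return an exit code
# derived from the job's finishing status; check which values mean
# success on your release before acting on $?.
dsjob -run -jobstatus PSA_PROD SeqLoadPSA
dsjob -run -jobstatus EDW_PROD SeqLoadEDW
dsjob -run -jobstatus DDM_PROD SeqLoadDDM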

Option B) is nice for a single "project" but does not provide separation of the DW layers/environments.

I'm a big proponent of the KISS methodology, but that doesn't always work when you are dealing with a large environment with multiple developers, while trying to enforce reusability and consistency and adhere to enterprise standards.

Any further input on best practices here would be greatly appreciated.

Regards,

Glyn.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Glyn

I've seen it done both ways. I think it depends on the number of jobs you expect to have. The number of jobs roughly corresponds to the number of tables in each target. There is a trade-off between the benefits gained by having fewer projects and the complexity of MetaStage and Reporting Assistant; these tools can extract the ETL business rules and allow you to report against them. I would try to keep the number of jobs under 250 in each project. If it gets over 1,000 you see some performance loss when browsing through the jobs, and DataStage itself seems to take longer to do things like pull a job up. Some platforms have much less of an issue with this.

If you can separate your jobs into projects that never overlap, then do it. If there is some overlap in functionality, then you cannot easily run jobs in two separate projects. Reusability is not an issue: jobs usually cannot be reused, while routines are easily copied from one project to another. Routines are seldom changed; either they work or they do not. Replicating metadata is not a problem either. It does not take long to re-import table definitions, or to export them and import them into another project.
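
For example, selected table definitions can be exported to a .dsx file from the Manager and imported into the other project. A whole project can also be copied from a client command prompt with something like the two lines below; the host, user and path are made up and the switches are from memory, so treat them as an assumption and check the dscmdexport/dscmdimport documentation for your release:

dscmdexport /H=dshost /U=dsadm /P=secret SRC_PROJECT C:\temp\src_project.dsx
dscmdimport /H=dshost /U=dsadm /P=secret TGT_PROJECT C:\temp\src_project.dsx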

If you do not separate, then you may have issues isolating sensitive data. Financial data may be sensitive and need specific developers working on it.

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com