specs and best practices

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


Jay
Participant
Posts: 105
Joined: Tue Nov 11, 2003 8:28 pm

specs and best practices

Post by Jay »

Hi all,

Is there any kind of documentation that specifies DataStage standards and best practices?

I searched DSxchange, but a post said there are none whatsoever. Is this true?

Thanks in advance
Jay
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Please narrow your request down. DataStage is like a Swiss Army knife. There are multiple facets to the product, from realtime to Server to Parallel to Quality to MetaStage to Quality Manager, etc.

In addition, how you intend to use the product is also important to state. There are batch warehouses, trickle-fed/message-fed warehouses, realtime warehouses, data migration efforts, data bridging efforts, and data publishing efforts.

There are so many uses and tools; please state how you intend to use which products. Then we may be able to provide you with some publicly available documentation (not written by Ascential, with no copyright infringements or screenshots), or at a minimum point you to posts that articulate the concepts you require.

Keep in mind also that your seemingly short question is vast in its reach, because tool practices inevitably lead back to things such as data modeling choices, hardware considerations, programming styles, software lifecycle methodologies, software migration practices, etc.

Prepare for a tidal wave of opinion, but please state your situation clearly and completely.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

And prepare to go read some recommended books, visit some websites, attend user conferences (TDWI, DCI, AscentialWorld), and attend training...
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Jay
Participant
Posts: 105
Joined: Tue Nov 11, 2003 8:28 pm

specs and best practices

Post by Jay »

Hi kcbland,

Our project is starting from scratch. We want to build our data warehouse using Bill Inmon's approach. The data is in the terabyte range, so we'll have to use PX.

Before we start, we want to follow some standards. Can I get some info on standards and best practices?

Thanks in advance,
Jay
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Okay, first of all, terabyte warehouses do not require PX. You could write your ETL in Perl if you were so inclined. :lol: Second, you're noted as being Windows, Server, release 7, and therefore cannot run PX. :?: Third, I'm not aware of Bill Inmon (or Claudia Imhoff, Joyce Bischoff, Sid Adelman, Douglas Hackney, Ralph Kimball, etc.) having an ETL approach. They've written a lot of theories and practices and pieces and parts around the nebulous term "data warehouse". You have to specify the components you want so we can give suggestions as to how best to write ETL for them.

So, what do you mean by Inmon's approach? Are you building every object on the CIF, the GIF, or the HIF? (I made up HIF; I'd call it the healthcare information factory.) Anyway, it's late and I'm in a foul mood. So, are you talking about building an ODS? How about an atomic-level EDW? Are you building star marts? Any real-time aspects? What about data quality? Are you interested in building a persistent staging database? What kind of hardware are you on? (I hope you're not building an ETL solution on Wintel for a terabyte+ warehouse; I like a challenge, but that's pushing it.) 8)

There are plenty of reasons this needs answering. Let's go all out and build the Inmon/Kimball Intergalactic DeathStar of Information Warehouses. You've got legacy file feeds, PeopleSoft with 15K tables, Siebel (proprietary, and no licensing for direct access to the DB), some scattered SQL-Server databases, and the occasional malformed Excel spreadsheet. You elect to build a Persistent Staging Database, foregoing the 3NF model that Karolyn Douglas recommends and instead building a flattened structure highly mirroring the source data. You're going to need some heavy-duty data scrubbing and validation going on here. You're also going to have to take a pretty forgiving approach to handling the source data because of its quality. There are a bunch of experts here who can give advice and recommendations on the multitude of aspects of dealing with scattered source data. Please narrow our focus.
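
To make the staging idea a little more concrete, here's a rough sketch of landing one hypothetical flat-file feed into a flattened structure that mirrors the source, with light scrubbing and validation. It's plain Python rather than DataStage code, and every field name is invented purely for illustration.

import csv
from datetime import datetime

def scrub(value):
    """Trim whitespace and normalise empty strings to None."""
    value = (value or "").strip()
    return value or None

def land_feed(path):
    """Land a source file into a flattened staging structure that mirrors it."""
    staged, rejects = [], []
    with open(path, newline="") as fh:
        # Header is line 1, so data starts at line 2.
        for lineno, raw in enumerate(csv.DictReader(fh), start=2):
            row = {name: scrub(val) for name, val in raw.items()}
            # Minimal validation: reject rows missing the natural key
            # (account_id is a made-up example key).
            if not row.get("account_id"):
                rejects.append({"line": lineno, "reason": "missing account_id", "raw": raw})
                continue
            # Audit columns sit alongside the mirrored source columns.
            row["src_file"] = path
            row["load_ts"] = datetime.utcnow().isoformat()
            staged.append(row)
    return staged, rejects

Each real feed (PeopleSoft, Siebel extracts, Excel dumps) would get its own mirror structure and its own validation rules; the point is only that the staging layout stays close to the source and the rejects are kept for inspection.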

Next, let's say you don't need an ODS, so you'll go straight for the EDW. It's an atomic-level design, so you will never update any row for any reason whatsoever. This means you'll have to design some sort of sandbox into your ETL framework, because every row has to be interrogated against the most current variant of that row in the table, and if any column of import is different, then you insert the next variant. If you have multiple occurrences of a source row in the datastream, the sandbox approach becomes helpful. But then, PX operates fastest without a sandbox, so I guess you have to describe your data.
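
As a rough illustration of that interrogate-and-insert pattern (again plain Python with invented table and column names, not how you'd actually code it in Server or PX): each incoming row is compared against the latest variant for its business key, a new variant is inserted only when one of the columns of import has changed, and the in-memory "sandbox" covers multiple occurrences of the same key within one datastream.

from datetime import datetime

# Columns whose changes warrant a new variant ("columns of import").
TRACKED_COLUMNS = ["customer_name", "address", "status"]

def latest_variant(edw_rows, business_key):
    """Return the most recent stored variant for a business key, or None."""
    candidates = [r for r in edw_rows if r["business_key"] == business_key]
    return max(candidates, key=lambda r: r["variant_seq"], default=None)

def apply_incoming(edw_rows, incoming_rows):
    """Interrogate each incoming row; insert a new variant only on change."""
    # Sandbox: track the latest variant seen so far per business key, so a
    # second occurrence of the same key in this datastream is compared
    # against what this run already produced, not just what was on disk.
    sandbox = {}
    for row in incoming_rows:
        key = row["business_key"]
        current = sandbox.get(key) or latest_variant(edw_rows, key)
        changed = current is None or any(
            row[c] != current[c] for c in TRACKED_COLUMNS
        )
        if changed:
            new_variant = dict(row)
            new_variant["variant_seq"] = (current["variant_seq"] + 1) if current else 1
            new_variant["load_ts"] = datetime.utcnow().isoformat()
            edw_rows.append(new_variant)   # insert only; never update in place
            sandbox[key] = new_variant
    return edw_rows

In a real job the comparison would be against the EDW table itself (or a lookup image of it) rather than a Python list, but the decision logic is the same.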

But then we get to talk about what your target database will be. That can affect some of the ETL strategies (most likely the loading aspect of the target, but also reference lookup generation). If the target for the EDW is Teradata or Redbrick, you will be working through their loaders, whereas Oracle, DB2, Informix, SQL-Server, and Sybase have native connectivity (OCI in Oracle's case) as well as bulk loaders. So, we need you to help us out and tell us what type of database you're using.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

But don't be disheartened by that. :D

As Ken says, he was in a foul mood at the time. (That said, everything in his post is correct.) Lay your hands on the Kimball book (search this Forum for details) to see how to design and implement a DW properly.

Here are a few best practices to get you under way.

1. Plan, plan, plan. Study (get trained), then plan some more.
2. Do something achievable for phase 1.
3. Set up a documentation store (e.g. a shared Windows folder) and document everything you do, straight away. It's one of those tasks that tends to be put aside for "later", which never arrives.
4. Become familiar with all the things the tools of choice can do, and what other tools are out there. For example, since you're beginning, this is the ideal time to incorporate MetaStage and Parameter Manager into the mix.

Stay tuned to the Forum, and please ask more questions; we love to help (even though it might not seem that way sometimes).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.