datawarehousing

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Saama
Premium Member
Posts: 83
Joined: Wed Nov 22, 2006 6:42 pm
Location: Pune

datawarehousing

Post by Saama »

Hi Gurus,

What are the best practices in data warehousing? What is the maximum amount of data that a source stage can handle in the parallel version?

We break the data into chunks and handle each chunk separately.

What is the maximum number of columns or rows that a stage can handle?

Please help me analyze this.

Cheers,
saama
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

For implementing data warehouses, wow, that's a huge conversation. I suggest you read the Ralph Kimball or Bill Inmon books.
A source stage can handle any amount of data; there is no limitation.
If you divide the data into chunks, that's fine. That way you can extract the chunks in parallel by utilizing multiple-instance jobs.
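
To make the multiple-instance idea concrete, here is a minimal Python sketch (not DataStage code) of the usual modulus-based chunking: each job instance is handed an instance number and extracts only its own slice of the source. The table and column names are made up for illustration.

    NUM_INSTANCES = 4  # how many parallel job instances you plan to run

    def extract_query(instance_id: int) -> str:
        """Build the SQL one job instance would run for its own chunk."""
        return (
            "SELECT * FROM source_table "
            f"WHERE MOD(primary_key, {NUM_INSTANCES}) = {instance_id}"
        )

    # Each instance runs the same job with a different instance number.
    for instance in range(NUM_INSTANCES):
        print(f"Instance {instance}: {extract_query(instance)}")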
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I would look at some key tables and count the rows. Try to estimate the average row length, take the largest table's row length times its number of rows, and make sure you have several times that amount of disk, maybe 10 or 20 times. A WAG (wild-ass guess) is maybe the best you can do unless you want to hire someone who can do a better job of estimating, because you are going to need to do a lot more work to figure out which tables you are sourcing and how many times you plan on landing the data.

I am sure Ray's analysis is a lot more complicated than my simple approach, but it should give you an idea of where to start. Most of this is what we call common sense, or maybe experience: something there is no short answer for.
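
As a worked example of that estimate, here is a small Python sketch; the row count and average row length below are invented numbers, purely to show the arithmetic:

    def disk_estimate_bytes(row_count: int, avg_row_len: int, factor: int = 10) -> int:
        """Rough disk need: rows x average row length x a generous safety factor."""
        return row_count * avg_row_len * factor

    # Example: 50 million rows at roughly 200 bytes per row, with 10x headroom.
    needed = disk_estimate_bytes(50_000_000, 200, factor=10)
    print(f"Plan for roughly {needed / 1024**3:.0f} GB of working disk")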
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Data are never "kept" within DataStage, so there's no limit on how much data a stage can handle. Data just flow through DataStage.

A well-considered design will include one or two staging areas, for the purposes of restart/recovery, but theoretically (and as advised by DataStage sales folks over the years) the key to performance is never to touch your data down to disk - just keep it streaming through DataStage.

It works, too, if nothing goes wrong and you continue to have access to source and target systems, and can do all this within the allocated time windows. But a little caution is a Good Thing.
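
To illustrate that trade-off with a toy Python sketch (nothing to do with DataStage internals; the file name and stand-in functions are placeholders): the streaming path never touches disk, while the staged path lands the rows once so a failed load can be rerun from the staging copy.

    import os

    STAGING_FILE = "staged_rows.txt"  # hypothetical staging area

    def source():
        """Stand-in for the extract: yields rows one at a time (streaming)."""
        for i in range(5):
            yield f"row-{i}"

    def load(rows):
        """Stand-in for the target load."""
        for row in rows:
            print("loaded", row)

    def run(stage_to_disk: bool) -> None:
        if stage_to_disk:
            # Restart/recovery path: touch the data down once, then load from disk.
            if not os.path.exists(STAGING_FILE):
                with open(STAGING_FILE, "w") as f:
                    f.writelines(r + "\n" for r in source())
            with open(STAGING_FILE) as f:
                load(line.rstrip("\n") for line in f)
        else:
            # Pure streaming path: nothing lands, so there is nothing to restart from.
            load(source())

    run(stage_to_disk=True)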
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

"we break the data in chucks and handle it separetly. "
How is this been handled? Row wise split up or columns wise split up? Though some of your recent post seems to be "Interview Question", you could explain on how do you implement or plan to implement this. At times, it will lead to inefficient unless otherwise its been handled carefully.
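
To make the distinction concrete, here is a toy Python sketch (nothing DataStage-specific; the sample data is invented) of the two kinds of split:

    # Sample rows: a key plus two attribute columns.
    rows = [
        {"id": 1, "name": "a", "amount": 10},
        {"id": 2, "name": "b", "amount": 20},
        {"id": 3, "name": "c", "amount": 30},
        {"id": 4, "name": "d", "amount": 40},
    ]

    # Row-wise split: a modulus on the key spreads whole rows across chunks;
    # every chunk keeps all the columns.
    row_chunks = [[r for r in rows if r["id"] % 2 == i] for i in range(2)]

    # Column-wise split: each chunk keeps the key plus some of the columns,
    # so the pieces must be rejoined on "id" later.
    col_chunks = [
        [{"id": r["id"], "name": r["name"]} for r in rows],
        [{"id": r["id"], "amount": r["amount"]} for r in rows],
    ]

    print("row-wise:", row_chunks)
    print("column-wise:", col_chunks)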
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'