datawarehousing

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Saama
Premium Member
Posts: 83
Joined: Wed Nov 22, 2006 6:42 pm
Location: Pune

datawarehousing

Post by Saama »

Hi Gurus,

What are the best practices in data warehousing? What is the maximum amount of data that a source stage can handle in the parallel version?

We break the data into chunks and handle each chunk separately.

What is the maximum number of columns or rows that a stage can handle?

Please help me analyze this.

Cheers,
saama
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

For implementing data warehouses, wow, that's a huge conversation. I suggest you read the Ralph Kimball or Bill Inmon books.
A source stage can handle any amount of data; there is no limitation.
If you divide the data into chunks, that's fine. That way you can extract the chunks in parallel by utilizing multiple-instance jobs.
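
To make the multiple-instance idea concrete, here is a minimal Python sketch (not DataStage code) of the usual modulus-based chunking: each job instance is handed an instance number and extracts only its own slice of the source. The table and column names are made up for illustration.

    NUM_INSTANCES = 4  # how many parallel job instances you plan to run

    def extract_query(instance_id: int) -> str:
        """Build the SQL one job instance would run for its own chunk."""
        return (
            "SELECT * FROM source_table "
            f"WHERE MOD(primary_key, {NUM_INSTANCES}) = {instance_id}"
        )

    # Each instance runs the same job with a different instance number.
    for instance in range(NUM_INSTANCES):
        print(f"Instance {instance}: {extract_query(instance)}")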
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I would look at some key tables and count the rows. Try to estimate the average row length, take the largest table's row length times its number of rows, and make sure you have several times that amount of disk, maybe 10 or 20 times. A WAG (wild-ass guess) is maybe the best you can do unless you want to hire someone who can do a better job of estimating, because you are going to need to do a lot more work to figure out which tables you are sourcing and how many times you plan on landing the data.

I am sure Ray's analysis is a lot more complicated than my simple approach, but it should give you an idea of where to start. Most of this is what we call common sense, or maybe experience: something there is no short answer for.
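
As a worked example of that estimate, here is a small Python sketch; the row count and average row length below are invented numbers, purely to show the arithmetic:

    def disk_estimate_bytes(row_count: int, avg_row_len: int, factor: int = 10) -> int:
        """Rough disk need: rows x average row length x a generous safety factor."""
        return row_count * avg_row_len * factor

    # Example: 50 million rows at roughly 200 bytes per row, with 10x headroom.
    needed = disk_estimate_bytes(50_000_000, 200, factor=10)
    print(f"Plan for roughly {needed / 1024**3:.0f} GB of working disk")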
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Data are never "kept" within DataStage, so there's no limit on how much data a stage can handle. Data just flow through DataStage.

A well-considered design will include one or two staging areas, for the purposes of restart/recovery, but theoretically (and as advised by DataStage sales folks over the years) the key to performance is never to touch your data down to disk - just keep it streaming through DataStage.

It works, too, if nothing goes wrong and you continue to have access to source and target systems, and can do all this within the allocated time windows. But a little caution is a Good Thing.
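
To illustrate that trade-off with a toy Python sketch (nothing to do with DataStage internals; the file name and stand-in functions are placeholders): the streaming path never touches disk, while the staged path lands the rows once so a failed load can be rerun from the staging copy.

    import os

    STAGING_FILE = "staged_rows.txt"  # hypothetical staging area

    def source():
        """Stand-in for the extract: yields rows one at a time (streaming)."""
        for i in range(5):
            yield f"row-{i}"

    def load(rows):
        """Stand-in for the target load."""
        for row in rows:
            print("loaded", row)

    def run(stage_to_disk: bool) -> None:
        if stage_to_disk:
            # Restart/recovery path: touch the data down once, then load from disk.
            if not os.path.exists(STAGING_FILE):
                with open(STAGING_FILE, "w") as f:
                    f.writelines(r + "\n" for r in source())
            with open(STAGING_FILE) as f:
                load(line.rstrip("\n") for line in f)
        else:
            # Pure streaming path: nothing lands, so there is nothing to restart from.
            load(source())

    run(stage_to_disk=True)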
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

"we break the data in chucks and handle it separetly. "
How is this been handled? Row wise split up or columns wise split up? Though some of your recent post seems to be "Interview Question", you could explain on how do you implement or plan to implement this. At times, it will lead to inefficient unless otherwise its been handled carefully.
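
To make the distinction concrete, here is a toy Python sketch (nothing DataStage-specific; the sample data is invented) of the two kinds of split:

    # Sample rows: a key plus two attribute columns.
    rows = [
        {"id": 1, "name": "a", "amount": 10},
        {"id": 2, "name": "b", "amount": 20},
        {"id": 3, "name": "c", "amount": 30},
        {"id": 4, "name": "d", "amount": 40},
    ]

    # Row-wise split: a modulus on the key spreads whole rows across chunks;
    # every chunk keeps all the columns.
    row_chunks = [[r for r in rows if r["id"] % 2 == i] for i in range(2)]

    # Column-wise split: each chunk keeps the key plus some of the columns,
    # so the pieces must be rejoined on "id" later.
    col_chunks = [
        [{"id": r["id"], "name": r["name"]} for r in rows],
        [{"id": r["id"], "amount": r["amount"]} for r in rows],
    ]

    print("row-wise:", row_chunks)
    print("column-wise:", col_chunks)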
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'