dataset read problem

Posted: Wed Nov 08, 2006 1:49 am
by adasgupta123
Hi all,

I have been assigned to tune some parallel jobs.
I have observed that in some jobs, reading from a dataset takes a very long time.

Please advise me on how to increase the number of rows/sec when reading from a dataset.

thanks and regards

Avik Dasgupta

Posted: Wed Nov 08, 2006 3:34 am
by balajisr
How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.

Re: dataset read problem

Posted: Wed Nov 08, 2006 7:13 am
by tagnihotri
Also look in the configuration file and check the filesystems and mount points of the directories specified there for scratch and resource disks.
Then we can talk.
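
As a rough illustration of the check suggested above, the following Python sketch scans a parallel configuration file for resource disk and scratchdisk entries and reports any directory that more than one node points at. The sample configuration, node names, host name and paths are all made up for the example, and the parsing is a simplified assumption about the file layout, not a full parser. It only compares path strings; whether the directories sit on separate filesystems still has to be checked on the server itself (for example with df).

```python
import re
from collections import defaultdict

# Hypothetical configuration text: node names, host name and paths are invented.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/p1" {pools ""}
    resource scratchdisk "/scratch/p1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/p1" {pools ""}
    resource scratchdisk "/scratch/p2" {pools ""}
  }
}
'''

NODE_RE = re.compile(r'\s*node\s+"([^"]+)"')
RESOURCE_RE = re.compile(r'\s*resource\s+(disk|scratchdisk)\s+"([^"]+)"')


def resources_by_path(config_text):
    """Map (kind, path) to the list of node names that declare that path."""
    usage = defaultdict(list)
    current_node = None
    for line in config_text.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            current_node = node_match.group(1)
            continue
        res_match = RESOURCE_RE.match(line)
        if res_match and current_node is not None:
            usage[(res_match.group(1), res_match.group(2))].append(current_node)
    return dict(usage)


def shared_paths(config_text):
    """Keep only the paths declared by more than one node."""
    return {key: nodes
            for key, nodes in resources_by_path(config_text).items()
            if len(nodes) > 1}


for (kind, path), nodes in shared_paths(SAMPLE_CONFIG).items():
    print(f"{kind} {path} is declared by {', '.join(nodes)}")
```

In the made-up sample, both nodes declare the same resource disk directory, so that entry is reported while the two separate scratch directories are not.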

Posted: Wed Nov 08, 2006 1:08 pm
by ray.wurlod
What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?

Datastage read problem

Posted: Wed Nov 08, 2006 11:40 pm
by adasgupta123
ray.wurlod wrote:What are your parallel job tuning credentials? That is, why did they give you the task? How much experience do you have in this area?
I am very new to DataStage. I have developed some parallel jobs in the last two months.

Posted: Wed Nov 08, 2006 11:46 pm
by adasgupta123
balajisr wrote:How many rows do you have in each partition?
What is your partition count?
Post your job design. You need to give more details.
Hi,

The partition count is 8. I have checked the filesystem, and the memory usage is OK.

Posted: Thu Nov 09, 2006 12:00 am
by tagnihotri
It's not about usage! Try to find the mount points. Also, when you say 8 nodes, are all 8 nodes actually being used, and is the data in the dataset well distributed across them (check the source for this)?
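
One way to put a number on "well distributed" is sketched below in Python. It assumes you already have the row count of each partition from whatever tool you use to inspect the dataset; the counts shown are invented for illustration, and the function name is hypothetical. The idea is that a parallel read is only finished when the largest partition has been read, so the ratio of the largest partition to the ideal (even) share gives a rough feel for how much longer a skewed read can take.

```python
def partition_skew(row_counts):
    """Return largest partition size divided by the ideal even share.

    1.0 means perfectly even; 2.0 means the biggest partition holds
    twice as many rows as it would under an even distribution.
    """
    total = sum(row_counts)
    if total == 0:
        raise ValueError("no rows in any partition")
    ideal_share = total / len(row_counts)
    return max(row_counts) / ideal_share


# Invented per-partition row counts for an 8-partition dataset.
even_counts = [100, 100, 100, 100, 100, 100, 100, 100]
skewed_counts = [100, 100, 100, 100, 100, 100, 100, 900]

print("even:  ", partition_skew(even_counts))
print("skewed:", partition_skew(skewed_counts))
```

Skew in the dataset comes from how it was partitioned when it was written, so a high ratio points at the job that populates the dataset, not at the job that reads it.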


Posted: Thu Nov 09, 2006 2:06 am
by adasgupta123
Hi,

I have checked the mount points. Data is well distributed across all 8 nodes. One thing I should mention is that the runtime column propagation option is enabled. Is it delaying the read process?

Posted: Thu Nov 09, 2006 7:51 am
by tagnihotri
RCP should not affect performance. If the data is well distributed and the file mounts are set up properly (i.e. an individual filesystem mount for each node), are you sure that the issue is in reading the dataset?

The performance issue may be caused by some other processing in your job. How exactly did you conclude that the dataset read is to blame? Can you elaborate, please?

Posted: Thu Nov 09, 2006 10:16 am
by adasgupta123
Basically, we are handling a huge amount of data every day (around 300 GB!), and it is getting larger every month.

In most of the jobs a dataset is both the first stage and the final output stage, i.e. the output dataset of one job acts as the input to the next job.
The jobs mainly contain join and transformation stages. In some cases there are also funnel and filter stages.

I suspect a dataset read problem because the number of rows per second on the output links of all the other stages is much higher than on the output link of the dataset.

Posted: Thu Nov 09, 2006 1:08 pm
by ray.wurlod
Etiquette Note
It is not necessary to overquote all previous replies - they're there in the thread. Also, using Quote severely restricts your ability to earn points.

Posted: Thu Nov 09, 2006 11:45 pm
by tagnihotri
Ray, I will keep this in mind from now on. Thanks!

Adasgupta,
Can you please describe your job design in detail?

Posted: Fri Nov 10, 2006 12:46 am
by ray.wurlod
Rows/sec is an almost completely meaningless metric. Various factors influence it, usually negatively, such as row width, network bottlenecks, the clock still running after all rows have been processed, and so on. I have posted before on this. There can be no such thing as an answer to the question "what is a typical rows/sec?". The main way to increase the read rate from a Data Set is to increase buffer sizes and not to have any slower stage types downstream of it. But sometimes you just have to. All else being equal, minimize the time taken by ensuring that rows are distributed equally across all partitions when the Data Set is populated.
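
To make the point about rows/sec concrete, here is a small arithmetic sketch in Python. All figures are invented, and the function names are hypothetical; it is only meant to show how the same byte throughput can produce very different rows/sec figures depending on row width, and how a clock that keeps running drags the reported rate down.

```python
def reported_rows_per_sec(rows, clock_seconds):
    """Rows divided by elapsed clock time, as shown on a link."""
    return rows / clock_seconds


def bytes_per_sec(rows, row_width_bytes, clock_seconds):
    """Actual volume moved per second of clock time."""
    return rows * row_width_bytes / clock_seconds


# Case 1: narrow rows (100 bytes each), 10 million rows read in 10 seconds.
narrow_rate = reported_rows_per_sec(10_000_000, 10)
narrow_bytes = bytes_per_sec(10_000_000, 100, 10)

# Case 2: wide rows (2000 bytes each), 500 thousand rows read in 10 seconds.
wide_rate = reported_rows_per_sec(500_000, 10)
wide_bytes = bytes_per_sec(500_000, 2000, 10)

# Case 3: same rows as case 1, but the clock keeps running for 100 seconds
# because a slower downstream stage holds the job open.
stalled_rate = reported_rows_per_sec(10_000_000, 100)

print("narrow rows :", narrow_rate, "rows/sec,", narrow_bytes, "bytes/sec")
print("wide rows   :", wide_rate, "rows/sec,", wide_bytes, "bytes/sec")
print("slow clock  :", stalled_rate, "rows/sec")
```

Cases 1 and 2 move exactly the same number of bytes per second, yet their rows/sec figures differ by a factor of 20, and case 3 reports a tenth of case 1 for the same read. Comparing the rows/sec of a dataset link with that of other links in the job therefore says little about where the time is going.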