Slow Writing to Hashed files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Slow Writing to Hashed files

Post by asitagrawal »

Hi,

The job design is this:
http://asit.agrawal.googlepages.com/J201.JPG

Input:
INPUT has 20-30 million rows (approx.), with 16 fields.
The INPUT has 8 key fields.

Process:
Each input row's keys are looked up in the Instance hashed file, and
rows whose keys match are written to the output links.

Output:
The 8 keys are written to LineKeys_RowNum_INSTANCE,
along with a 9th field, which is @INROWNUM.


As mentioned earlier, the INPUT has 16 fields = 8 key fields + 8 non-key fields.
Two of the 8 non-key fields are written to each of the four hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE,
along with @INROWNUM as the key field.

@INROWNUM is the only key field for the hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE.

Property Selection:
Allow stage write cache is selected for all 5 hashed files.
Create file is also selected for all 5 hashed files.
Pre-load file to memory is enabled for the Instance hashed file.

The read cache and write cache sizes are both 512 MB.

There are 10 parallel instances of this job running.
At most, 50% of the input rows will be eligible to be written
to the hashed files, i.e. approx. 10 million.

Problem:

By observing the Monitor, I find that even after the expected number
of input rows has been read (the input row count no longer changes),
the job still takes a long time to finish;
i.e. it is slow to complete the writes to the hashed files.

So, please advise on how to improve the hashed-file write performance.


Some Statistics:
For writing 5M rows, time taken = 1 hour 06 mins
For writing 2.8M rows, time taken = 56 mins





Thanks.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: Slow Writing to Hashed files

Post by gateleys »

If you think that matching of 8 key columns is causing the delay, did you try computing a CRC32 for the concatenated key columns and comparing them, rather than string comparisons? Just a thought.
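For instance (just a sketch; the column names In.Key1 to In.Key8 and the "|" delimiter are placeholders), the key derivation in the Transformer could be:

    CRC32(In.Key1 : "|" : In.Key2 : "|" : In.Key3 : "|" : In.Key4 : "|" : In.Key5 : "|" : In.Key6 : "|" : In.Key7 : "|" : In.Key8)

Comparing one 32-bit value is cheaper than comparing eight separate strings, and the delimiter keeps concatenations like "AB"+"C" and "A"+"BC" from producing the same input.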

gateleys
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Then for this, the Instance hashed file would have to store the CRC value instead of the 8 key fields, right??

Or please elaborate a little more on the approach you suggested...
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi All,

A little more info :


The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis

Post by trobinson »

Have you analyzed the hashed files' structure, i.e. the modulo and separation? Does checking Create File still create the default dynamic hashed files, or have you tuned them?
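For example (a sketch; run at the engine's TCL prompt in the account that holds the files, and treat the file name and sizing value as purely illustrative):

    ANALYZE.FILE Line1_1

If the dynamic file shows heavy splitting or overflow, pre-size it at creation time:

    CREATE.FILE Line1_1 DYNAMIC MINIMUM.MODULUS 40000

A dynamic file left at the default minimum modulus has to split groups continually while millions of rows are loaded; pre-sizing avoids most of that work.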
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: Slow Writing to Hashed files

Post by kcbland »

asitagrawal wrote: Allow stage write cache is selected for all 5 hashed files.
The read cache and write cache sizes are both 512 MB.

There are 10 parallel instances of this job running.
At most, 50% of the input rows will be eligible to be written
to the hashed files, i.e. approx. 10 million.

By observing the Monitor, I find that even after the expected number
of input rows has been read (the input row count no longer changes),
the job still takes a long time to finish;
i.e. it is slow to complete the writes to the hashed files.
Hmmm, a bunch of job instances writing to the same hashed files with write-delayed caching process all of their rows and then just seem to sit there.

This is a great example of a "cache flush". All rows are sitting in cache, and now they have to be written to disk. At some point you do have to wait while the data is written to disk, and that point is now. There's not much to tune, except to buy faster disks. Use Performance Monitor and watch your CPU and disk utilization during the run.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
asitagrawal
Premium Member
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hey kcbland,

I think you assumed something wrong..... :!:
In the design, the output hashed files are named Line1_INSTANCE, i.e. all 10 instances create their own individual hashed files....
No two instances of the job are writing into the same hashed file....
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

asitagrawal wrote: Hi All,
The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.
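For example (just a sketch; KeyA through KeyD are placeholders for the 4 shared key columns), the identical expression would be used in the job that builds the Instance hashed file and in the lookup key derivation:

    CRC32(In.KeyA : "|" : In.KeyB : "|" : In.KeyC : "|" : In.KeyD)

The expression must be byte-for-byte the same on both sides, otherwise the CRC values will never match.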
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

asitagrawal wrote: In the design, the output hashed files are named Line1_INSTANCE, i.e. all 10 instances create their own individual hashed files....
No two instances of the job are writing into the same hashed file....
What do the performance stats for the output links show? Are they constant for all the files, or is the delay due to some particular link? If so, you can check that particular file to improve the performance. You can alternatively check this by removing that particular link from the job and re-running, or by giving FALSE as the constraint to restrict the rows to that link.
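For that quick test (a sketch; it simply stops rows flowing down the suspect link), the link constraint in the Transformer would be:

    @FALSE

If the job then finishes quickly, the bottleneck is that link's hashed file.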
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

kumar_s wrote:
asitagrawal wrote: Hi All,
The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.
Just wanted to confirm that, in this approach, the lookup hashed file has to have the CRC value as the key, right?? I.e. while creating the lookup hashed file,
I'll calculate the CRC value of the concatenated key fields, so this CRC value becomes the key field and the individual fields become normal data fields....

Please correct me if I am wrong.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You are right: CRC32 gives you a single field value for a combination of several fields, so you can reduce the overhead of doing the lookup on all the fields. But while preparing the CRC field you need to make sure the data type and length are the same on both sides; otherwise you will end up with different values and lookup mismatches.
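For example (a sketch; Trim and the "|" delimiter are just one way to normalize, and the column names are placeholders), build the string identically on both sides before applying the CRC:

    * Trim each field so stray spaces cannot change the CRC value,
    * then concatenate with a fixed delimiter.
    KeyString = Trim(In.KeyA) : "|" : Trim(In.KeyB) : "|" : Trim(In.KeyC) : "|" : Trim(In.KeyD)
    LookupKey = CRC32(KeyString)

This could sit in a small BASIC routine, or be split across two Transformer stage variables (without the comment lines).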
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hey Kumar,

What is your opinion about the use of @INROWNUM and its effect on the performance?

Don't you feel that it could also be a possible culprit?

Also, I feel that I should try increasing the write cache. The RAM I have is 12 GB, but I can't increase the cache beyond 999 MB... can I change it from somewhere else?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved, that particular macro shouldn't cause the slowdown.
As I said, to nail down the culprit you can remove links and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

kumar_s wrote: If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved, that particular macro shouldn't cause the slowdown.
As I said, to nail down the culprit you can remove links and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Yes, your understanding is absolutely correct... so I'll start nailing it down with the help of the performance stats. Thanks, Kumar!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

asitagrawal wrote: I think you assumed something wrong..... :!:
Doesn't matter: you are still seeing the wait incurred while the cache flushes. I've never seen a burning need to turn that on; try it with it off and see what the difference is.
-craig

"You can never have too many knives" -- Logan Nine Fingers