Slow Writing to Hashed files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Slow Writing to Hashed files

Post by asitagrawal »

Hi,

The job design is this:
http://asit.agrawal.googlepages.com/J201.JPG

Input:
INPUT has 20-30 million rows (approx.), with 16 fields.
The INPUT has 8 key fields.

Process:
Each input row's keys are looked up in the Instance hashed file, and
rows whose keys match are written to the output links.

Output:
The 8 keys are written to LineKeys_RowNum_INSTANCE,
along with a 9th field, which is @INROWNUM.


As mentioned earlier, the INPUT has 16 fields = 8 key fields + 8 non-key fields.
Two of the 8 non-key fields are written to each of the four hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE,
along with @INROWNUM as the key field.

@INROWNUM is the only key field for the hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE.

Property Selection:
Allow stage write cache is selected for all 5 hashed files.
Create file is also selected for all 5 hashed files.
Pre-load file to memory is enabled for the Instance hashed file.

The read cache and write cache sizes are both 512 MB.

There are 10 parallel instances of this job running.
At most, 50% of the input rows will be eligible to be written
to the hashed files, i.e. approx. 10 million.

Problem:

By observing the Monitor, I find that even after the expected number
of input rows has been read (the input row count no longer changes),
the job still takes a long time to finish;
i.e. it is slow to complete the writes to the hashed files.

So, please advise on how to improve the hashed-file write performance.


Some Statistics:
For writing 5M rows, time taken = 1 hour 06 mins
For writing 2.8M rows, time taken = 56 mins





Thanks.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: Slow Writing to Hashed files

Post by gateleys »

If you think that matching of 8 key columns is causing the delay, did you try computing a CRC32 for the concatenated key columns and comparing them, rather than string comparisons? Just a thought.
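For instance (just a sketch; the column names In.Key1 to In.Key8 and the "|" delimiter are placeholders), the key derivation in the Transformer could be:

    CRC32(In.Key1 : "|" : In.Key2 : "|" : In.Key3 : "|" : In.Key4 : "|" : In.Key5 : "|" : In.Key6 : "|" : In.Key7 : "|" : In.Key8)

Comparing one 32-bit value is cheaper than comparing eight separate strings, and the delimiter keeps concatenations like "AB"+"C" and "A"+"BC" from producing the same input.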

gateleys
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Then for this, the Instance hashed file would have to store the CRC value instead of the 8 key fields, right??

Or please elaborate a little more on the approach you suggested...
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi All,

A little more info :


The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis

Post by trobinson »

Have you analyzed the hashed files' structure, i.e. the modulo and separation? Does checking Create File still create the default dynamic hashed files, or have you tuned them?
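For example (a sketch; run at the engine's TCL prompt in the account that holds the files, and treat the file name and sizing value as purely illustrative):

    ANALYZE.FILE Line1_1

If the dynamic file shows heavy splitting or overflow, pre-size it at creation time:

    CREATE.FILE Line1_1 DYNAMIC MINIMUM.MODULUS 40000

A dynamic file left at the default minimum modulus has to split groups continually while millions of rows are loaded; pre-sizing avoids most of that work.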
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: Slow Writing to Hashed files

Post by kcbland »

asitagrawal wrote: Allow stage write cache is selected for all 5 hashed files.
The read cache and write cache sizes are both 512 MB.

There are 10 parallel instances of this job running.
At most, 50% of the input rows will be eligible to be written
to the hashed files, i.e. approx. 10 million.

By observing the Monitor, I find that even after the expected number
of input rows has been read (the input row count no longer changes),
the job still takes a long time to finish;
i.e. it is slow to complete the writes to the hashed files.
Hmmm, a bunch of job instances writing to the same hashed files with write-delayed caching process all of their rows and then just seem to sit there.

This is a great example of a "cache flush". All rows are sitting in cache, and now they have to be written to disk. At some point you do have to wait while the data is written to disk, and that point is now. There's not much to tune, except to buy faster disks. Use Performance Monitor and watch your CPU and disk utilization during the run.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
asitagrawal
Premium Member
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hey kcbland,

I think you assumed something wrong..... :!:
In the design, the output hashed files are named Line1_INSTANCE, i.e. all 10 instances create their own individual hashed files....
No two instances of the job are writing into the same hashed file....
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

asitagrawal wrote: Hi All,
The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.
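For example (just a sketch; KeyA through KeyD are placeholders for the 4 shared key columns), the identical expression would be used in the job that builds the Instance hashed file and in the lookup key derivation:

    CRC32(In.KeyA : "|" : In.KeyB : "|" : In.KeyC : "|" : In.KeyD)

The expression must be byte-for-byte the same on both sides, otherwise the CRC values will never match.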
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

asitagrawal wrote: In the design, the output hashed files are named Line1_INSTANCE, i.e. all 10 instances create their own individual hashed files....
No two instances of the job are writing into the same hashed file....
What do the performance stats for the output links show? Are they constant for all the files, or is the delay due to some particular link? If so, you can check that particular file to improve the performance. You can alternatively check this by removing that particular link from the job and re-running, or by giving FALSE as the constraint to restrict the rows to that link.
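For that quick test (a sketch; it simply stops rows flowing down the suspect link), the link constraint in the Transformer would be:

    @FALSE

If the job then finishes quickly, the bottleneck is that link's hashed file.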
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

kumar_s wrote:
asitagrawal wrote: Hi All,
The input has 8 key fields, but the reference has just 4,
so the match is on only 4 key fields, not all 8.
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.
Just wanted to confirm that, in this approach, the lookup hashed file has to have the CRC value as the key, right?? I.e. while creating the lookup hashed file,
I'll calculate the CRC value of the concatenated key fields, so this CRC value becomes the key field and the individual fields become normal data fields....

Please correct me if I am wrong.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You are right: CRC32 gives you a single field value for a combination of several fields, so you can reduce the overhead of doing the lookup on all the fields. But while preparing the CRC field you need to make sure the data type and length are the same on both sides; otherwise you will end up with different values and lookup mismatches.
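For example (a sketch; Trim and the "|" delimiter are just one way to normalize, and the column names are placeholders), build the string identically on both sides before applying the CRC:

    * Trim each field so stray spaces cannot change the CRC value,
    * then concatenate with a fixed delimiter.
    KeyString = Trim(In.KeyA) : "|" : Trim(In.KeyB) : "|" : Trim(In.KeyC) : "|" : Trim(In.KeyD)
    LookupKey = CRC32(KeyString)

This could sit in a small BASIC routine, or be split across two Transformer stage variables (without the comment lines).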
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hey Kumar,

What is your opinion about the use of @INROWNUM and its effect on the performance?

Don't you feel that it could also be a possible culprit?

Also, I feel that I should try increasing the write cache. The RAM I have is 12 GB, but I can't increase the cache beyond 999 MB... can I change it from somewhere else?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved, that particular macro shouldn't cause the slowdown.
As I said, to nail down the culprit you can remove links and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

kumar_s wrote: If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved, that particular macro shouldn't cause the slowdown.
As I said, to nail down the culprit you can remove links and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Yes, your understanding is absolutely correct... so I'll start nailing it down with the help of the performance stats. Thanks, Kumar!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

asitagrawal wrote: I think you assumed something wrong..... :!:
Doesn't matter: you are still seeing the wait incurred while the cache flushes. I've never seen a burning need to turn that on; try it with it off and see what the difference is.
-craig

"You can never have too many knives" -- Logan Nine Fingers