Slow Writing to Hashed files
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Hi,
The Job design is this :
http://asit.agrawal.googlepages.com/J201.JPG
Input:
INPUT has 20-30 million rows (approx.), with 16 fields.
The INPUT has 8 key fields.
Process:
The input row's keys are matched against the Instance hashed file, and
rows whose keys match are written to the output links.
Output:
The 8 keys are written to LineKeys_RowNum_INSTANCE,
along with a 9th field, which is @INROWNUM.
As mentioned earlier, the INPUT has 16 fields = 8 key fields + 8 non-key fields.
Two of those 8 non-key fields are written to each of the four hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE,
along with @INROWNUM as the key field.
@INROWNUM is the only key field for the hashed files
Line1_INSTANCE, Line2_INSTANCE, Line3_INSTANCE and Line4_INSTANCE.
Property Selection:
Allow stage write cache is selected for all 5 hashed files.
Create file is also selected for all 5 hashed files.
Pre-load file to memory is enabled for the Instance hashed file.
Read cache and write cache size is 512 MB.
There are 10 parallel Instances running for this job.
At most 50% of the input rows (approx. 10 million) will be eligible
to be written to the hashed files.
Problem:
By observing the Monitor, I find that even after the expected number
of input rows has been read, the job takes much longer to finish.
That is, I see no change in the number of input rows being read,
because the expected number has already been read, yet the job
does not finish writing to the hashed files quickly.
So, please advise on how to improve the performance of writing to the hashed files.
Some Statistics:
For writing 5M rows, time taken = 1 hour 06 mins
For writing 2.8M rows, time taken = 56 mins
Thanks.
Re: Slow Writing to Hashed files
If you think that matching of 8 key columns is causing the delay, did you try computing a CRC32 for the concatenated key columns and comparing them, rather than string comparisons? Just a thought.
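gateleys's idea can be sketched outside DataStage. Below is a minimal Python illustration (Python standing in for DataStage BASIC; the column names and rows are hypothetical): the 8 key columns are concatenated with a delimiter, hashed with CRC32, and the single CRC value becomes the lookup key.

```python
import zlib

# Hypothetical key column names standing in for the job's 8 key fields.
KEY_COLS = ["k1", "k2", "k3", "k4", "k5", "k6", "k7", "k8"]

def crc_key(row):
    # Delimit the columns so ("AB", "C") and ("A", "BC") do not collide trivially.
    joined = "\x1f".join(str(row[c]) for c in KEY_COLS)
    return zlib.crc32(joined.encode("utf-8"))

# Reference side: index the lookup rows by their CRC once, up front.
ref_rows = [dict({c: n for n, c in enumerate(KEY_COLS)}, payload="matched")]
reference = {crc_key(r): r for r in ref_rows}

# Input side: one integer comparison per row instead of 8 string comparisons.
input_row = {c: n for n, c in enumerate(KEY_COLS)}
match = reference.get(crc_key(input_row))
print(match["payload"] if match else "no match")
```

Note that CRC32 is a hash, not a unique identifier, so rare collisions are possible; a cautious job would still verify the underlying key columns on a CRC hit.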
gateleys
Re: Slow Writing to Hashed files
Hmmm, so a bunch of job instances writing to the same hashed files using write-delayed caching process all of their rows and then just seem to sit there.

asitagrawal wrote:
Allow stage write cache is selected for all 5 hashed files.
Read Cache and Write Cache size is 512 MB.
There are 10 parallel Instances running for this job.
At most 50% of the input rows (approx. 10 million) will be eligible
to be written to the hashed files.
By observing the Monitor, I find that even after the expected number
of input rows has been read, the job takes much longer to finish.
The count of input rows read stops changing, yet the job does not
finish writing to the hashed files quickly.
This is a great example of a "cache-flush". All rows are sitting in cache, now they have to write to disk. At some point you do have to wait while the data is written to disk, which is now. There's not much to tune, except to buy faster disks. Use Performance Monitor and watch your cpu and disk utilizations during the run.
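The cache-flush effect Ken describes can be mimicked outside DataStage. A small Python sketch (assumptions: a 1 MB OS-level write buffer stands in for the 512 MB stage write cache, and the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hashed_file.dat")

# Open with a 1 MB write buffer: rows accumulate in memory, not on disk.
f = open(path, "wb", buffering=1024 * 1024)
for i in range(10_000):
    f.write(b"row-%d\n" % i)  # ~90 KB total, well under the buffer size

cached_size = os.path.getsize(path)   # still 0: nothing flushed yet

f.flush()                             # the "cache flush" phase the job waits in
f.close()
flushed_size = os.path.getsize(path)

# All rows were "processed" long before any byte reached the file.
print(cached_size, flushed_size)
```

This is why the Monitor's row counts stop moving while the job keeps running: the links have finished, and the remaining time is pure disk I/O that only faster disks (or disabling the write cache) can change.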
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.

asitagrawal wrote:
Hi All,
The Input has 8 key fields, but the reference has just 4,
so it is a match of only 4 key fields, not all 8.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
What do the performance stats for the output links show? Are they constant for all the files, or is the delay due to some particular link? If so, you can check that particular file to improve the performance. You can alternatively verify this by removing that particular link from the job and re-running, or by giving FALSE as the constraint to restrict the rows to that link.

asitagrawal wrote:
In the design, the output hashed files are named Line1_INSTANCE, i.e. all 10 Instances are creating their individual hashed files. No two instances of the job are writing into the same hashed file.
Just wanted to confirm that, in this approach, the lookup hashed file has to have the CRC value as the key, right? That is, while creating the lookup hashed file, I will calculate the CRC value of the concatenated key fields, so this CRC value becomes the key field and the individual fields become normal data fields.
Please correct me if I am wrong.

kumar_s wrote:
In this case the CRC should be computed over those 4 columns, on both the input and the Instance lookup.
You are right. CRC32 gives you a single-field value for a combination of several fields (note it is a hash, so collisions are theoretically possible), so you can reduce the overhead of doing the lookup on all the fields. But you need to make sure that while preparing the CRC field, the data type and length are the same on both sides; otherwise you will end up with different values and lookup mismatches.
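kumar_s's warning about data type and length can be made concrete. A hypothetical Python sketch (zlib.crc32 standing in for the DataStage CRC32 routine, sample values invented): formatting the fields identically on both sides gives matching CRCs, while a stray padding difference silently breaks the lookup.

```python
import zlib

def key_crc(*fields):
    # Join with a delimiter so ("AB", "C") and ("A", "BC") hash differently.
    return zlib.crc32("|".join(str(f) for f in fields).encode("utf-8"))

# Consistent formatting on both sides -> the CRCs agree (str(42) == "42").
same = key_crc("CUST01", 42) == key_crc("CUST01", "42")

# A trailing-space (length/padding) difference changes the bytes -> mismatch.
padded = key_crc("CUST01 ", 42) == key_crc("CUST01", 42)

print(same, padded)  # True False
```

Trimming and padding both sides to an identical canonical form before hashing avoids the mismatch.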
Hey Kumar,
What is your opinion about the use of @INROWNUM affecting the performance? Don't you feel that it could also be a possible culprit?
Also, I feel that I should try increasing the write cache. The RAM I have is 12 GB, but I can't increase the cache beyond 999 MB. Can I change it from somewhere else?
If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved there, that particular macro shouldn't cause the slog.
As I said, to nail down the culprit you can remove one link at a time and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Yes, your understanding is absolutely correct, so I will start nailing it down with the help of the performance stats. Thanks, Kumar!

kumar_s wrote:
If I understand properly, you are writing the row number using @INROWNUM in addition to the existing columns, right? If there is not much transformation involved there, that macro shouldn't cause the slog. As I said, to nail down the culprit you can remove one link at a time and check the statistics to find where the actual bottleneck is. You can create a copy of the current job for quick testing.
Doesn't matter - you are still seeing the wait incurred while the cache flushes. I've never seen a burning need to turn write caching on; try it with it off and see what the difference is.

asitagrawal wrote:
I think you assumed something wrong...
-craig
"You can never have too many knives" -- Logan Nine Fingers