Server hit by lots of log queries

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ken,
So, if you can tell me how you found out it is an RT_LOGxxx file, I sure would appreciate it.
By reading the replies I can see how this poster determined it was the connection between the client and the server, but to be accurate you should use software designed specifically to monitor this type of traffic.

With my current customer we experienced severe overhead when trying to open, save, compile or export jobs based on a large CFF data stream. These records were 3636 bytes with 2711 columns. Sometimes just saving these jobs would cause a timeout and failure. Ascential has since fixed this (via vmdsrpos.dll) and our problem has been solved.

Now to the point: we have tools internally that monitor network traffic. We used these tools to monitor DS and the server and found that there was a tremendous amount of communication happening between the client and the server, and that it happens in 32K blocks (most other COTS apps deployed on our servers were communicating in much larger blocks).

As a benefit of this, we could see that the communication between the Director and the server was especially heavy when running multiple jobs, and the trace showed which file systems and files (hashed, sequential or relational) were being used. It also showed the actual processes started by DS to perform a given task.

Sorry I was so long-winded, but the point is that the DS client is very fat and spends much of its waking time communicating with the server in very small packets. As soon as we applied the vmdsrpos.dll patch, speed increased in both the Designer and the Director.

Regards,
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Ken,
we did find that the prod server actually runs faster, as expected. That was just the trigger that exposed this heavy load on the DS prod server: since they thought something was wrong they started checking, and now I need to determine whether there is anything wrong in this behaviour or whether it is just how it works (it happens on the development server as well).
mhester,
there is no client issue here at all, since the whole thing is automated and no client is ever opened with the production server as its target.

This is purely server side.
The thing is that these multiple 2K reads and writes are quite heavy on the server machine.
Now, if there is something that can be done about this I'll gladly do it, but as of now I'm not sure there is anything we can do.

I'll try to get Ascential on this and see what they have to say about it.

Thanks guys :)
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Actually, the poster Roy stated:
The server actually has no director open usually since this is production and I can also set it to refresh in large intervals.
After so much conversation, Roy then said:

Code: Select all

Well the "customer" thought the process is not running on the production server as fast as they expected in acordance to the compared configuration enhancements that the productin system has vrs. the development system. 
This is inadequate for analysis purposes. We need to know exactly how they determined a job is underperforming. So, there absolutely could be a series of issues, from the fat client (AGREE WITH YOU THERE :twisted: ) layered with disk and controller bottlenecks, to user expectations, all layered like an onion.

My suggestion is a seq --> xfm --> seq job to refute all issues. If that job doesn't hog a CPU, then there are outside influences. But if it uses a CPU 100%, then there's no problem. You have to start from there and begin eliminating. This is probably the easiest control-case job to build and test. Then run a whole batch of instantiated copies (a rough job-control sketch follows) to see if log filling is introducing overhead. I did mention that clones all update the master job log, and that could be introducing overhead to a job starting and finishing, as those logs get messages. But Roy needs to take things one step at a time.
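
Purely as an illustration of that idea (the job name TestSeqXfmSeq, the invocation IDs and the count of 30 are placeholders, not anything from this thread, and error handling is omitted), the batch of clones could be fired off from job control like this:

Code: Select all

* Hedged sketch: start 30 invocations of a multi-instance test job,
* then wait for them all, so any log-update overhead can be observed.
* If this is compiled as a server routine, add: $INCLUDE DSINCLUDE JOBCONTROL.H
NumInstances = 30
Dim JobHandle(30)
For I = 1 To NumInstances
   * Each invocation gets its own ID, e.g. TestSeqXfmSeq.COPY1
   JobHandle(I) = DSAttachJob("TestSeqXfmSeq.COPY" : I, DSJ.ERRFATAL)
   ErrCode = DSRunJob(JobHandle(I), DSJ.RUNNORMAL)
Next I
For I = 1 To NumInstances
   ErrCode = DSWaitForJob(JobHandle(I))
   ErrCode = DSDetachJob(JobHandle(I))
Next I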

Thanks Mike, one more reason to stay on the latest release. 5.1 is unbearable.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

mhester wrote:Sorry I was so long-winded, but the point is that the DS client is very fat and spends much of its waking time communicating with the server in very small packets. As soon as we applied the vmdsrpos.dll patch, speed increased in both the Designer and the Director.
This patch -- what version of DataStage was it applied for?

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

I understand you, Ken.
The whole thing was that they thought (never mind how they came to think it) that the server was not fast enough, and they were proven mistaken.
Now, while probing the DS server, they saw this heavy load as something odd. From there I came into the picture, in order to find out what, if anything, can be done about it (yes, I'll do as you said and build a pure CPU job, so no need for the grumpy icon :wink: ).

Now I only need to find out: is this just how it works?
Since you, Ken, and others (Ray in there as well) know DS so well, I was wondering if you might be able to say whether this is how it should work as far as you know (the multiple reads/writes to the log in 2K blocks)?
Or does it seem odd in your experience as well?

Thanks for being patient with me, Ken :) (appreciated :!: )
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Roy, what you are saying is that there is no issue. :?: Now you just need to know whether the log updating in 2K blocks is okay, typical, and not a problem? Ray, Kim, and Mike can probably best answer the technical questions about how hashed files are operated on.

I can say this, though: if it's not a problem, don't fix it. If your jobs are okay, then I'll move on.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Dynamic hashed files (the kind used by DataStage for repository tables) are updated in 2KB blocks unless their GROUP.SIZE parameter is set to use 4KB blocks, which is not the case here. Though you can change it, it would be of no benefit for log files, since each event goes to a different group because of the hashing algorithm.
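
If anyone wants to confirm what a given log file is actually using, the standard ANALYZE.FILE verb at the TCL prompt reports the file type, hashing algorithm, minimum modulus and group size. RT_LOG123 below is just a made-up job number; substitute the RT_LOGxx of the job in question.

Code: Select all

ANALYZE.FILE RT_LOG123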

When the log file needs to increase its size, there will be more I/O activity; again in multiples of the group size. During this process a group latch is taken on the log file's header. This can slightly slow down access to the file by multiple multi-instance jobs. There may be value, for multi-instance jobs, in pre-allocating log file space using the RESIZE verb (in which, for the UV gurus out there, the modulo parameter is assumed to alter the minimum modulus setting of the dynamic hashed file).
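
Purely as an illustration of that suggestion (RT_LOG123 and the modulo of 2000 are invented values, and the exact syntax should be checked on a test project first), the pre-allocation might be done at TCL like this:

Code: Select all

RESIZE RT_LOG123 * 2000 *
As noted above, for a dynamic file the modulo position should be treated as the new minimum modulus, so the groups get allocated up front rather than while the jobs are writing log events.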

There is potential scope for improving the performance of log files (such as using a SEQ.NUM hashing algorithm on creation), but they're technically not visible so why would you?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Thanks guys :),
So, to sum up, if I understand correctly: the fact that 30 or so instances of the same batch job run in parallel may cause the system to do these 11,000 writes and 11,000 reads to the log in 10 seconds, and it is no indication of anything wrong.

Well, good to know (I never thought any post of mine would generate this many replies).

Thanks a lot :)
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

Roy,

The heavy traffic (read/write) on the log files was implied by the truss utility (100MB reads per second !!! - so they say) - a fact that has to be re-checked. Did they mention it ??!!

ArieAR
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

roy wrote:So, to sum up, if I understand correctly: the fact that 30 or so instances of the same batch job run in parallel may cause the system to do these 11,000 writes and 11,000 reads to the log in 10 seconds, and it is no indication of anything wrong.
30 jobs in parallel?

Wow, I want that computer!

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Well Arie (welcome back, I hope you had time for fun on your trip),
I saw it with my own eyes, unless that truss utility is not accurate.
Teej, we are not talking about 30 heavy-duty DWH jobs for fact tables, just quite simple multiple instances of basic control jobs and their source-to-target table handling "children" jobs.
And you might be surprised to hear it is probably a "weaker" configuration than the one you use, at least for now.
If you wish you had that server, think again; I doubt it would be faster with the loads you process in your project. Then again, I have no idea what you do (except that you use Enterprise Edition, which is enough for me to assume this ;)).
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Roy

You may have a little overhead on multiple instance jobs. Multiple instance jobs share the same log file RT_LOGxx, where xx is the job number. They also share RT_STATUSxx and RT_CONFIGxx. The RT_LOG file has to keep track of the instance ID used, so it writes 2 records for each event. I do not have a way to check this, but from memory there is at least one record and maybe 2 records in RT_STATUSxx per instance. I think this record keeps track of the first and last record in RT_LOGxx used by that instance. The instance is also stored as a field in RT_LOGxx. So that means one or two records get written in RT_STATUSxx for each RT_LOGxx write. That is quite a bit of overhead.

If you could put these 2 files in memory then you should see a performance gain. Ray or Ken may know how to put these files in memory. I have never tried anything like that, but it sounds like fun.

I wish I had a way to give you more accurate answers but I am sure someone can validate what I said. If you edit all the RT_STATUSxx records you will see what I mean and can accurately figure out these records.

Thanks Kim.
Mamu Kim
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

kduke wrote: If you could put these 2 files in memory then you should see a performance gain. Ray or Ken may know how to put these files in memory. I have never tried anything like that, but it sounds like fun.
Thanks Kim.
Ray or Ken... let us in! Tell us how, if you haven't already posted it in another thread.

It will give Kim something else to dream about at night other than surrogate key management.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Byron

Figure it out. Create a multiple instance job. Run it a couple of times. LIST DS_JOBS to get the job number, then ED RT_STATUSxx *.
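
For example (the job name and the resulting number 123 are invented here, and JOBNO is the DS_JOBS field I believe holds the job number, so verify it against your own repository):

Code: Select all

LIST DS_JOBS 'MyMultiJob' JOBNO
ED RT_STATUS123 *
If JOBNO comes back as 123, then RT_STATUS123 is the file to edit.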

Besides I don't even care. I am just trying to help. I sure don't dream about this stuff. Life is too short to make this stuff too important.

Thanks Kim.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Well you could investigate the SET.MODE verb to try to enable read caching of the hashed files in question (see http://dsxchange.com/viewtopic.php?t=86342 for example), but there are no guarantees that this will work.

In fact, I would be surprised if it did. To make use of hashed files cached in memory "they" had to change the underlying code in the Hashed File stage (this is way back, about version 1.2) to use C functions, because there are no functions in BASIC for manipulating memory.

:!: :!: :!: WARNING :!: :!: :!:

You are on your own here. If it turns out that enabling read cache disables BASIC access to hashed files, you will need to disable it again (using the SET.MODE verb). In particular, I will accept no responsibility (legal or otherwise) relating to your use or misuse of this suggestion.

There is no documentation at all on the SET.MODE verb. It is an Ascential addition to the ex-UniVerse engine, and therefore does not occur in the IBM UniVerse manuals. Syntax (which is reported if you enter the verb alone) is

Code: Select all

SET.MODE filename  [WRITE.CACHE | WRITE.CACHE.DEFER | READ.ONLY | READ.WRITE | INFORM] [ VERBOSE ]
Safe example:

Code: Select all

SET.MODE RT_CONFIG91 INFORM
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.