Server hit by lots of log queries

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Server hit by lots of log queries

Post by roy »

Hi y'all,

One of the projects I support has a major issue with DS Server load.
They managed to narrow the problem down to lots and lots of short log file sampling processes done by uv processes.
The unique thing about this project is that it uses DS for loading tables at least 5 times an hour (yes, not pure DWH, I know).

So does anyone have any idea what might cause this?
Has anyone had this issue?
Any ideas would be appreciated.
(I'm on a course right now, so I will look into it tomorrow.)

Thanks in advance,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Your situation is not quite clear. If what you are saying is that the uv processes (release 5 DataStage threads) are hitting the database with a lot of individual SQL statements, then yes, that is the way it works.

If you have 1 million rows in a source sequential file that sends data to a transformer with an ODBC/OCI reference lookup, and outputs to a sequential file, then you will have 1 million distinct SQL queries against the database. Even if all 1 million queries are identical, you will have 1 million statements issued. If you're lucky, query cache will kick in. If not, you undergo the query and all the labor involved. If you instantiated the job to divide and conquer the source data, then you have N jobs all issuing 1 million/N queries, but it's still 1 million queries.

This is why I constantly labor to show the benefit of scanning your source data for natural keys, putting those into a work table in the target database, inner joining that work table to the target table, and unloading those rows and ultimately putting them into a hash file. That means 1 query, with the ability to parallelize, with a concentrated stream of spooling data, into a hash file that has read cache capability, shared cache capability, and scales no matter how many instantiated clones are using it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Thanks Ken,
but my situation is not heavy load on the DB instance; it is the DS engine machine that gets the load problem, from many short queries to the DS log files :(
Any ideas?
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

roy wrote:Thanks Ken,
but my situation is not heavy load on the DB instance; it is the DS engine machine that gets the load problem, from many short queries to the DS log files :(
Any ideas?
Ooohhh, that's what you're talking about... There's nothing you can do about it from a DataStage perspective. Hash files get banged around a lot, so about the only thing you can do is make sure you put projects and work areas on file systems that aren't being traced or transactionally logged. In the past, I had situations with clients where DataStage ran at a snail's pace because of this, so we made sure that our mount points were isolated and on dedicated disks and controllers.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Well,
actually it might help to understand what makes these log hits occur.
And is there a way to minimize them, or at least make the log hits less frequent?
Does anyone know anything about that?
Or is it just how things work?

Thanks again.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Having a Director (or more than one) open in log view doesn't help. Switch to Status view, or minimize Director, or (best) close Director.

As far as running jobs hitting the logs is concerned (exactly how did you establish that this is happening?), find out what kind of things are being logged and seek ways to minimize that. For example, do you have routines that call DSLogInfo/DSLogWarn but probably don't need to?
Jobs themselves only log a limited number of events, such as jobs and active stages starting and stopping and link reports.
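If you do have such calls that must stay, you could gate them behind a flag, something along these lines (a rough sketch only; the routine name and the DebugLevel argument are inventions, not anything your project necessarily has):

    * Hypothetical transform routine LogIfDebug(Msg, DebugLevel):
    * write to the job log only when the caller passes a non-zero DebugLevel,
    * so routine chatter can be switched off in production.
    If DebugLevel > 0 Then
       Call DSLogInfo(Msg, "LogIfDebug")
    End
    Ans = @TRUE

With DebugLevel set to 0 via a job parameter in production, the routine writes nothing to the job log.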

Keeping your log files purged as much as possible (even if you archive them prior to purging) will also help.

Periodically clean out old entries from the &PH& and &SAVEDLISTS& directories on the server.
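If you want to automate that housekeeping from DataStage itself, one option could be an after-job or batch routine that shells out with DSExecute; a rough sketch only (the project path and the 7-day retention are assumptions to adjust for your own server):

    * Hypothetical housekeeping routine body: delete &PH& entries older than 7 days.
    ProjDir = "/u1/dsadm/Ascential/DataStage/Projects/MyProject"   ;* assumed path
    Cmd = "find '" : ProjDir : "/&PH&' -type f -mtime +7 -exec rm {} \;"
    Call DSExecute("UNIX", Cmd, Output, RetCode)
    If RetCode <> 0 Then Call DSLogWarn("&PH& cleanup failed: " : Output, "CleanPH")
    Ans = RetCode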
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Thanks Ray,
The server actually has no Director open usually, since this is production, and I can also set it to refresh at large intervals.
But come to think of it, running many jobs every 10 minutes that read/write from DB to file and/or back to DB might actually be the reason, in which case there is nothing really I can do.
I'll check about keeping the log writes from the BASIC code to a minimum.
I was searching for info to get a head start on the visit I will make there tomorrow.
I'll post my findings.

Thanks for the quick replies :)
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Ken,
by:
make sure you put projects and work areas on file systems that aren't being traced or transactionally logged. In the past, I had situations with clients where DataStage ran at a snail's pace because of this, so we made sure that our mount points were isolated and on dedicated disks and controllers.
do you mean to build projects on external storage rather than directly on the DS Server's local disk?

Basically it is a single project so far, so spreading projects among several disks might not solve my issue for now.

Thanks in advance,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

roy wrote: do you mean to build projects on external storage rather than directly on the DS Server's local disk?
Always separate projects from data. As for the location of the project, this really shouldn't be an issue regarding the type of disks, as long as the project is local to the server (not on an NFS mounted filesystem).

For the life of me, I can't figure out how you chased this down to being job logs. Job logs are named RT_LOGxxx, and the physical file is located as a subdirectory within the project. Since each job has its own RT_LOGxxx file, no single job should be crippling the system. Even the cumulative effect of multiple instantiated jobs using the same master job log file shouldn't be a big issue. Now, of course, if that log has 2 million entries in it, I can see how it would be a pig on the system to maintain (dynamic resizing and all).

So, if you can tell me how you found out it is an RT_LOGxxx file I sure would appreciate it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Well Ken, using ps I find the PIDs of my job processes.
Then, using truss, I capture everything that happens for them and their child processes.
This gives me the info: for 10 seconds' worth of job run (for the job and its child processes), about 11,000 reads and about the same number of writes.
Now, if I watch the data going back and forth, it has strings in the format of a log file,
i.e. batch::batchname.invocationid setting nls enus.....
also requesting job run and waiting for job, etc.
This leads me to assume this is log related (unless my assumptions are wrong),
so I asked them to send me a sample of this, and once I get it, it's going to Ascential, so they may explain and see what (if anything) can be done about it.

The reads are done in 2K blocks, though I don't know if there is any need for them to be larger for the info they pass.

That's all I know for now.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

How are you doing job control? Do you use custom-written BASIC logic, sequencers, or plain old BATCH jobs? The brief sample you cited indicates that it is log messages. Now, I noticed you typed batchname.invocationid, so are you running a zillion instantiations of the same job?

You know that the master job gets messages put into its log by all of the clones. If you are not purging that master log manually, it will fill up with a bunch of log messages. I do believe a log only clears after a job runs. If you run invocations/instances/clones of that job, then the master job actually never runs and I wonder if the master log never clears. That log could be massive by the end of a day. You may want to check that out.
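For reference, the kind of BASIC job control I mean looks roughly like this (a sketch only; the job name, invocation id, and caller name are made up):

    * Attach, run and wait for one invocation of a hypothetical multi-instance job.
    $INCLUDE DSINCLUDE JOBCONTROL.H
    hJob = DSAttachJob("LoadTable.CUSTOMERS", DSJ.ERRFATAL)   ;* one clone of LoadTable
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status <> DSJS.RUNOK Then Call DSLogWarn("LoadTable.CUSTOMERS ended with status " : Status, "BatchControl")
    ErrCode = DSDetachJob(hJob)

Every clone you attach and run this way also puts messages into the master job's log, which is exactly where the growth I'm describing comes from.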
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Ken,
As you figured out (and the reason I mentioned that message portion), I use BASIC control jobs and they are multi-instance ones.
No sequencers involved, only pure BASIC wrapper jobs.
This is done to dynamically run the required job(s) to populate the requested table, from source to target, making the turnaround time for adding a new table load as short as possible (around 1 hour! from development start to tests/production availability).
The log purge is set to purge run entries older than 1 day.
And from what I saw around 10:00 AM today, the largest RT_LOG file was 32 MB in size; most were reasonably small.
If this is how DS works, then I guess there is not much I can do about it.
I just hope that the expansion planned for this system won't leave the server unable to keep up.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Okay, but you never really defined what the performance problem is. Are the jobs with "DS server load problems" running slow, and do you believe it's because of the job log interaction? How are you defining "slow"? How are you measuring that? A job of the nature seq --> xfm --> seq should 100% utilize a CPU (single-threaded design, one transformer). If you see the CPU pegged, then what's the problem? Are your designs mixed and matched with DB stages?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Well the "customer" thought the process is not running on the production server as fast as they expected in acordance to the compared configuration enhancements that the productin system has vrs. the development system.
so they started checking what was going on and found this, which seems as heavy load on the server, that needs investigation and perhaps tuning.
so the question is do we have anything to do about it?
in case this is how DS works then the answer would be, we can do nothing except minimize,if possible, (as Ray siad) the number of DSLog calls from the basic code we initiate ourselves.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

roy wrote:Well, the "customer" thought the process was not running on the production server as fast as they expected, given the configuration enhancements that the production system has versus the development system.
So they started checking what was going on and found this, which looks like heavy load on the server that needs investigation and perhaps tuning.
So the question is: is there anything we can do about it?
If this is how DS works, then the answer would be that we can do nothing except minimize, if possible (as Ray said), the number of DSLog calls from the BASIC code we initiate ourselves.
Aiyee, I should have started this whole discussion by asking you how you measured that a job was underperforming. I have an expression for this type of troubleshooting: "Shooting ducks at night". What it means is that we are taking shots at sounds in the air at night. Hear a quack, fire a shot in the general direction.

You absolutely have to determine if a job is underperforming and why. You need to use "prstat -a" and watch a job like seq --> xfm --> seq. If that job totally consumes a CPU, you have no performance issue with your DataStage logs. I don't think you have any performance issue with DataStage. If you are comparing dev and production throughput as apples to apples, that expectation has to be adjusted. If your jobs have OCI/ODBC reference lookups and you're comparing job execution against small datasets in dev and large datasets in production, your team is making a mistake.

Roy, write a simple job that reads a large sequential file, maybe 200K rows of about 200 bytes per row, passes it through a transformer trimming a few columns and UPCASEing a few others, and writes it to another sequential file. Watch prstat as it runs. If the CPU % shows 1/(number of CPUs on your server, per uname -X), then the job is running at full speed and you have no issue other than proper analysis, tuning of jobs, and setting of performance expectations.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle