
Job Run-Time is inconsistent, minutes can turn to hours

Posted: Mon Sep 09, 2013 7:21 am
by epsidude
Hi,

I have a server job which takes 17 minutes, give or take a few, to fully complete. However, there are times it can take hours. If I see that it is running long I can kill it and restart it, and it will complete in the normal 17-minute window.

During these long runs there is no extra volume of data and no other jobs running on the system; even the anti-virus is turned off. The job flow is simple, except that it has a large Hashed File in it. I manually created this Type 30 HF so it could go over the 2GB limit, with a group size of 2.
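
For reference, a pre-sized 64-bit Type 30 hashed file is normally created ahead of time from the engine command line (uvsh or the Administrator client). Roughly, and with the file name purely illustrative, the standard UniVerse CREATE.FILE verb for this looks like:

CREATE.FILE MyBigHash DYNAMIC MINIMUM.MODULUS 600003 GROUP.SIZE 2 64BIT

DYNAMIC gives the Type 30 file, MINIMUM.MODULUS pre-allocates the groups, GROUP.SIZE 2 uses 4096-byte groups, and the 64BIT keyword (assuming the engine release supports 64-bit file addressing) is what lets the data and overflow portions grow past 2GB.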

The HF is cleared and then populated with 9.6 million records from a DRS stage. The source data also comes from a DRS stage and is less than 900k rows. Two different lookups against this HF are performed, and the job then inserts those 900k rows into a table. The target table is truncated prior to loading. Of the 17 minutes in a good run, 5-7 minutes are taken up by analyzing a large table in the post-SQL step.

I am going to check the DB server to see if anything could have been happening there that might have slowed things down, but I just wanted to get people's opinions on this and see where else I could look.

Posted: Mon Sep 09, 2013 7:50 am
by chulett
Hard to say. Sounds more like a lock than a "slow down" unless you've let it run to completion and it has completed after some number of hours?

Do you have any idea what stage it is at when it gets stuck, or what it is doing? Writing to the hashed file? The target? As you said, in your shoes I'd definitely start by involving your DBA when this happens next to see if there's some kind of deadlock in the database. It doesn't sound like this job is the culprit but rather the victim of some other process working on the target table at the same time.

Posted: Mon Sep 09, 2013 8:00 am
by ArndW
To add to what Craig suggested - if you look at the monitor during runtime, can you see any anomalies? Stages that haven't processed rows they should have, or that are processing very slowly?

Posted: Mon Sep 09, 2013 8:02 am
by epsidude
Thank you for the reply. From the look of the log it appears that the Hashed File creation was done and that it was doing the two lookups, and running slowly. What I am not sure of is whether it is due to the DB extract or the actual lookup.

What I could do, I guess, is write the source down to a file first, then read that back and see if this prevents the lookup from being slow.

Could this point to any kind of memory leak? Not that I know much about solving that yet, but just wanted to put it out there for opinions.

Posted: Mon Sep 09, 2013 8:08 am
by epsidude
Thanks ArndW. The job runs at 3:30am, so I have not checked it yet, but I am going to set the alarm clock early and take a look. Unfortunately this does not happen all the time, so it will be hit or miss.

The other thing I notice is high fragmentation on the disk due to the large hashed files. I have pre-allocated the HF and defragged, but after running the update the fragmentation always returns to 30-50%, literally overnight.

Posted: Mon Sep 09, 2013 9:50 am
by ArndW
Any fragmentation will cause only a 'drop in the bucket' difference, since hashed files internally have disk pointers jumping all over the place anyway. Likewise, memory leaks are unlikely to be the culprit.

Are you doing a DELETE or a CLEAR.FILE on your hashed files? If the latter, it will be quicker, but the files will shrink to their minimum definition size.
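
For quick reference, the two options at the engine command line look like this (file name illustrative, and DELETE.FILE assumes an account-based hashed file with a VOC entry):

CLEAR.FILE MyBigHash
DELETE.FILE MyBigHash

CLEAR.FILE empties the data but a dynamic file then drops back to its minimum definition size, while DELETE.FILE removes the file entirely so it has to be recreated, at full size, before the next load.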

Posted: Mon Sep 09, 2013 12:00 pm
by epsidude
I am doing a clear file, so it is shrinking back.

The HF is defined with a minimum modulus of 600,003 and a group size of 2.

So when I run the analyze I get:
728 empty groups
6,393 overflowed groups
58 badly overflowed groups
File size is 2.8GB, with 1.7GB of it unused.
No large records.

Load factors: 80% split, 50% merge, and 38% actual.

I have read several posts here, as well as Ken Bland's article, and am not sure if I should try a different sizing.
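
If I do end up trying a different sizing, my understanding (hedged; the keywords are from the standard UniVerse CONFIGURE.FILE verb, and the numbers are purely illustrative) is that the tuning parameters of an existing dynamic file can be changed in place, e.g.:

CONFIGURE.FILE MyBigHash MINIMUM.MODULUS 1200007 SPLIT.LOAD 80 MERGE.LOAD 50

which should raise the pre-allocated group count without having to drop and recreate the file.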

Posted: Mon Sep 09, 2013 4:31 pm
by ray.wurlod
The Hashed File Calculator, which is on the installation media in the unsupported utilities, will help you to calculate the size that the hashed file needs to be.
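
As a rough back-of-the-envelope check using the figures already posted (this is purely arithmetic, not the calculator's method): 2.8GB file size minus 1.7GB unused leaves roughly 1.1GB of actual data, and GROUP.SIZE 2 means 4096-byte groups, so at a target load of about 80%:

1.1GB / (4096 bytes x 0.80) ≈ 335,000 groups

which suggests the existing minimum modulus of 600,003 is not obviously too small on raw data volume alone.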

Posted: Tue Sep 10, 2013 4:23 am
by roy
Note that having a size of over 2GB means you must have created the file with a 64-bit setting (unless your entire instance was built 64-bit); this limit applies to each of your data and overflow files separately.
So eliminating the overflow might cause your data file to go over the limit and crash your job.
First I would try to eliminate columns that you don't need, to shrink the file size, or split it into 2 separate files, each containing half the data.

When your file size is within the limits, calculate how many groups you need to predefine when creating the files, so you get minimum overflow and, more importantly, avoid costly rearrangement of the hashed file while you are populating it.
Another word of advice: use the DELETE option for the hashed file each run, not CLEAR!
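
A minimal sketch of that delete-and-recreate approach, assuming an account-based hashed file with a VOC entry and using an illustrative file name with the sizing already mentioned in this thread (the commands can go in a before-job ExecTCL call or be run from the Administrator command line):

DELETE.FILE MyBigHash
CREATE.FILE MyBigHash DYNAMIC MINIMUM.MODULUS 600003 GROUP.SIZE 2 64BIT

That way the file starts each run freshly created at its intended size.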

IHTH,

Posted: Tue Sep 10, 2013 6:06 am
by epsidude
Thank you for the replies!

So the HF was defined as 64-bit to avoid any sizing issues. I did run the calculation I found posted in another thread, but I cannot find the calculator itself; I work at a university, and our DataStage came as part of PeopleSoft EPM and was installed before my time.

When I changed the HF to what the calculation recommended, the statistics showed everything to be worse, but there was no performance difference at all.

So last night the job in question ran in 17 minutes with no issue. I did make one change to the job.

In the first DRS stage there are 2 output links. Link 1 writes out 9.6 million rows to the HF. Link 2 writes out 900k rows that flow to a transformer. In this transformer there are 2 lookups to the 9.6million record HF. Then there is 1 link to a final DRS that inserts all the data.

My change was to add a sequential file on link 2 before the transformer, and then a sequential file before the insert DRS.

In my experience this usually solves the erratic run-time issue. I know it must be closing the first DRS connection prior to opening the final DRS connection, but I am still not sure why this solution works or what it could point to. Again, before my change we could go for a few days, even weeks, at normal run-times of 17 minutes, then suddenly get hit with run-times of 6-8 hours. :?:

Posted: Tue Sep 10, 2013 7:54 am
by roy
It seems you have a workaround :)
Marking topic as such...

Posted: Tue Sep 10, 2013 9:51 am
by chulett
Not really a "workaround" IMHO... all they've posted is a change they made. Won't know for some time if it has the desired effect or not.

Posted: Wed Sep 11, 2013 7:25 am
by roy
I agree Craig, but if all goes well I don't think anyone will come back and mark this post as having a workaround.

Posted: Wed Sep 11, 2013 8:31 am
by epsidude
Well, I have re-opened this topic because the change did not work. The job's run rate was very low and it would have taken hours to complete, so I killed it, recompiled, and reran, and it completed in 10 minutes.

However, this time I did check the monitor before killing it, and I did not see anything. When I checked the Resource Monitor I noticed that most of the disk usage had a priority of Low.

What I do not think I mentioned before is that we are running on a Windows 2008 R2 Virtual Machine.

When I reran I watched, and I saw that everything had a normal priority. In my research I am finding that if the VM does not have keyboard/mouse activity, things can get set to a low priority and runs can take much longer.

Has anyone had experience with this? Whenever I run the job manually all is good, but when I log off the VM, go home, and the process starts up overnight, it drags. Really the only effect is on jobs that have very large hashed files and lookups against them.

Posted: Wed Sep 11, 2013 1:00 pm
by epsidude
The systems folks just told me that we are running ESXi 5.1. I am not sure what this is, so I will start researching.