Jobs getting stuck
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Jobs getting stuck
Hi All,
My job design uses local containers and I am running 10 instances of the same job (after enabling "Allow multiple instances").
I don't understand this strange behaviour: why does the job get stuck while processing, and why are there so many page faults?
http://asit.agrawal.googlepages.com/Monitor.jpg
http://asit.agrawal.googlepages.com/TaskMgr.jpg
Thanks!
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Page faults are indicative of insufficient memory. When a memory page is required for processing but is not resident in memory, the access triggers a "page fault"; to make room, one or more of the least recently used pages are moved out to disk ("paged"). In an ideal world, there would be no page faults.
It's a classic economic problem - supply and demand. Either install more memory in the machine, or reduce demand for memory (run fewer things at the same time, for example fewer instances).
You can monitor your system to discern the heaviest users of memory, and run fewer of these at the same time.
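The paging behaviour described above can be illustrated with a small least-recently-used (LRU) page-replacement simulation. This is a hypothetical sketch to show why adding memory (more frames) reduces page faults for the same workload; it is not DataStage-specific.

```python
from collections import OrderedDict

def count_page_faults(accesses, frames):
    """Simulate LRU page replacement and count page faults.

    accesses: sequence of page numbers the workload touches
    frames:   number of physical page frames available (the "supply")
    """
    resident = OrderedDict()  # page -> None, ordered by recency of use
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: refresh recency
        else:
            faults += 1                       # miss: this is a page fault
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used page
            resident[page] = None
    return faults

# The same workload faults far less once more memory (frames) is supplied.
workload = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(count_page_faults(workload, frames=3))  # small memory: many faults
print(count_page_faults(workload, frames=5))  # more memory: fewer faults
```

Running ten job instances at once is the "demand" side: each instance's working set competes for the same frames, so faults climb even though each instance alone would fit.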
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Hi,
I have made a few more attempts to resolve the above problem.
This is the job design: Job Design
Input:
The input has 22 to 30 million rows of data; each row has some 66 fields.
Process:
The input row keys are matched against the hashed files; if a lookup is found, the row moves to the next stage and ultimately to the output.
Output:
All the rows that find a match in the hashed file.
Problem:
After reading approximately 14 million rows, the job shows no further increase in the number of rows read. Watching the Windows Task Manager, the page faults start increasing drastically as soon as the job crosses the 14 million mark, and processor usage hits 100% and never comes down.
The hardware has 16 processors and 12 GB RAM.
Changes tried:
1. Turning on inter-process row buffering with a 1024 KB buffer.
2. Using a Link Partitioner between INPUT_LINES and Xmr_Lines and a Link Collector between Xmr_Lines and Xmr_ifLenEqZero_Line. Xmr_Lines and the hashed file were made into a shared container, and each link between the Link Partitioner and the Link Collector was connected via a different instance of the shared container.
3. The same design (as in the image), adding an IPC stage between Xmr_Lines and Xmr_ifLenEqZero_Line.
In all the different designs mentioned above, the problem was exactly the same.
Please advise!
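The lookup step described in this post is essentially a keyed hash lookup: keep a row only if its key exists in the reference set. A minimal sketch of that logic (the field names and sample data are placeholders, not the actual job's metadata):

```python
def lookup_rows(input_rows, hashed_file, key_fields):
    """Keep only the input rows whose key is present in the hashed file.

    input_rows:  list of dicts, one per row
    hashed_file: dict mapping a key tuple to the reference row
    key_fields:  field names that make up the lookup key
    """
    matched = []
    for row in input_rows:
        key = tuple(row[f] for f in key_fields)
        if key in hashed_file:     # lookup found: row moves on to the output
            matched.append(row)
    return matched

# Hypothetical example: the dict stands in for the hashed file.
reference = {("A", 1): {"desc": "known"}}
rows = [{"code": "A", "seq": 1, "val": 10},
        {"code": "B", "seq": 2, "val": 20}]
print(lookup_rows(rows, reference, ["code", "seq"]))  # only the ("A", 1) row
```

The point of this shape is that the whole reference set must sit in memory for the lookup to be fast, which is why memory pressure grows with hashed-file size and instance count.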
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Are your server and client on the same machine?
What version of Windows are you running? Each type of file system has its own maximum file size limit; check whether you are hitting one of those.
Also, could you confirm that the same job, when processing a smaller number of rows, completes without any issues?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Yes, my server and client are on the same machine.
I am using Windows 2003 Server EE.
The job runs successfully for 13 million rows.
I also attempted to run it on a different input set of 30 million rows, expecting it to crash, but it ran successfully.
But if I repeat the test with the so-called problem input set, the problems start appearing: a sudden increase in page faults after 14 million rows, no change in the number of rows processed thereafter, and the job never coming to an end.
In this case, does your Transformer stage do any complex calculation, like comparing against a previous set of records? The set of records you have might require more memory.
Can you break the problematic file into two, perhaps at the 14 million row mark, and run the job twice?
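Splitting a delimited input file at a fixed row count, as suggested, could be scripted outside DataStage. A sketch assuming a plain-text file with one record per line (the paths and split point are placeholders):

```python
def split_file(path, split_at, out_a, out_b):
    """Write the first `split_at` lines of `path` to `out_a` and the rest to `out_b`.

    Streams line by line, so memory use stays constant even for huge files.
    """
    with open(path) as src, open(out_a, "w") as a, open(out_b, "w") as b:
        for i, line in enumerate(src):
            (a if i < split_at else b).write(line)

# Hypothetical usage for this thread's case:
# split_file("problem_input.txt", 14_000_000, "part1.txt", "part2.txt")
```

Running the job once per part would show whether the problem follows the row count (a resource limit) or a specific part of the data (bad records around the 14 million mark).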
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
No.
I'll tell you what each Xmr is doing:
Xmr_Lines: matches the 6 key fields.
Xmr_ifLenEqZero: checks whether any incoming field (there are 67 in total) has length 0; if so, it assigns a default value.
Xmr_CSFormatLines: forms 5 groups, each containing some 13-14 of the 67 fields, to produce just 5 fields in the output.
There are no complex calculations involved.
I even removed the last two Xmrs and ran the job with just the first Xmr_Lines; it still gives the same problem!
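The length-zero defaulting described for Xmr_ifLenEqZero amounts to a simple per-row rule, sketched below. The field names and the default value are placeholders, not the actual job's metadata:

```python
def default_empty_fields(row, default="UNKNOWN"):
    """Return a copy of the row with every zero-length field replaced by a default.

    row: dict mapping field name to string value (stands in for one input row).
    """
    return {name: (value if len(value) > 0 else default)
            for name, value in row.items()}

print(default_empty_fields({"f1": "abc", "f2": ""}))
# f2 gets the default value; f1 is untouched
```

As the poster says, this is a cheap per-row check with no state carried between rows, so it should not by itself cause unbounded memory growth.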
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Ok, I agree that the 3 Xmrs may be clubbed into 1. kumar_s wrote: As you said, all this functionality can be combined in a single Transformer. Did you try to break the file and run the job? ...
Unfortunately, due to the functional requirement of the job, I cannot split this input file. We are also holding brainstorming sessions to identify a workaround.
Also, to my wonder, the same job has run for 30 million rows, but that was a different day's input data. Since this is live data, I at least need to identify the cause of the problem in this particular input set.
What is the size of the input file?
Try a few more tests:
1. Read the input file with the specified metadata and write it directly to a file. This will show whether the issue is data that does not comply with the metadata, which DataStage may try to convert implicitly.
2. Read the file, do the lookup, and write to another file without any conditional check.
3. Run with the full set of conditions.
This way you can nail down the problem.
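Test 1 above (read with the expected metadata and write straight out) can also be approximated outside DataStage to spot rows that break the metadata, for example rows with the wrong field count. A sketch, where the 67-field layout comes from this thread but the delimiter and file path are assumptions:

```python
def find_bad_rows(path, expected_fields=67, sep="|"):
    """Yield (line_number, field_count) for rows that don't match the metadata.

    Streams the file, so it works on inputs of tens of millions of rows.
    """
    with open(path) as src:
        for lineno, line in enumerate(src, start=1):
            n = len(line.rstrip("\n").split(sep))
            if n != expected_fields:
                yield lineno, n

# Hypothetical usage:
# for lineno, n in find_bad_rows("problem_input.txt"):
#     print(f"line {lineno}: {n} fields")
```

If the first bad row turns up near the 14 million mark, that would point at malformed data rather than a memory limit.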
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Hey Kumar,
I tried option 1, i.e. Seq_File_Stage ----> Xmr ----> Seq_File_Stage.
In this too, as soon as the number of rows being read crosses some 14 million, the problem appears: the corresponding processor starts running at its peak and the delta page faults increase rapidly!