Jobs getting stuck

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Jobs getting stuck

Post by asitagrawal »

Hi All,

My job design uses local containers, and I am running 10 instances of the same job (after marking it 'Allow multiple instances').

I don't understand this strange behaviour: why does the job get stuck while processing, and why are there so many page faults?


http://asit.agrawal.googlepages.com/Monitor.jpg

http://asit.agrawal.googlepages.com/TaskMgr.jpg


Thanks !!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Page faults are indicative of insufficient memory. When a memory page is required for processing but is not in physical memory, it has to be brought in from disk; if memory is full, one or more of the least recently used pages are paged out to disk to make room. The reference to a page that is not in memory is called a "page fault". In an ideal world, there would be no page faults.

It's a classic economic problem - supply and demand. Either install more memory in the machine, or reduce demand for memory (run fewer things at the same time, for example fewer instances).

You can monitor your system to discern the heaviest users of memory, and run fewer of these at the same time.
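As an aside, a minimal sketch of that kind of monitoring (not DataStage-specific, and assuming the third-party psutil package is installed) could list the heaviest memory consumers per process; the num_page_faults field is Windows-only, so it is read defensively here.

Code:

import psutil

def top_memory_users(count=10):
    # Collect resident memory (and, on Windows, cumulative page faults) per process.
    rows = []
    for proc in psutil.process_iter(['pid', 'name']):
        try:
            mem = proc.memory_info()
            faults = getattr(mem, 'num_page_faults', None)  # Windows-only field
            rows.append((mem.rss, faults, proc.info['pid'], proc.info['name']))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    rows.sort(key=lambda r: r[0], reverse=True)
    for rss, faults, pid, name in rows[:count]:
        print(f"{name} (pid {pid}): {rss / 2**20:.1f} MB resident, page faults: {faults}")

if __name__ == "__main__":
    top_memory_users()

Running this a few times while the jobs execute makes it easy to see which processes are growing and which are paging the most.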
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

As noted, monitor the system again with only one or two jobs running in parallel.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As noted, not so much 'stuck' as 'very slow'. Run fewer jobs at a time. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi,

I have made a few more attempts to resolve the problem described above.

This is the job design: Job Design


Input:
The input has 22 to 30 million rows of data. Each row has some 66 fields.
Process:
The input row keys are looked up against the hashed files; if a match is found, the row moves on to the next stage and ultimately to the output.
Output:
All the rows that find a match in the hashed file.

Problem:

After reading approximately 14 million rows, the job no longer shows any increase in the number of rows being read. Watching the Windows Task Manager, the page faults start increasing drastically as soon as the job crosses the 14-million mark, and the processor usage goes to 100% and never comes down!

The hardware has 16 processors and 12 GB of RAM.

Changes tried:

1. Turning on inter-process row buffering with a 1024 KB buffer.
2. Using a Link Partitioner between INPUT_LINES and Xmr_Lines and a Link Collector between Xmr_Lines and Xmr_ifLenEqZero_Line. Xmr_Lines and the hashed file were made into a shared container, and each link between the Link Partitioner and the Link Collector was connected via a different instance of the shared container.
3. The same design (as in the image), with an IPC stage added between Xmr_Lines and Xmr_ifLenEqZero_Line.

In all of the designs mentioned above, the problem was exactly the same!

Please advise!
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Something more:

I also found that:
1) Each Xmr stage had a different PID.
2) The PID corresponding to Xmr_Lines was the one whose page faults went high.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Are your server and client on the same machine?
What version of Windows are you running? Each file system has its own maximum file size limit; check whether you are hitting any such limit.
Also, can you confirm that the same job runs without any issues when it processes a smaller number of rows?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Yes, my server and client are on the same machine.
I am using Windows 2003 Server EE.

The job runs successfully for 13 million rows.
I also tried running it on a different input set of 30 million rows, expecting it to crash, but it ran successfully.
However, if I repeat the test with the so-called problem input set, the problems start appearing: a sudden increase in page faults after 14 million rows, no further change in the number of rows processed, and the job never coming to an end!
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

In this case, does your Transformer stage do any complex calculation, such as comparing against a previous set of records, where the set of records being held might require more memory?
Can you break the problematic file into two, perhaps at the 14-million-row mark, and run the job twice?
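If splitting were an option, a rough sketch of breaking a delimited text file in two at the 14-million-row mark could look like this; the file names and the split point are illustrative assumptions, not values taken from the job.

Code:

# Split a large text file into two pieces at a given row count.
# INPUT_FILE, the output names and SPLIT_AT are assumptions for illustration.
INPUT_FILE = "problem_input.txt"
SPLIT_AT = 14_000_000

with open(INPUT_FILE, "r", encoding="utf-8", errors="replace") as src, \
     open("part1.txt", "w", encoding="utf-8") as part1, \
     open("part2.txt", "w", encoding="utf-8") as part2:
    for row_number, line in enumerate(src, start=1):
        (part1 if row_number <= SPLIT_AT else part2).write(line)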
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

No.
I'll explain what each Xmr is doing:

Xmr_Lines: Matches the 6 key fields.
Xmr_ifLenEqZero: Checks whether any incoming field (there are 67 fields in total) has length = 0; if so, it assigns a default value (sketched below).
Xmr_CSFormatLines: Forms 5 groups, each containing some 13 to 14 of the 67 fields, to produce just 5 fields in the output.
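For illustration only, the Xmr_ifLenEqZero rule amounts to something like the following Python sketch; in the job itself this is a Transformer derivation, and the delimiter, field count and default value here are assumptions.

Code:

# Replace any zero-length field in a delimited row with a default value.
# DELIMITER, FIELD_COUNT and DEFAULT_VALUE are assumptions, not job settings.
DELIMITER = "|"
FIELD_COUNT = 67
DEFAULT_VALUE = "UNKNOWN"

def apply_defaults(row):
    fields = row.rstrip("\n").split(DELIMITER)
    assert len(fields) == FIELD_COUNT, "unexpected field count"
    return DELIMITER.join(f if len(f) > 0 else DEFAULT_VALUE for f in fields)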

There are no complex calculations involved.

I even removed the last two Xmrs and ran the job with just the first Xmr_Lines; it still gives the same problem!
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

As you said, all of this functionality can be combined into a single Transformer. Did you try breaking the file and running the job?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

kumar_s wrote: As you said, all of this functionality can be combined into a single Transformer. Did you try breaking the file and running the job? ...
OK, I agree that the 3 Xmrs may be combined into 1.
Unfortunately, due to the functional requirements of the job, I cannot split this input file. We are also holding brainstorming sessions to identify a workaround.

Also, to my surprise, the same job has run for 30 million rows, but that was a different day's input data. Since this is live data, I at least need to identify the cause of the problem in this particular input set.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

What is the size of the input file?
Try a few more tests:
1. Read the input file with the specified metadata and write it directly to a file. This will confirm that the issue is not with data that doesn't comply with the metadata, which DataStage may try to convert implicitly (see the sketch after this list for a quick check outside DataStage).
2. Read the file, do the lookup, and write to another file without any conditional check.
3. Run with the full set of conditions.
This way you can nail down the problem.
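As a quick way of checking point 1 outside DataStage, a hedged sketch along these lines could flag rows whose field count does not match the expected metadata; the file name, delimiter and expected field count are assumptions for illustration.

Code:

# Flag rows that do not have the expected number of fields.
# INPUT_FILE, DELIMITER and EXPECTED_FIELDS are assumptions for illustration.
INPUT_FILE = "problem_input.txt"
DELIMITER = "|"
EXPECTED_FIELDS = 66

with open(INPUT_FILE, "r", encoding="utf-8", errors="replace") as src:
    for row_number, line in enumerate(src, start=1):
        field_count = len(line.rstrip("\n").split(DELIMITER))
        if field_count != EXPECTED_FIELDS:
            print(f"Row {row_number}: {field_count} fields (expected {EXPECTED_FIELDS})")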
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Dear Kumar,

I already mentioned in my previous post:
I even removed the last two Xmrs and ran the job with just the first Xmr_Lines; it still gives the same problem!
So that is in line with your suggestion #2.

I will attempt the remaining trials... and hope we pin down the problem!

Thx :)
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hey Kumar,

I tried to work on option 1, i.e.
Seq_File_Stage ----> Xmr ----> Seq_File_Stage
Here too, as soon as the number of rows read crosses some 14 million, the problem appears (i.e. the corresponding processor starts running at its peak and the delta page faults increase rapidly)!
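Since the stall is tied to this particular input set, one way to look at the data itself is to copy a small window of rows around the 14-million mark into a side file for manual inspection; the file name and the window bounds below are assumptions.

Code:

# Copy a window of rows around the 14-million mark to a side file for inspection.
# INPUT_FILE and the window bounds are assumptions for illustration.
INPUT_FILE = "problem_input.txt"
START_ROW = 13_999_000
END_ROW = 14_001_000

with open(INPUT_FILE, "r", encoding="utf-8", errors="replace") as src, \
     open("rows_around_14m.txt", "w", encoding="utf-8") as out:
    for row_number, line in enumerate(src, start=1):
        if START_ROW <= row_number <= END_ROW:
            out.write(f"{row_number}\t{line}")
        elif row_number > END_ROW:
            break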