Jobs getting stuck
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Jobs getting stuck
Hi All,
My job design uses local containers and I am running 10 instances of the same job (after enabling "Allow multiple instances").
I don't understand this strange behaviour: why does the job get stuck while processing, and why are there so many page faults?
http://asit.agrawal.googlepages.com/Monitor.jpg
http://asit.agrawal.googlepages.com/TaskMgr.jpg
Thanks!
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Page faults are indicative of insufficient memory. When a memory page is required for processing but is not resident in memory, the access triggers a "page fault"; to make room, one or more of the least recently used pages are moved out to disk ("paged"). In an ideal world, there would be no page faults.
It's a classic economic problem - supply and demand. Either install more memory in the machine, or reduce demand for memory (run fewer things at the same time, for example fewer instances).
You can monitor your system to discern the heaviest users of memory, and run fewer of these at the same time.
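The paging behaviour described above can be illustrated with a small least-recently-used (LRU) page-replacement simulation. This is a hypothetical sketch to show why adding memory (more frames) reduces page faults for the same workload; it is not DataStage-specific.

```python
from collections import OrderedDict

def count_page_faults(accesses, frames):
    """Simulate LRU page replacement and count page faults.

    accesses: sequence of page numbers the workload touches
    frames:   number of physical page frames available (the "supply")
    """
    resident = OrderedDict()  # page -> None, ordered by recency of use
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # hit: refresh recency
        else:
            faults += 1                       # miss: this is a page fault
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used page
            resident[page] = None
    return faults

# The same workload faults far less once more memory (frames) is supplied.
workload = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(count_page_faults(workload, frames=3))  # small memory: many faults
print(count_page_faults(workload, frames=5))  # more memory: fewer faults
```

Running ten job instances at once is the "demand" side: each instance's working set competes for the same frames, so faults climb even though each instance alone would fit.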
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Hi,
I have made a few more attempts to resolve the above problem.
This is the job design: Job Design
Input:
The input has 22 to 30 million rows of data; each row has some 66 fields.
Process:
The input row keys are matched against the hashed files; if a lookup is found, the row moves to the next stage and ultimately to the output.
Output:
All the rows that find a match in the hashed file.
Problem:
After reading approximately 14 million rows, the job shows no further increase in the number of rows read. Watching the Windows Task Manager, the page faults start increasing drastically as soon as the job crosses the 14 million mark, and processor usage hits 100% and never comes down.
The hardware has 16 processors and 12 GB RAM.
Changes tried:
1. Turning on inter-process row buffering with a 1024 KB buffer.
2. Using a Link Partitioner between INPUT_LINES and Xmr_Lines and a Link Collector between Xmr_Lines and Xmr_ifLenEqZero_Line. Xmr_Lines and the hashed file were made into a shared container, and each link between the Link Partitioner and the Link Collector was connected via a different instance of the shared container.
3. The same design (as in the image), adding an IPC stage between Xmr_Lines and Xmr_ifLenEqZero_Line.
In all the different designs mentioned above, the problem was exactly the same.
Please advise!
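The lookup step described in this post is essentially a keyed hash lookup: keep a row only if its key exists in the reference set. A minimal sketch of that logic (the field names and sample data are placeholders, not the actual job's metadata):

```python
def lookup_rows(input_rows, hashed_file, key_fields):
    """Keep only the input rows whose key is present in the hashed file.

    input_rows:  list of dicts, one per row
    hashed_file: dict mapping a key tuple to the reference row
    key_fields:  field names that make up the lookup key
    """
    matched = []
    for row in input_rows:
        key = tuple(row[f] for f in key_fields)
        if key in hashed_file:     # lookup found: row moves on to the output
            matched.append(row)
    return matched

# Hypothetical example: the dict stands in for the hashed file.
reference = {("A", 1): {"desc": "known"}}
rows = [{"code": "A", "seq": 1, "val": 10},
        {"code": "B", "seq": 2, "val": 20}]
print(lookup_rows(rows, reference, ["code", "seq"]))  # only the ("A", 1) row
```

The point of this shape is that the whole reference set must sit in memory for the lookup to be fast, which is why memory pressure grows with hashed-file size and instance count.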
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Are your server and client on the same machine?
What version of Windows are you running? Each type of file system has its own maximum file size limit; check whether you are hitting one of those.
Also, could you confirm that the same job, when processing a smaller number of rows, completes without any issues?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Yes, my server and client are on the same machine.
I am using Windows 2003 Server EE.
The job runs successfully for 13 million rows.
I also attempted to run it on a different input set of 30 million rows, expecting it to crash, but it ran successfully.
But if I repeat the test with the so-called problem input set, the problems start appearing: a sudden increase in page faults after 14 million rows, no change in the number of rows processed thereafter, and the job never coming to an end.
In this case, does your Transformer stage do any complex calculation, like comparing against a previous set of records? The set of records you have might require more memory.
Can you break the problematic file into two, perhaps at the 14 million row mark, and run the job twice?
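Splitting a delimited input file at a fixed row count, as suggested, could be scripted outside DataStage. A sketch assuming a plain-text file with one record per line (the paths and split point are placeholders):

```python
def split_file(path, split_at, out_a, out_b):
    """Write the first `split_at` lines of `path` to `out_a` and the rest to `out_b`.

    Streams line by line, so memory use stays constant even for huge files.
    """
    with open(path) as src, open(out_a, "w") as a, open(out_b, "w") as b:
        for i, line in enumerate(src):
            (a if i < split_at else b).write(line)

# Hypothetical usage for this thread's case:
# split_file("problem_input.txt", 14_000_000, "part1.txt", "part2.txt")
```

Running the job once per part would show whether the problem follows the row count (a resource limit) or a specific part of the data (bad records around the 14 million mark).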
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
No.
I'll tell you what each Xmr is doing:
Xmr_Lines: matches the 6 key fields.
Xmr_ifLenEqZero: checks whether any incoming field (there are 67 in total) has length 0; if so, it assigns a default value.
Xmr_CSFormatLines: forms 5 groups, each containing some 13-14 of the 67 fields, to produce just 5 fields in the output.
There are no complex calculations involved.
I even removed the last two Xmrs and ran the job with just the first Xmr_Lines; it still gives the same problem!
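The length-zero defaulting described for Xmr_ifLenEqZero amounts to a simple per-row rule, sketched below. The field names and the default value are placeholders, not the actual job's metadata:

```python
def default_empty_fields(row, default="UNKNOWN"):
    """Return a copy of the row with every zero-length field replaced by a default.

    row: dict mapping field name to string value (stands in for one input row).
    """
    return {name: (value if len(value) > 0 else default)
            for name, value in row.items()}

print(default_empty_fields({"f1": "abc", "f2": ""}))
# f2 gets the default value; f1 is untouched
```

As the poster says, this is a cheap per-row check with no state carried between rows, so it should not by itself cause unbounded memory growth.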
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Ok, I agree that the 3 Xmrs may be clubbed into 1. kumar_s wrote: As you said, all this functionality can be combined in a single Transformer. Did you try to break the file and run the job? ...
Unfortunately, due to the functional requirement of the job, I cannot split this input file. We are also holding brainstorming sessions to identify a workaround.
Also, to my wonder, the same job has run for 30 million rows, but that was a different day's input data. Since this is live data, I at least need to identify the cause of the problem in this particular input set.
What is the size of the input file?
Try a few more tests:
1. Read the input file with the specified metadata and write it directly to a file. This will show whether the issue is data that does not comply with the metadata, which DataStage may try to convert implicitly.
2. Read the file, do the lookup, and write to another file without any conditional check.
3. Run with the full set of conditions.
This way you can nail down the problem.
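Test 1 above (read with the expected metadata and write straight out) can also be approximated outside DataStage to spot rows that break the metadata, for example rows with the wrong field count. A sketch, where the 67-field layout comes from this thread but the delimiter and file path are assumptions:

```python
def find_bad_rows(path, expected_fields=67, sep="|"):
    """Yield (line_number, field_count) for rows that don't match the metadata.

    Streams the file, so it works on inputs of tens of millions of rows.
    """
    with open(path) as src:
        for lineno, line in enumerate(src, start=1):
            n = len(line.rstrip("\n").split(sep))
            if n != expected_fields:
                yield lineno, n

# Hypothetical usage:
# for lineno, n in find_bad_rows("problem_input.txt"):
#     print(f"line {lineno}: {n} fields")
```

If the first bad row turns up near the 14 million mark, that would point at malformed data rather than a memory limit.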
-
- Premium Member
- Posts: 273
- Joined: Wed Oct 18, 2006 12:20 pm
- Location: Porto
Hey Kumar,
I tried option 1, i.e. Seq_File_Stage ----> Xmr ----> Seq_File_Stage.
In this too, as soon as the number of rows being read crosses some 14 million, the problem appears: the corresponding processor starts running at its peak and the delta page faults increase rapidly!