[torqueusers] pbs_mom problem in rocks 5
Naveed Near-Ansari
naveed at caltech.edu
Fri Jul 31 17:59:00 MDT 2009
We have had trouble with two installations of Torque on Rocks 5 (both
5.1 and 5.2). Our installations on Rocks 4 do not exhibit the same
problem.
Essentially, the pbs_moms become unresponsive every couple of days.
We get this response from momctl:
ERROR: query[0] 'diag3' failed on compute-1-31 (errno=0-Success: 5-Input/output error)
The node seems to be running pbs_mom, and restarting it resolves the
issue.
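
For anyone who wants to spot affected nodes before jobs start bouncing, here is a rough Python sketch that scans momctl output for the error quoted above. The regular expression is based only on that one message, so the format on other Torque versions is an assumption. The names failed_hosts and diagnose are made up for illustration, and diagnose assumes momctl is on the PATH and that the 'diag3' query corresponds to "momctl -d 3 -h <host>".

```python
import re
import subprocess

# Pattern derived from the single error message quoted in this post;
# the output format of other Torque versions is an assumption.
FAILED_QUERY = re.compile(
    r"ERROR:\s+query\[\d+\]\s+'([^']+)'\s+failed on\s+(\S+)"
)


def failed_hosts(momctl_output):
    """Return the sorted host names whose momctl query reported a failure."""
    return sorted({m.group(2) for m in FAILED_QUERY.finditer(momctl_output)})


def diagnose(host, timeout=30):
    """Run 'momctl -d 3 -h <host>' and return stdout plus stderr.

    Hypothetical helper: assumes momctl is installed and on the PATH.
    """
    result = subprocess.run(
        ["momctl", "-d", "3", "-h", host],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    sample = (
        "ERROR: query[0] 'diag3' failed on compute-1-31 "
        "(errno=0-Success: 5-Input/output error)"
    )
    print(failed_hosts(sample))
```

Feeding the output of diagnose for each compute node into failed_hosts would give a list of nodes whose pbs_mom likely needs a restart. This only detects the symptom and does not explain why the mom stops responding.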
Submitted jobs naturally get rejected:
Job: 17809.hostname.caltech.edu
07/31/2009 15:09:51 S enqueuing into default, state 2 hop 1
07/31/2009 15:09:51 S Job Queued at request of username at hostname.caltech.edu, owner = username at hostname.caltech.edu, job name = Job158Task1, queue = default
07/31/2009 15:09:51 S Job Modified at request of username at hostname.caltech.edu
07/31/2009 15:09:51 A queue=default
07/31/2009 15:09:52 S Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:09:52 S Job Run at request of maui at hostname.caltech.edu
07/31/2009 15:09:52 S send of job to compute-1-31 failed error = 15008
07/31/2009 15:09:52 S unable to run job, MOM rejected/rc=1
07/31/2009 15:09:52 S Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:14:53 S Job deleted at request of username at hostname.caltech.edu
07/31/2009 15:14:53 S dequeuing from default, state EXITING
07/31/2009 15:14:53 A requestor=username at hostname.caltech.edu
After restarting the mom, the output from momctl looks perfectly
healthy.
Have you seen this behavior before? Anyone have a solution for it?
Naveed