[torqueusers] pbs_mom problem in rocks 5

Naveed Near-Ansari naveed at caltech.edu
Fri Jul 31 17:59:00 MDT 2009


We have had some trouble with two installations of Torque on Rocks 5
(both 5.1 and 5.2).  Our installations on Rocks 4 do not exhibit the
same problem.

Essentially, what happens is that the pbs_moms become unresponsive every
couple of days.  We get this response using momctl:


ERROR:    query[0] 'diag3' failed on compute-1-31 (errno=0-Success: 5-Input/output error)


The pbs_mom process still appears to be running on the node, and
restarting it resolves the issue.
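
A minimal sketch of how this check could be automated from the head
node. It assumes momctl is on the PATH and that "momctl -d 3 -h <host>"
is the query behind the 'diag3' error above. The function names are
hypothetical, and since it is not clear from the output above whether
the error is written to stdout or stderr, both are scanned.

```python
import subprocess


def mom_looks_unresponsive(momctl_output: str) -> bool:
    """Return True if the output contains a failed-query ERROR line."""
    for line in momctl_output.splitlines():
        # Matches lines such as:
        # ERROR:    query[0] 'diag3' failed on compute-1-31 (...)
        if line.startswith("ERROR:") and "failed on" in line:
            return True
    return False


def check_node(host: str, timeout: float = 30.0) -> bool:
    """Return True if the MOM on `host` answered the diagnostic query.

    Assumption: 'momctl -d 3 -h <host>' is the command that was run.
    A missing momctl binary or a timeout is treated as "not healthy".
    """
    try:
        result = subprocess.run(
            ["momctl", "-d", "3", "-h", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return not mom_looks_unresponsive(result.stdout + result.stderr)
```

Run periodically, check_node("compute-1-31") would return False for a
node in the state shown above. What to do next (for example, restarting
pbs_mom on that node) is left out, since that step is site-specific.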

Naturally, submitted jobs get rejected:

Job: 17809.hostname.caltech.edu

07/31/2009 15:09:51  S    enqueuing into default, state 2 hop 1
07/31/2009 15:09:51  S    Job Queued at request of username at hostname.caltech.edu, owner = username at hostname.caltech.edu, job name = Job158Task1, queue = default
07/31/2009 15:09:51  S    Job Modified at request of username at hostname.caltech.edu
07/31/2009 15:09:51  A    queue=default
07/31/2009 15:09:52  S    Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:09:52  S    Job Run at request of maui at hostname.caltech.edu
07/31/2009 15:09:52  S    send of job to compute-1-31 failed error = 15008
07/31/2009 15:09:52  S    unable to run job, MOM rejected/rc=1
07/31/2009 15:09:52  S    Job Modified at request of maui at hostname.caltech.edu
07/31/2009 15:14:53  S    Job deleted at request of username at hostname.caltech.edu
07/31/2009 15:14:53  S    dequeuing from default, state EXITING
07/31/2009 15:14:53  A    requestor=username at hostname.caltech.edu
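
To find which nodes are rejecting jobs without reading every trace by
hand, a log like the one above can be scanned for the "send of job to
... failed" entries. The following is only a sketch: the function names
are hypothetical, and the entry format is taken from this one example
rather than from any specification. Continuation lines (anything that
does not begin with a timestamp) are joined to the preceding entry, so
wrapped output is handled as well.

```python
import re

# An entry starts with a timestamp such as "07/31/2009 15:09:52".
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\s")
SEND_FAILED = re.compile(r"send of job to (\S+) failed error = (\d+)")


def join_wrapped(lines):
    """Merge wrapped continuation lines into the entry they belong to."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def failed_sends(text):
    """Return (timestamp, node, error_code) for each failed job send."""
    results = []
    for entry in join_wrapped(text.splitlines()):
        match = SEND_FAILED.search(entry)
        if match:
            timestamp = entry[:19]  # "MM/DD/YYYY HH:MM:SS"
            results.append((timestamp, match.group(1), int(match.group(2))))
    return results
```

Applied to the trace above, failed_sends would report a single failed
send to compute-1-31 with error code 15008.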


After restarting the MOM, the output from momctl looks perfectly
healthy.


Has anyone seen this behavior before?  Does anyone have a solution for it?

Naveed


