Nicholas Geraedts ngeraedts at gmail.com
Fri Apr 4 18:51:59 MDT 2008

More troubles and headaches with this setup... I'm having trouble tracking
down two issues that have been keeping our scheduler from working properly.

First off - our Maui connectivity to the Torque scheduler is intermittent.
Sometimes showq works instantly; other times it either takes a long time or
times out with the following error:

ERROR:    lost connection to server
ERROR:    cannot request service (status)

I have tried to figure out where this problem lies, but without success. I
have checked that the time on all the nodes is the same (to within a few
seconds). While this is happening, both pbs_server and maui are running.
pbs_server seems to be working normally, since Torque-related commands such
as qstat work as expected.
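
To put numbers on "intermittent", something like the following could be run
in a loop against showq. This is only a rough sketch: the helper names
(probe, summarize) and the 30-second timeout are made up for illustration,
and it assumes showq is on the PATH of whoever runs it.

```python
import subprocess
import time


def probe(cmd, timeout_s=30.0):
    """Run cmd once and return (outcome, seconds elapsed).

    outcome is one of "ok", "error" (non-zero exit), "timeout",
    or "missing" (command not found).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        outcome = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        outcome = "timeout"
    except FileNotFoundError:
        outcome = "missing"
    return outcome, time.monotonic() - start


def summarize(samples):
    """Count outcomes and find the slowest run in a list of probe() results."""
    counts = {}
    for outcome, _ in samples:
        counts[outcome] = counts.get(outcome, 0) + 1
    slowest = max((elapsed for _, elapsed in samples), default=0.0)
    return counts, slowest
```

For example, collecting samples = [probe(["showq"]) for _ in range(20)] and
then calling summarize(samples) would show how many of the 20 calls timed
out and how slow the worst one was. Running the same loop against qstat
would give a baseline for comparison.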

Secondly - jobs seem to be marked as Blocked instead of Idle, even though
there are sufficient resources available. Sometimes the reported reason is
insufficient resources (even though there are plenty of free compute
nodes); other times checkjob shows the following error:

Messages:  cannot start job - RM failure, rc: 15031, msg: 'Premature end of
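
To see which blocked jobs carry this failure and with which return code,
the message line could be picked apart with a small parser. This is a
sketch only: it assumes the line always has the shape quoted above, the
function name parse_rm_failure is made up, and the closing quote is treated
as optional because the message here is cut off.

```python
import re

# Matches the checkjob message format quoted above. The closing quote is
# optional because the quoted message is truncated.
RM_FAILURE = re.compile(r"RM failure, rc: (\d+), msg: '([^']*)'?")


def parse_rm_failure(line):
    """Return (rc, msg) from an 'RM failure' line, or None if no match."""
    match = RM_FAILURE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()
```

Feeding it each "Messages:" line from checkjob would give (rc, msg) pairs
that can be tallied per job.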

We have a number of jobs in the queue that have been running for a number of
weeks, so clearing the queue isn't really an option.

I had thought about closing the queue to prevent any new submissions, and
then waiting for all the current jobs to finish. Once that was done, we
could clean out the current installs of Maui and Torque and start fresh.
We'd be looking at quite a bit of downtime in the meantime though, so any
other solutions would be preferable.

-Nicholas Geraedts
