[torqueusers] Jobs Jumping From Queued to Run to Queued???

Ben Turner ben at dayborogeo.com
Mon Apr 5 17:27:37 MDT 2010


I have a cluster with five nodes, all configured the same. The queueing
system is working fine on four of them, but on the fifth, jobs get queued,
start to run, and then immediately jump back to the queued state.

I have looked through all the logs, and the only clue I can find is this
entry in the mom_log on the compute node that fails:

pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in do_rpp, cannot get protocol End of File

The server log shows:

PBS_Server;Svr;WARNING;ALERT:unable to contact node merlion05.qgeo.com
PBS_Server;Job;14754.merlion00.qgeo.com;unable to run job, MOM rejected/rc=2
PBS_Server;Req;req_reject;Reject reply code 15041(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type-RunJob from Scheduler at merlion00.qgeo.com
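
For reference, here is a minimal Python sketch that pulls the key details
out of the server log entries above. It assumes each record is
semicolon-separated with the free-text message in the last field, as in
the excerpt. The function name summarize and the dictionary keys are
illustrative only and are not part of TORQUE.

```python
import re

# Server log entries quoted above, with wrapped lines joined.
SERVER_LOG = [
    "PBS_Server;Svr;WARNING;ALERT:unable to contact node merlion05.qgeo.com",
    "PBS_Server;Job;14754.merlion00.qgeo.com;unable to run job, MOM rejected/rc=2",
    "PBS_Server;Req;req_reject;Reject reply code 15041(Execution server rejected "
    "request MSG=cannot send job to mom, state=PRERUN), aux=0, type-RunJob from "
    "Scheduler at merlion00.qgeo.com",
]

# Patterns for the details of interest; the key names are illustrative.
PATTERNS = {
    "node": re.compile(r"unable to contact node (\S+)"),
    "rc": re.compile(r"rc=(-?\d+)"),
    "reject_code": re.compile(r"Reject reply code (\d+)"),
    "job_state": re.compile(r"state=(\w+)"),
}


def summarize(line):
    """Split one record into fields and extract any known details."""
    # Assumption: at most three ';' separators precede the message text.
    fields = [f.strip() for f in line.split(";", 3)]
    info = {"daemon": fields[0], "message": fields[-1]}
    for key, pattern in PATTERNS.items():
        match = pattern.search(fields[-1])
        if match:
            info[key] = match.group(1)  # values are kept as strings
    return info


if __name__ == "__main__":
    for entry in SERVER_LOG:
        print(summarize(entry))
```

Collecting the node name, return code, reject code and job state in one
place makes it easier to compare a failing run against the same entries
from a node that works.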

However, pbsnodes -a reports that the dodgy node, merlion05, is OK.

Does anybody have any insight into this problem? I have been bashing my head
against a wall and have no idea where to go next.

