[torqueusers] Cluster Node changing to state "down,job-exclusive"

David McGiven david.mcgiven at fusemail.com
Fri Feb 17 05:33:05 MST 2006


Dear Torque Users,

I have a queue with only one node. I use this node to run one specific
kind of job.

Whenever I submit a job, it enters the queue, Maui tells the pbs_mom to
start it, and it starts running.

First the node is marked as "free", then as "job-exclusive". I ssh to
the node and see the process running, using 99% CPU.

This is the expected behaviour. Then things get weird.

After a few seconds or minutes, the state changes to
"down,job-exclusive". I ssh to the node and see that the process is STILL
running, using 99% CPU. The node is OK, pbs_mom is running, and I can
contact it with momctl from the central server.

Then I issue:
bash# qdel 407
qdel: Server could not connect to MOM 407.server

I cannot delete the job. The only workaround I have found is:

bash# qmgr
Max open servers: 4
Qmgr: set node nodo18 state -= down

And then:
bash# qdel 407

This works.
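
For reference, the two-step workaround above could be scripted roughly as
follows. This is only a sketch: it assumes qmgr's -c option for running a
single directive non-interactively, and the function name and the RUN
variable are made up for illustration (they are not part of Torque).

```shell
#!/bin/sh
# Sketch of the workaround described above: clear the "down" flag on the
# node, then delete the job. clear_down_and_delete and RUN are hypothetical
# names. Set RUN=echo to print the commands instead of executing them.
clear_down_and_delete() {
    node=$1
    jobid=$2
    # Same directive as typed at the Qmgr: prompt, passed via -c.
    ${RUN:-} qmgr -c "set node $node state -= down" || return 1
    # With the "down" flag cleared, the server can reach the MOM again.
    ${RUN:-} qdel "$jobid"
}

# Example (dry run): RUN=echo clear_down_and_delete nodo18 407
```

This only automates the manual steps; it does not explain why the node is
being marked "down" in the first place.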

Does anybody know why Torque is behaving like this? Which log files or
tools should I check? (Checking /var/spool/PBS/mom_logs/logfile didn't
help me diagnose the problem.)

Thank you very much in advance.

Regards,

David McGiven
CTO

