[torqueusers] Cluster Node changing to state "down,job-exclusive"
David McGiven
david.mcgiven at fusemail.com
Fri Feb 17 05:33:05 MST 2006
Dear Torque Users,
I have a queue with only one node. I use this node to run a specific kind
of job.
Whenever I submit a job, it enters the queue, Maui tells pbs_mom to
start it, and it starts running.
First the node is marked as "free", then it is marked as "job-exclusive". I
ssh to the node and see the process running, taking 99% CPU.
This should be the normal behaviour. Then the weird things start.
After a few seconds or minutes, the state changes to "down,
job-exclusive". I ssh to the node and see that the process is STILL running,
taking 99% CPU. The node is fine, pbs_mom is running, and I can contact
it with momctl from the central server.
Then I issue:
bash# qdel 407
qdel: Server could not connect to MOM 407.server
I cannot delete the job. The only workaround I have found is:
bash# qmgr
Max open servers: 4
Qmgr: set node nodo18 state -= down
And then:
bash# qdel 407
(this time it works)
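For reference, the two manual steps above could be wrapped in a small shell function. This is only a sketch of the workaround, not a fix for the underlying problem. The function name clear_down_and_qdel and the DRY_RUN variable are made up for illustration; the qmgr directive is the same one typed at the Qmgr: prompt above, passed non-interactively with qmgr -c.

```shell
#!/bin/bash
# Sketch of the manual workaround: clear the "down" flag on a node,
# then delete the stuck job. clear_down_and_qdel and DRY_RUN are
# hypothetical names introduced for this example.

clear_down_and_qdel() {
    local node="$1"   # node name, e.g. nodo18
    local job="$2"    # job id, e.g. 407
    local run=""

    # With DRY_RUN=1 the commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # Same directive as typed at the Qmgr: prompt, passed with -c.
    $run qmgr -c "set node $node state -= down" || return 1

    # Once the node is no longer marked down, the server can reach the MOM.
    $run qdel "$job"
}
```

Running it with DRY_RUN=1 and the arguments nodo18 407 only prints the qmgr and qdel commands, so they can be checked before being run for real.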
Does anybody know why Torque is behaving like this? Which
log files or tools should I check? (Checking
/var/spool/PBS/mom_logs/logfile didn't help me diagnose the problem.)
Thank you very much in advance.
Regards,
David McGiven
CTO