[torqueusers] Queued Jobs after power failure
giwrgis at chemistry.uoc.gr
Sat Mar 1 02:05:43 MST 2014
Due to a power failure, half of the cluster's nodes crashed and had to be
rebooted.
Now the jobs that were running on these nodes are in the queued state.
I've cycled the moms on all failed nodes using
momctl -C -h nodeX
and restarted the server (pbs_server) and the scheduler (pbs_sched), although
I doubt that all of this was necessary.
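
For reference, the cycling step over several nodes can be sketched as a small
shell loop. This is only an illustration: the node names (node1, node2, node3)
and the FAILED_NODES and DRY_RUN variables are placeholders made up for this
sketch, not the real setup. By default it only prints the momctl commands it
would run.

```shell
#!/bin/sh
# Hypothetical node names; replace with the nodes that actually failed.
FAILED_NODES="${FAILED_NODES:-node1 node2 node3}"

# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Set DRY_RUN=0 to actually contact the moms.
DRY_RUN="${DRY_RUN:-1}"

# Cycle the mom on every node given as an argument.
cycle_moms() {
    for node in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "momctl -C -h $node"
        else
            # Report a failure on stderr but keep going with the other nodes.
            momctl -C -h "$node" || echo "cycle failed on $node" >&2
        fi
    done
}

# Intentionally unquoted so the list splits into one argument per node.
cycle_moms $FAILED_NODES
```

Running it with DRY_RUN=0 would issue the same momctl -C -h call shown above
once per node.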
Afterwards, I tried to rerun the jobs using
qrerun $JOB_ID
where $JOB_ID is the ID of the queued job, and the output was
qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a bad
state
Then, while I was searching online for a solution and without taking any
further action, I noticed that the job had started successfully. Since then,
one more job has been starting every 10-11 minutes. This has been going on
for the last two hours, but the queue is huge.
Is there another way to force all these queued jobs to start immediately
instead of waiting for days?
The Torque version I am using is : 184.108.40.206