[torqueusers] Queued Jobs after power failure

Dimitrakakis Georgios giwrgis at chemistry.uoc.gr
Sat Mar 1 02:05:43 MST 2014


Due to a power failure, half of the cluster's nodes crashed and had to be
rebooted.

Now the jobs that were running on these nodes are in the queued state.

I've cycled the moms on all failed nodes using

momctl -C -h nodeX

and restarted the server (pbs_server) and scheduler (pbs_sched), although
I don't believe all of this was necessary.
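
The per-node cycling can be scripted. What follows is a minimal sketch,
assuming the failed node names are passed as arguments; cycle_moms and
DRY_RUN are hypothetical names used for illustration, not part of Torque.

```shell
# Hypothetical helper: cycle pbs_mom on every node given as an argument.
# With DRY_RUN=1 the commands are only printed, not executed.
cycle_moms() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "momctl -C -h $node"
        else
            momctl -C -h "$node"
        fi
    done
}

# Example (node names are placeholders):
#   DRY_RUN=1 cycle_moms node1 node2
```

Printing the commands first makes it easy to check the node list before
touching any mom.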

Afterwards, I tried to rerun the jobs using

qrerun $JOB_ID

where $JOB_ID is the ID of a queued job, and the output was

qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a bad
state $JOB_ID.nodeX
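
As far as I understand, qrerun is meant to send a running job back to the
queue, so it may be worth checking the job state before calling it. A
minimal sketch that extracts the job_state field from "qstat -f" output
read on stdin; the function name job_state is hypothetical.

```shell
# Hypothetical helper: print the value of the job_state attribute found
# in "qstat -f" output supplied on standard input.
job_state() {
    awk -F' = ' '/^[[:space:]]*job_state = / { print $2; exit }'
}

# Example:
#   qstat -f "$JOB_ID" | job_state
```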


While I was searching online for a solution, and without any further
action on my part, I noticed that the job had started successfully.
Since then, one more job has started every 10-11 minutes. This has been
going on for the last two hours, but the queue is huge.

Is there another way to force all these queued jobs to start immediately
instead of waiting for days?
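
One candidate is qrun, which asks the server to start a job directly
regardless of its scheduling position and requires operator or manager
privileges. A minimal sketch, assuming job IDs arrive one per line on
stdin (for example from "qselect -s Q"); force_start and DRY_RUN are
hypothetical names.

```shell
# Hypothetical helper: read job IDs from stdin, one per line, and ask the
# server to start each one. With DRY_RUN=1 the commands are only printed.
force_start() {
    while IFS= read -r job_id; do
        # Skip empty lines.
        [ -n "$job_id" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "qrun $job_id"
        else
            qrun "$job_id"
        fi
    done
}

# Example:
#   qselect -s Q | force_start
```

Because qrun does not wait for the scheduler, starting the whole queue
this way could oversubscribe the nodes, so a dry run and a small batch
first seem prudent.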

The Torque version I am using is 4.1.5.1.

Best,


G.

