[torqueusers] Queued Jobs after power failure

David Beer dbeer at adaptivecomputing.com
Mon Mar 3 10:43:51 MST 2014


qrerun is not a command for starting queued jobs. It takes jobs that are
currently running and places them back in the queue. I don't know why it is
named what it is named.

qrun is a command to make a job run immediately, but usually it's best to
let the scheduler run the jobs. I'm not sure why pbs_sched is taking such a
long time to run jobs for you. pbs_sched doesn't have much of a user group,
as most people use Maui for anything non-trivial where they aren't going to
pay for a scheduler. qrun'ing the jobs manually is probably a good
short-term solution, but switching to Maui might be a better long-term
solution.
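
As a rough illustration of qrun'ing the queued jobs by hand, here is a
minimal sketch. It assumes the default qstat listing, where the job state
is the second-to-last column; the function names queued_job_ids and
start_queued_jobs are made up for this example and are not part of Torque.
By default it only prints the qrun commands so the list can be checked
first.

```shell
#!/bin/sh
# Print the IDs of jobs in state Q, reading default `qstat` output on stdin.
# Assumption: the state is the second-to-last column of each job line.
queued_job_ids() {
    awk 'NF >= 6 && $(NF-1) == "Q" { print $1 }'
}

# Hypothetical helper: emit one qrun per queued job.
# With no argument it is a dry run and only prints the commands;
# pass "run" as the first argument to actually call qrun.
start_queued_jobs() {
    mode=${1:-dry}
    queued_job_ids | while read -r id; do
        if [ "$mode" = "run" ]; then
            qrun "$id"
        else
            echo "qrun $id"
        fi
    done
}

# Usage:
#   qstat | start_queued_jobs        # dry run, prints the commands
#   qstat | start_queued_jobs run    # starts the jobs
```

qrun needs operator or manager privileges on the server, and it starts
jobs without waiting for the scheduler, so it is worth reviewing the
dry-run output before passing "run".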

On Sat, Mar 1, 2014 at 2:05 AM, Dimitrakakis Georgios <
giwrgis at chemistry.uoc.gr> wrote:

> Due to a power failure half of the cluster's nodes crashed and had to be
> rebooted.
> Now the jobs that were running on these nodes are in a queued state.
> I've cycled the moms on all failed nodes using
> momctl -C -h nodeX
> and restarted the server (pbs_server) and scheduler (pbs_sched), although
> I didn't believe that all of these were necessary.
> Afterwards, I tried to rerun the jobs using
> qrerun $JOB_ID
> where $JOB_ID is the ID of the queued job and the output was
> qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a bad
> state $JOB_ID.nodeX
> Suddenly, while I was searching online for a solution and without any
> further action on my part, I noticed that the job had started successfully.
> Furthermore, every 10-11 minutes one more job starts. This has been
> happening for the last two hours, but the queue is huge.
> Is there another way to force all these queued jobs to start immediately
> instead of waiting for days?
> The Torque version I am using is :
> Best,
> G.

David Beer | Senior Software Engineer
Adaptive Computing