[torqueusers] Queued Jobs after power failure

David Beer dbeer at adaptivecomputing.com
Mon Mar 3 10:43:51 MST 2014


G.,

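qrerun is a command for jobs that are already running: it takes a running
job and places it back in the queue. That would explain the "Request
invalid for state of job" error you get on jobs that are already queued. I
don't know why it is named what it is named.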

qrun is a command to make a job run immediately, but usually it's best to
let the scheduler run the jobs. I'm not sure why pbs_sched is taking such a
long time to run jobs for you. pbs_sched doesn't have much of a user base,
since most people who need anything non-trivial and aren't going to pay for
a scheduler use Maui. qrun'ing the jobs manually is probably a good
short-term solution, but switching to Maui might be a better long-term
solution.
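
As a rough sketch of that short-term workaround, something like the
following could start the queued jobs by hand. It assumes you have manager
or operator rights on the server (qrun requires them) and that every job
qselect reports as queued is one you actually want started; the function
name run_queued_jobs is only a placeholder for illustration, not part of
Torque.

```shell
#!/bin/sh
# Start every queued job by hand with qrun (needs manager or operator rights).
# run_queued_jobs is a hypothetical helper name, not part of Torque.
run_queued_jobs() {
    # qselect -s Q prints the IDs of jobs in the queued (Q) state, one per line.
    qselect -s Q | while read -r job_id; do
        # Skip blank lines defensively.
        [ -n "$job_id" ] || continue
        # Keep qrun away from the loop's stdin; report failures and carry on.
        if qrun "$job_id" </dev/null; then
            echo "started $job_id"
        else
            echo "qrun failed for $job_id" >&2
        fi
    done
}
```

Calling run_queued_jobs once from a shell on the server host would walk
through the whole queue. Since this bypasses the scheduler entirely, it is
probably safer to try qrun on one or two jobs by hand first and check that
they land on healthy nodes.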


On Sat, Mar 1, 2014 at 2:05 AM, Dimitrakakis Georgios <
giwrgis at chemistry.uoc.gr> wrote:

> Due to a power failure half of the cluster's nodes crashed and had to be
> rebooted.
>
> Now the jobs that were running on these nodes are in a queued state.
>
> I've cycled the moms on all failed nodes using
>
> momctl -C -h nodeX
>
> and restarted the server (pbs_server) and scheduler (pbs_sched), although
> I didn't believe that all these were necessary.
>
> Afterwards, I've tried to rerun the jobs using
>
> qrerun $JOB_ID
>
> where $JOB_ID is the ID of the queued job and the output was
>
> qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a bad
> state $JOB_ID.nodeX
>
>
> Then, while I was searching online for a solution and without taking any
> further action, I noticed that the job had started successfully.
> Furthermore, one more job starts every 10-11 minutes. This has been
> happening for the last two hours, but the queue is huge.
>
> Is there another way to force all these queued jobs to start immediately
> instead of waiting for days?
>
> The Torque version I am using is 4.1.5.1.
>
> Best,
>
>
> G.



-- 
David Beer | Senior Software Engineer
Adaptive Computing