[torqueusers] Queued Jobs after power failure
giwrgis at chemistry.uoc.gr
Tue Mar 4 00:42:22 MST 2014
David, thanks for letting me know about the qrerun command.
We are planning to move to Maui soon, but unfortunately this isn't the case
yet! Anyway, I did what you've suggested and let the scheduler do its
> qrerun is a command for jobs that are already running: it places currently
> running jobs back in the queue. I don't know why it is named what it is named.
> qrun is a command to make a job run immediately, but usually it's best to
> let the scheduler run the jobs. I'm not sure why pbs_sched is taking such a
> long time to run jobs for you. pbs_sched doesn't have much of a user group,
> as most people use Maui for anything non-trivial where they aren't going to
> pay for a scheduler. qrun'ing the jobs manually is probably a good
> short-term solution, but switching to Maui might be a better long-term
> solution.
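
A minimal sketch of the "qrun the jobs manually" suggestion above, assuming Torque's qselect and qrun are on the PATH and that you have the operator or manager privilege qrun typically requires. The function name run_queued_jobs and the QSELECT and DRY_RUN variables are hypothetical, introduced only for this illustration. By default it only prints the qrun commands it would issue.

```shell
# Hypothetical helper (not part of Torque): issue a qrun for every queued job.
# Assumes `qselect -s Q` prints one queued job ID per line.
run_queued_jobs() {
    # QSELECT may be overridden (e.g. with a stub) to try this without a cluster.
    ${QSELECT:-qselect} -s Q | while read -r job_id; do
        # Skip blank lines.
        [ -n "$job_id" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show what would be run.
            echo "qrun $job_id"
        else
            # Ask the server to start the job now, bypassing the scheduler.
            qrun "$job_id"
        fi
    done
}
```

Calling run_queued_jobs with no settings lists the commands. Setting DRY_RUN=0 would actually start the jobs, which bypasses the scheduler's ordering and resource checks, so it is probably only reasonable as a stopgap.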
> On Sat, Mar 1, 2014 at 2:05 AM, Dimitrakakis Georgios <
> giwrgis at chemistry.uoc.gr> wrote:
>> Due to a power failure half of the cluster's nodes crashed and had to be
>> Now the jobs that were running on these nodes are in a queued state.
>> I've cycled the moms on all failed nodes using
>> momctl -C -h nodeX
>> and restarted the server (pbs_server) and scheduler (pbs_sched), although
>> I didn't believe that all of this was necessary.
>> Afterwards, I tried to rerun the jobs using
>> qrerun $JOB_ID
>> where $JOB_ID is the ID of the queued job and the output was
>> qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a
>> state $JOB_ID.nodeX
>> Suddenly, while I was searching online for a solution and without taking
>> any action, I noticed that the job had started successfully. Furthermore,
>> every 10-11 minutes one more job starts. This has been happening for the
>> last two hours, but the queue is huge.
>> Is there another way to force all these queued jobs to start immediately
>> instead of waiting for days?
>> The Torque version I am using is: 184.108.40.206
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>> torqueusers mailing list
>> torqueusers at supercluster.org
> David Beer | Senior Software Engineer
> Adaptive Computing