[torqueusers] Queued Jobs after power failure

Dimitrakakis Georgios giwrgis at chemistry.uoc.gr
Tue Mar 4 00:42:22 MST 2014


David, thanks for explaining the qrerun and qrun commands.

We are planning to move to Maui soon, but unfortunately that hasn't happened
yet! Anyway, I did what you suggested and let the scheduler do its job.
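
For anyone who finds this thread later, the manual qrun route that David
describes as a short-term option could be sketched roughly as below. This is
only a sketch, not something I ran on our cluster. It assumes Torque operator
or manager privileges, and the function name force_run_queued and the
QSELECT/QRUN override variables are made up here so the loop can be tried
without touching the live queue. Note that qrun bypasses the scheduler's
policies, so starting everything at once may oversubscribe the nodes.

```shell
#!/bin/sh
# Sketch: ask pbs_server to start every job that is currently queued.
# QSELECT and QRUN are hypothetical override variables (not part of Torque);
# they default to the real qselect and qrun commands.

force_run_queued() {
    # qselect -s Q prints the IDs of jobs in the queued (Q) state, one per line.
    "${QSELECT:-qselect}" -s Q | while read -r job_id; do
        # Skip blank lines defensively.
        [ -n "$job_id" ] || continue
        # qrun requests an immediate start for the job; stdin is redirected
        # so the command cannot consume the remaining job IDs from the pipe.
        "${QRUN:-qrun}" "$job_id" </dev/null \
            || echo "qrun failed for $job_id" >&2
    done
}
```

Setting QRUN=echo before calling force_run_queued only prints the job IDs
that would be started, which is a reasonable dry run before doing it for real.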

Best,

G.

> G.,
>
> qrerun is a command for jobs that are already running. It terminates a
> running job and places it back in the queue. I don't know why it is named
> what it is named.
>
> qrun is a command to make a job run immediately, but usually it's best to
> let the scheduler run the jobs. I'm not sure why pbs_sched is taking such a
> long time to run jobs for you. pbs_sched doesn't have much of a user group,
> as most people use Maui for anything non-trivial where they aren't going to
> pay for a scheduler. qrun'ing the jobs manually is probably a good
> short-term solution, but switching to Maui might be a better long-term
> solution.
>
>
> On Sat, Mar 1, 2014 at 2:05 AM, Dimitrakakis Georgios <
> giwrgis at chemistry.uoc.gr> wrote:
>
>> Due to a power failure, half of the cluster's nodes crashed and had to be
>> rebooted.
>>
>> Now the jobs that were running on these nodes are in the queued state.
>>
>> I've cycled the moms on all failed nodes using
>>
>> momctl -C -h nodeX
>>
>> and restarted the server (pbs_server) and scheduler (pbs_sched), although
>> I didn't believe that all of this was necessary.
>>
>> Afterwards, I tried to rerun the jobs using
>>
>> qrerun $JOB_ID
>>
>> where $JOB_ID is the ID of the queued job and the output was
>>
>> qrerun: Request invalid for state of job MSG=job $JOB_ID.nodeX is in a bad state $JOB_ID.nodeX
>>
>>
>> Then, while I was searching online for a solution and without taking any
>> further action, I noticed that the job had started successfully.
>> Furthermore, one more job starts every 10-11 minutes. This has been
>> happening for the last two hours, but the queue is huge.
>>
>> Is there another way to force all these queued jobs to start immediately
>> instead of waiting for days?
>>
>> The Torque version I am using is 4.1.5.1.
>>
>> Best,
>>
>>
>> G.
>>
>>
>> --
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing

