[Mauiusers] Reallocating jobs when a node crashes

Nathalia Garces Ferreira n.garces26 at uniandes.edu.co
Wed May 23 08:47:11 MDT 2012


Good day all,

I have installed Torque/Maui for an opportunistic grid based on gLite 3.2. I
have a problem when a node that is running a job crashes. After a few
moments PBS seems to realize that the node is “down” and that the job
can’t continue, but it doesn’t relaunch it. The job still appears as
Running on the “down” resource. I’ve tried to relaunch or reallocate it
manually, but the scheduler doesn’t let me. When I cancel it by force, the
job stays in either the “Deferred” or the “BatchHold” state, and neither the
releasehold nor the qrls command changes it.

I’ve tried to find a Torque or Maui parameter that relaunches a job
allocated on a “down” resource, but I have only found Moab parameters for
this.

Can you help me out? I would very much appreciate your help.

Nathalia
