[torqueusers] Force a job to rerun after mom has crashed

Mahmood Naderan nt_mahmood at yahoo.com
Wed Aug 24 13:04:33 MDT 2011


>But if the failure is because of a system crash you may want to start
>the mom with the -q option which will requeue jobs so they can be rerun.
 
One question regarding your reply....

Recently one of our nodes (a compute node running pbs_mom) crashed with a kernel panic. At the time it was running an MPI job. I didn't notice the crash at first because the job was still listed in the output of the showq command.

After a cold restart I started pbs_mom on the node, but I saw no CPU usage, which means the job didn't restart. The job ID was still present in the output of showq, however. So I manually deleted that job with qdel and submitted the MPI job again.

What I understand from your statement is this:
node is running pbs_mom => node crashes => node restarts => start pbs_mom -q => the job is requeued and you can see it running again.

Is my understanding correct?

// Naderan *Mahmood;


----- Original Message -----
From: Ken Nielson <knielson at adaptivecomputing.com>
To: Torque Users Mailing List <torqueusers at supercluster.org>
Cc: 
Sent: Wednesday, August 24, 2011 8:36 PM
Subject: Re: [torqueusers] Force a job to rerun after mom has crashed



----- Original Message -----
> From: "Mgr. Šimon Tóth" <toth at fi.muni.cz>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, August 24, 2011 9:27:45 AM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> > Is there any straightforward way to force a job to rerun on a
> > different node after its MOM has crashed?
> 
> This is a PBS Pro feature not supported in Torque.
> 
> But in Torque, when a node crashes, it doesn't really mean anything.
> Once the pbs_mom process is restarted, it will detect the jobs and
> reattach them.
> 
> --
> Mgr. Simon Toth

What Simon says is correct, but there are other options. Read the man page for pbs_mom, in particular the options -p, -P, -q and -r. What the mom can do on a restart depends on why the mom went down and for how long. For instance, if the mom crashes and restarts immediately, -p (the default in 2.4 and later) is probably what you want. But if the failure is because of a system crash, you may want to start the mom with the -q option, which will requeue jobs so they can be rerun.
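
As a rough illustration of the choice described above, here is a minimal shell sketch. The function name choose_mom_flag and the cause labels mom-crash and system-crash are made up for this example and are not part of Torque. Only the -p and -q options come from the discussion in this thread. The script prints the command rather than running it, so check the pbs_mom man page for your version before using either option.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): maps the reason the mom went
# down to the pbs_mom startup option suggested in this thread.
# The cause labels are assumptions made up for this sketch.
choose_mom_flag() {
    case "$1" in
        mom-crash)    echo "-p" ;;  # mom crashed and restarts immediately
        system-crash) echo "-q" ;;  # node went down: requeue jobs so they can be rerun
        *)
            echo "unknown cause: $1" >&2
            return 1
            ;;
    esac
}

# Dry run: print the command instead of executing it.
echo "pbs_mom $(choose_mom_flag system-crash)"
```

For the kernel panic case the sketch prints "pbs_mom -q". To actually start the daemon you would run the printed command as root on the compute node.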

Regards

Ken
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


