[torqueusers] Force a job to rerun after mom has crashed
Mahmood Naderan
nt_mahmood at yahoo.com
Wed Aug 24 13:04:33 MDT 2011
>But if the failure is because of a system crash you may want to start
>the mom with the -q option which will requeue jobs so they can be rerun.
One question regarding your reply....
Recently one of our compute nodes (one running pbs_mom) crashed with a kernel panic. It was running an MPI job at the time. I didn't notice the crash at first, because the job was still listed in the output of the showq command.
After a cold restart and starting pbs_mom on the node, I saw no CPU usage, which means the job had not restarted. However, the job ID was still present in the showq output, so I manually deleted the job with qdel and submitted the MPI job again.
What I understand from your statement is this:
node is running pbs_mom => crashed => restart => pbs_mom -q => you can see the job is running again.
Is my understanding correct?
// Naderan *Mahmood;
----- Original Message -----
From: Ken Nielson <knielson at adaptivecomputing.com>
To: Torque Users Mailing List <torqueusers at supercluster.org>
Cc:
Sent: Wednesday, August 24, 2011 8:36 PM
Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
----- Original Message -----
> From: "Mgr. Šimon Tóth" <toth at fi.muni.cz>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, August 24, 2011 9:27:45 AM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> > Is there any straightforward way to force a job to rerun on a
> > different node after its MOM has crashed?
>
> This is a PBS Pro feature not supported in Torque.
>
> But in Torque, when a node crashes, it doesn't really mean anything.
> Once the pbs_mom process is restarted, it will detect the jobs and
> reattach them.
>
> --
> Mgr. Simon Toth
> _______________________________________________
What Simon says is correct, but there are other options. Read the man page for pbs_mom, in particular the -p, -P, -q and -r options. What the mom can do on a restart depends on why the mom went down and for how long. For instance, if the mom crashes and restarts immediately, -p (the default in 2.4 and later) is probably what you want. But if the failure is because of a system crash, you may want to start the mom with the -q option, which will requeue jobs so they can be rerun.
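As a rough illustration of the choice described above, here is a minimal Python sketch that maps the reason for the outage to the pbs_mom flag mentioned in this thread. The function name, the failure labels and the lookup table are hypothetical and only cover -p and -q as described here; check the pbs_mom man page for your TORQUE version before relying on either flag.

```python
import shlex

# Flags as described in this thread only (assumption: your pbs_mom
# behaves the same way; confirm with `man pbs_mom`).
RESTART_FLAGS = {
    "mom_crash": "-p",     # mom died and restarts immediately: keep running jobs
    "system_crash": "-q",  # whole node went down: requeue jobs so they can be rerun
}


def mom_restart_command(failure, pbs_mom="pbs_mom"):
    """Return the argument list for restarting pbs_mom after `failure`.

    `failure` is one of the hypothetical labels in RESTART_FLAGS.
    """
    try:
        flag = RESTART_FLAGS[failure]
    except KeyError:
        raise ValueError(f"unknown failure type: {failure!r}")
    return [pbs_mom, flag]


if __name__ == "__main__":
    # Only prints the command line; it does not start pbs_mom.
    print(shlex.join(mom_restart_command("system_crash")))
```

The sketch only builds the command line, so it can be reviewed or logged before an administrator actually runs it on the node.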
Regards
Ken
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers