[torqueusers] Force a job to rerun after mom has crashed
scrusan at ur.rochester.edu
Wed Aug 24 16:32:21 MDT 2011
On Aug 24, 2011, at 6:03 PM, Ken Nielson wrote:
> ----- Original Message -----
>> From: "David Sheen" <sheen at usc.edu>
>> To: "Ken Nielson" <knielson at adaptivecomputing.com>
>> Cc: "Mahmood Naderan" <nt_mahmood at yahoo.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Sent: Wednesday, August 24, 2011 2:53:25 PM
>> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
>> The node has been taken offline by the administrator for testing.
> Not a good practice with MOMs running jobs. However, you can still run pbs_mom -q when it restarts. I am not sure whether the job will still be at the server, though. If it is not at the server, the job is lost.
What we usually do is set a reservation on the node that starts immediately, covers all of its resources, and lasts forever. Once the jobs already running on the node have finished, no more can start, and THEN you can take the node's pbs_mom offline.
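
The drain-then-stop sequence above can be sketched in shell. This is only an illustration under stated assumptions, not the procedure used at Rochester: the post does not name the scheduler or its reservation command, so the sketch swaps in Torque's own `pbsnodes -o` (mark the node offline) as the step that blocks new jobs. That is a different mechanism from a scheduler reservation. The function names, the node name `node01`, and the polling interval are hypothetical, and the check assumes `pbsnodes` prints a `jobs = ...` line for a node only while jobs are assigned to it.

```shell
#!/bin/sh
# Sketch only: drain a Torque node before stopping its pbs_mom.
# Function names and defaults here are illustrative, not part of Torque.

# Reads `pbsnodes <node>` output on stdin.
# Exits 0 if a "jobs = ..." attribute line is present, non-zero otherwise.
# Assumption: that line appears only while jobs are assigned to the node.
node_has_jobs() {
    grep -q '^[[:space:]]*jobs = '
}

# Mark the node offline so no new jobs start, then wait until it is empty.
# $1: node name (hypothetical, e.g. node01)
# $2: seconds between checks (defaults to 60)
drain_node() {
    node="$1"
    interval="${2:-60}"

    # Stand-in for the scheduler reservation described in the post.
    pbsnodes -o "$node" || return 1

    # Poll until the node record no longer lists any jobs.
    while pbsnodes "$node" | node_has_jobs; do
        sleep "$interval"
    done

    echo "$node is idle; its pbs_mom can now be stopped"
}
```

After sourcing the file, something like `drain_node node01 120` would block new work and return once the running jobs are gone. `pbsnodes -c node01` clears the offline flag again when testing is done.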
Center for Research Computing
University of Rochester