[torqueusers] jobs not clearing on crashed node

dbeer at adaptivecomputing.com
Fri Feb 1 10:17:33 MST 2013


Brian,

pbs_server doesn't consider a job completed until it receives the obit (the job's exit notice) from the mom. After a reboot, the mom should be started in a way that tells pbs_server the jobs are no longer running; that will clear the jobs.

If you have a diskless node, the mom won't know that it had jobs running before the reboot, so you'll need to run qdel -p on the jobs from that mom to clear them.
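
As a rough sketch of that cleanup (not an official tool), the Python below pulls the job IDs out of the "jobs = ..." line of pbsnodes output for one node and builds the matching qdel -p commands. The helper names, the sample text, and the node and server names are all hypothetical, and the "slot/jobid" layout of the jobs line is an assumption that should be checked against what pbsnodes prints on your version. It only builds the commands; nothing is run.

```python
# Sketch only: build "qdel -p" commands for jobs that pbsnodes still lists on a node.
# Assumes the jobs line looks like "jobs = 0/1234[1].headnode, 1/1234[2].headnode".

def job_ids_from_pbsnodes(text):
    """Return the unique job IDs found on 'jobs = ...' lines, in first-seen order."""
    job_ids = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or key.strip() != "jobs":
            continue  # skips "state = ...", "status = ...", and so on
        for entry in value.split(","):
            entry = entry.strip()
            if not entry:
                continue
            # Drop the leading "slot/" part if present, keeping only the job ID.
            job_id = entry.split("/", 1)[-1]
            if job_id not in job_ids:
                job_ids.append(job_id)
    return job_ids


def purge_commands(job_ids):
    """Return one argv list per job, suitable for subprocess.run()."""
    return [["qdel", "-p", job_id] for job_id in job_ids]


# Hypothetical pbsnodes output for a node that crashed with array sub-jobs on it.
SAMPLE = """\
node01
     state = job-exclusive
     np = 4
     jobs = 0/1234[1].headnode, 1/1234[2].headnode, 2/1234[1].headnode
"""

if __name__ == "__main__":
    for argv in purge_commands(job_ids_from_pbsnodes(SAMPLE)):
        print(argv)
```

Review the printed list before running anything. qdel -p removes the job from the server without waiting for the mom, so it is only appropriate once you are sure the job's processes are really gone, as they are after a reboot.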

David

On Feb 1, 2013, at 8:47 AM, "Andrus, Brian Contractor" <bdandrus at nps.edu> wrote:

> All,
> 
> Running torque 4.1.4 here (along with moab 7.2.0)
> Issue: a node crashes that had several elements of an array job running on it.
> It reboots and gets re-provisioned and comes back up.
> pbsnodes still claims there are several jobs running on it.
> If I run (on the node) pbs_mom purge, nothing changes.
> If I restart pbs_server (which I hate doing since it resets Time Used on running jobs), nothing changes.
> 
> Shouldn't the jobs automatically either get restarted or cleared if a node reboots? I'm pretty sure torque used to do that...
> 
> 
> 
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
> 
> 
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
