[torqueusers] jobs not clearing on crashed node
glen.beane at gmail.com
Fri Feb 1 12:29:37 MST 2013
On Feb 1, 2013, at 12:17 PM, dbeer at adaptivecomputing.com wrote:
> Brian,
>
> pbs_server doesn't consider a job completed until it gets the obit. After a restart, the mom should be started in a way that tells pbs_server that the jobs are no longer running, and this will clear up the jobs.
>
> If you have a diskless node then the mom won't know that it had jobs running before the reboot, so you'll need to run qdel -p on the jobs from that mom to clear them.

I think I fixed this a long time ago (in 2.x). I made it so that if the mom has no record of a job, pbs_server deletes the job.
If this does not happen now, it is a bug. It shouldn't require a qdel -p.

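If the stale jobs do have to be purged by hand in the meantime, something like the following might help collect the job IDs that pbsnodes still lists for the rebooted node. It is only a rough sketch: it assumes the "jobs = slot/jobid, slot/jobid" line format in pbsnodes output (other TORQUE versions may print it differently), the function names are made up for illustration, and it only builds the qdel -p command lines rather than running them.

```python
import shlex


def stale_job_ids(pbsnodes_text, node):
    """Return the job IDs listed under `node` in pbsnodes-style output.

    Assumes each node starts with an unindented hostname line, followed by
    indented "key = value" lines, one of which may be
    "jobs = 0/123.server, 1/124.server" (slot/jobid pairs).
    """
    ids = []
    current = None
    for line in pbsnodes_text.splitlines():
        # An unindented, non-empty line starts a new node record.
        if line and not line[0].isspace():
            current = line.strip()
            continue
        key, sep, value = line.strip().partition(" = ")
        if current == node and sep and key == "jobs":
            for entry in value.split(","):
                # Drop the "slot/" prefix; skip entries without one.
                job = entry.strip().partition("/")[2]
                if job and job not in ids:
                    ids.append(job)
    return ids


def purge_commands(job_ids):
    """Build (but do not run) one `qdel -p` command line per job ID."""
    # Array job IDs contain brackets, so quote them for the shell.
    return ["qdel -p " + shlex.quote(job) for job in job_ids]
```

Feed it the text printed by pbsnodes and the node's hostname, review the resulting command lines, and run them only after confirming the jobs really are gone from the node, since qdel -p removes the job from pbs_server without contacting the mom.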
>
> David
>
> On Feb 1, 2013, at 8:47 AM, "Andrus, Brian Contractor" <bdandrus at nps.edu> wrote:
>
>> All,
>>
>> Running torque 4.1.4 here (along with moab 7.2.0)
>> Issue: a node crashes that had several elements of an array job running on it.
>> It reboots and gets re-provisioned and comes back up.
>> pbsnodes still claims there are several jobs running on it.
>> If I run (on the node) pbs_mom purge, nothing changes.
>> If I restart pbs_server (which I hate doing since it resets Time Used on running jobs), nothing changes.
>>
>> Shouldn't the jobs automatically either get restarted or cleared if a node reboots? I'm pretty sure torque used to do that...
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers