[torqueusers] removal of "stray jobs"
dbeer at adaptivecomputing.com
Mon Dec 10 09:26:50 MST 2012
There is a fix for this in place that will be released with 4.1.4. I'm not
sure exactly how it happens, but we added some functionality that makes the
mom retry sending obits for jobs that are stuck in the exiting state on the
On Mon, Dec 10, 2012 at 2:28 AM, Lech Nieroda <nieroda.lech at uni-koeln.de> wrote:
> Dear list,
> we are currently running Torque 4.1.3 with Maui 3.3.1. The option
> "mom_job_sync" is on. However, we get "stray" jobs quite often - these
> are jobs that remain in an "EXITING" state for whatever reason and their
> <jobid>.JB files are often left lying around.
> Our workaround: at first we tried deleting the JB files and restarting
> the pbs_mom daemon, but it turns out that a simple "momctl -h <host> -c
> <jobid>" does the job as well. A script now runs daily from cron and
> removes such jobs.
> So, when the server discovers a "stray job", it has the means to send a
> "cleaning" command to the pbs_mom, but apparently it doesn't do so and we
> have to do it manually.
> Is there an option to fix that? Is it a bug?
> Lech Nieroda
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
David Beer | Senior Software Engineer
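
For anyone who needs the workaround until 4.1.4 is out, here is a rough sketch of what such a cron script might look like. It is not the script from the original post. It assumes that "qstat -f" prints "Job Id:", "job_state = E" and "exec_host = node/slot+node/slot" lines in the usual layout, and that a job in state E is really stray (a job that is merely slow to exit looks the same). The function names are made up for illustration; the only command taken from the thread is "momctl -h <host> -c <jobid>".

```python
import subprocess


def parse_exiting_jobs(qstat_full_output):
    """Return (jobid, host) pairs for every job whose job_state is E.

    Assumes the usual `qstat -f` layout; wrapped continuation lines
    of a long exec_host value are not handled in this sketch.
    """
    records = []
    current = None
    for raw in qstat_full_output.splitlines():
        line = raw.strip()
        if line.startswith("Job Id:"):
            current = {"id": line.split(":", 1)[1].strip(),
                       "state": None, "hosts": []}
            records.append(current)
        elif current is None:
            continue
        elif line.startswith("job_state ="):
            current["state"] = line.split("=", 1)[1].strip()
        elif line.startswith("exec_host ="):
            value = line.split("=", 1)[1].strip()
            hosts = []
            for slot in value.split("+"):
                host = slot.split("/", 1)[0]  # drop the "/<slot>" suffix
                if host and host not in hosts:
                    hosts.append(host)
            current["hosts"] = hosts
    return [(r["id"], h) for r in records
            if r["state"] == "E" for h in r["hosts"]]


def momctl_clear_command(jobid, host):
    """Build the command quoted in the thread: momctl -h <host> -c <jobid>."""
    return ["momctl", "-h", host, "-c", jobid]


def clean_stray_jobs(dry_run=True):
    """Print (and, if dry_run is False, run) one momctl call per pair."""
    output = subprocess.run(["qstat", "-f"], capture_output=True,
                            text=True, check=True).stdout
    for jobid, host in parse_exiting_jobs(output):
        cmd = momctl_clear_command(jobid, host)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)
```

Calling clean_stray_jobs() only prints the momctl commands it would issue; pass dry_run=False once the list looks right. Adding a check on how long a job has been exiting before clearing it would be a sensible extension, but the thread gives no threshold, so none is assumed here.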