[torqueusers] removal of "stray jobs"

David Beer dbeer at adaptivecomputing.com
Mon Dec 10 09:26:50 MST 2012


A fix for this is in place and will be released with 4.1.4. I'm not
sure exactly how jobs end up in this state, but we added functionality that
makes the mom retry sending obits for jobs that are stuck in the exiting
state on the mom.

David

On Mon, Dec 10, 2012 at 2:28 AM, Lech Nieroda <nieroda.lech at uni-koeln.de> wrote:

> Dear list,
>
> we are currently running Torque 4.1.3 with Maui 3.3.1. The option
> "mom_job_sync" is on. However, we get "stray" jobs quite often - these
> are jobs that remain in an "EXITING" state for whatever reason and their
> <jobid>.JB files are often left lying around.
>
> Our workaround: at first we tried deleting the JB files and restarting
> the pbs_mom daemon, but it turns out that a simple "momctl -h <host> -c
> <jobid>" does the job as well. A script now runs daily from cron and
> removes such jobs.
>
> So, when the server discovers a "stray job", it has the means to send a
> "cleaning" command to the pbs_mom but apparently doesn't do so, and we
> have to do it manually.
>
> Any option to fix that? Is it a bug?
>
> Regards,
> Lech Nieroda
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
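
The workaround described in the quoted message (running "momctl -h <host> -c
<jobid>" for each stray job) could be scripted roughly as follows. This is
only a sketch: the function names and the example host and job ID are
hypothetical, and the thread does not say how the list of stray jobs is
obtained, so that step is left to the caller. It defaults to a dry run that
only prints the commands.

```python
import subprocess


def clear_command(host, jobid):
    """Build the momctl invocation quoted in the thread for one stray job."""
    return ["momctl", "-h", host, "-c", jobid]


def clear_stray_jobs(stray, dry_run=True):
    """Issue one clear command per (host, jobid) pair.

    `stray` is an iterable of (host, jobid) tuples. How these are found is
    an assumption left to the caller; the thread does not describe it.
    With dry_run=True the commands are only printed, not executed.
    Returns the list of commands that were built.
    """
    commands = [clear_command(host, jobid) for host, jobid in stray]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: one failing node should not stop the remaining cleanups
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Hypothetical host and job ID, for illustration only.
    clear_stray_jobs([("node01", "12345.server.example.org")])
```

Such a script could be run daily from cron with dry_run=False once the
printed commands look right.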



-- 
David Beer | Senior Software Engineer
Adaptive Computing