[torqueusers] mom_job_sync not working???

Eva Hocks hocks at sdsc.edu
Fri Aug 2 14:10:25 MDT 2013



It looks like the MOM is not synchronized with the server.

pbsnodes still shows the jobs even though they are not in the
..mom_priv/jobs directory.

jobs = 0/236754.tscc-mgr.local, 2/236765.tscc-mgr.local, 6/236776.tscc-mgr.local
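
To make the mismatch easier to check, here is a minimal Python sketch that pulls the job IDs out of a pbsnodes "jobs = ..." line and reports the ones the MOM does not know about. The function names are made up for illustration, and the list of job IDs known to the MOM is passed in by hand, since how to read them from the mom_priv/jobs directory (file naming, suffixes) is not shown in this thread.

```python
def jobs_on_node(jobs_line):
    """Return the set of job IDs found in a pbsnodes 'jobs = ...' line."""
    # Everything after the first '=' is the comma-separated job list.
    _, _, value = jobs_line.partition("=")
    ids = set()
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Each entry looks like "<slot>/<jobid>"; keep only the job ID.
        _slot, sep, job_id = entry.partition("/")
        ids.add(job_id if sep else entry)
    return ids


def stale_jobs(jobs_line, mom_job_ids):
    """Job IDs the server lists for the node but the MOM does not have."""
    return sorted(jobs_on_node(jobs_line) - set(mom_job_ids))


line = ("jobs = 0/236754.tscc-mgr.local, 2/236765.tscc-mgr.local, "
        "6/236776.tscc-mgr.local")
# Hypothetical case: the MOM has no job files at all after the reinstall.
print(stale_jobs(line, []))
```

With an empty list for the MOM side, all three job IDs from the line above come back as stale.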


The MOM also rejected the server's requests at startup:

07/31/2013 13:29:15;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at tscc-mgr.local
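
For anyone who wants to scan the logs for these rejections, a rough Python sketch follows. It assumes each log line has six semicolon-separated fields, which is only inferred from the lines quoted in this thread; the field names (timestamp, event, daemon, obj_class, obj_name, message) are my own labels, not official ones.

```python
from collections import namedtuple

# Field names are illustrative labels, inferred from the quoted log lines.
LogRecord = namedtuple(
    "LogRecord", "timestamp event daemon obj_class obj_name message"
)


def parse_log_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def is_unknown_job_reject(record):
    """True for a req_reject entry carrying reply code 15001."""
    return record.obj_name == "req_reject" and "code=15001" in record.message


mom_line = (
    "07/31/2013 13:29:15;0080;   pbs_mom;Req;req_reject;"
    "Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, "
    "from PBS_Server at tscc-mgr.local"
)
record = parse_log_line(mom_line)
print(record.daemon, is_unknown_job_reject(record))
```

The same parser should also handle the server log lines quoted below, where the fifth field holds the job ID instead of the request handler name.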

Is this a Torque bug?

Thanks
Eva



On Fri, 2 Aug 2013, Eva Hocks wrote:

>
>
> I am running Torque 3.0.5 and have set mom_job_sync = True (which
> is the default since 2.0 anyway).
>
> Nevertheless, jobs in state E do not get removed after the node was
> reinstalled and obviously does not know about the jobs any more.
>
>
> from the server logs:
>
> 07/31/2013 11:24:10;0008;PBS_Server;Job;236751.tscc-mgr.local;Job deleted at request of gpratt at tscc-login2.local
> 07/31/2013 11:24:10;0008;PBS_Server;Job;236751.tscc-mgr.local;Job sent signal SIGTERM on delete
> 07/31/2013 11:24:18;0008;PBS_Server;Job;236751.tscc-mgr.local;Job sent signal SIGKILL on delete
> 07/31/2013 11:25:18;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job nanny, exiting job '236751.tscc-mgr.local' still exists, sending a SIGKILL
> 07/31/2013 11:29:24;0010;PBS_Server;Job;236751.tscc-mgr.local;Exit_status=271 resources_used.cput=00:01:02 resources_used.mem=78792kb resources_used.vmem=151808kb resources_used.walltime=00:11:01
> 07/31/2013 12:36:35;0080;PBS_Server;Job;236751.tscc-mgr.local;Request invalid for state of job EXITING
>
>
> and the job never gets cleaned out. From the manual: "If a job exists on
> a compute node in a pre-execution or corrupt state, it will be
> automatically cleaned up and purged."  mom_job_sync should send a purge
> request instead of a qdel.
>
>  Am I missing something?
>
> Thanks
> Eva
>
>


