[torqueusers] Torque 4.1.4: Running jobs discrepancy

David Beer dbeer at adaptivecomputing.com
Fri Jan 11 10:17:42 MST 2013


Joerg,

There are several possible reasons for this. For example, when jobs are
started on a node, the server immediately records them there, but the mom
only sends its status update every 45 seconds by default (some clusters
increase this interval). So it is entirely possible that, for a short
period, the server has not yet received an update from the mom since the
jobs were started.
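
To check how far the two views differ on a given node, the server-side
"jobs" attribute can be compared with the "jobs=" field inside the
mom-reported "status" string. Below is a minimal Python sketch of that
comparison. The function names are made up for illustration, and it assumes
the two attribute values have already been copied out of the pbsnodes output
as plain strings, that the status fields are comma-separated, and that
multiple job ids in the status "jobs=" field are separated by whitespace.

```python
def server_jobs(jobs_attr):
    """Job ids from the server-side 'jobs' attribute,
    e.g. '0/29938[34].cluster, 1/29938[36].cluster'."""
    ids = set()
    for entry in jobs_attr.split(","):
        entry = entry.strip()
        if entry:
            # Drop the leading 'slot/' prefix if present.
            ids.add(entry.split("/", 1)[1] if "/" in entry else entry)
    return ids


def mom_jobs(status_attr):
    """Job ids from the 'jobs=' field of the mom-reported status string.
    Assumes comma-separated fields and whitespace-separated job ids."""
    for field in status_attr.split(","):
        key, _, value = field.partition("=")
        if key.strip() == "jobs":
            return set(value.split())
    return set()


def unreported(jobs_attr, status_attr):
    """Jobs the server has on the node that the mom's last update lacks."""
    return sorted(server_jobs(jobs_attr) - mom_jobs(status_attr))


if __name__ == "__main__":
    # Values taken from the c-22 record quoted below (status shortened).
    jobs = ("0/29938[34].cluster, 1/29938[36].cluster, 2/29938[30].cluster, "
            "3/29938[38].cluster, 4/29938[2].cluster, 5/29938[3].cluster, "
            "6/29938[37].cluster, 7/29938[4].cluster")
    status = "rectime=1357906856,varattr=,jobs=29938[30].cluster,state=free"
    for job_id in unreported(jobs, status):
        print(job_id)
```

If the list of unreported jobs is still non-empty well after the mom's update
interval has passed, the delay described above is probably not the cause.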

If you don't think this is due to the above-described scenario, can you
provide some more details of what happens to get into this state? How long
does this state persist? Does it get cleaned up? Do you have messages about
rejected job obituaries in the server logs?

David

On Fri, Jan 11, 2013 at 5:31 AM, Joerg Blank <j.blank at fz-juelich.de> wrote:

> Hello everyone,
>
> I recently upgraded to Torque 4.1.4 and got the following problem. There
> seems to be a mix up regarding running jobs:
>
> Please note the discrepancy between "jobs" and "status/jobs"
>
> c-22
>      state = job-exclusive
>      np = 8
>      properties = barcelona,bigmem
>      ntype = cluster
>      jobs = 0/29938[34].cluster, 1/29938[36].cluster, 2/29938[30].cluster, 3/29938[38].cluster, 4/29938[2].cluster, 5/29938[3].cluster, 6/29938[37].cluster, 7/29938[4].cluster
>      status = rectime=1357906856,varattr=,jobs=29938[30].cluster,state=free,netload=360080362410,gres=,loadave=1.01,ncpus=8,physmem=66180608kb,availmem=128250632kb,totmem=133289468kb,idletime=411887,nusers=1,nsessions=1,sessions=9653,uname=Linux c-22 3.7.0.20121211 #4 SMP Tue Dec 11 20:52:45 CET 2012 x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
>
> Regards,
> Jörg Blank



-- 
David Beer | Senior Software Engineer
Adaptive Computing