[torquedev] torque jobs are stuck in server queue

krishna ramachandran ramach1776 at yahoo.com
Mon Jul 21 16:39:41 MDT 2008


I have two small, completely independent clusters (torque/moab), one with 2 nodes and one with 8 nodes. Every node is configured with np=8.

Once in a while, jobs are not dequeued from the torque server even though they completed successfully and the nodes sent their OBIT.

On the 2-node cluster, when jobs fail to dequeue, I consistently see this message in the server log (and in tracejob output):

07/19/2008 17:39:54  S    Reject reply code=15001(Unknown Job Id), aux=0,
                          type=JobObituary, from
                          pbs_mom at ac4-int2sav-004.adx.pool.ac4.yahoo.com


On the 8-node cluster we see this:

07/19/2008 17:42:55  S    Reject reply code=15052(unknown job id after clean
                          init), aux=0, type=JobObituary, from
                          pbs_mom at ac4-int2ctpmynacluster-012.adx.pool.ac4.yahoo.com
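To see how often each node triggers these rejections, entries like the two above could be tallied with a small script. This is only a minimal sketch: the regular expression is inferred from these two samples alone, the function names unwrap and count_rejects are made up for illustration, and it assumes wrapped continuation lines begin with whitespace, as in the tracejob output shown.

```python
import re
from collections import Counter

# Layout assumed from the two sample entries only; real logs may differ.
# The sender is matched as either "name at host" (as archived here) or
# "name@host".
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<text>[^)]*)\),\s*aux=\d+,\s*"
    r"type=(?P<type>\w+),\s*from\s+(?P<sender>\S+(?: at |@)\S+)"
)


def unwrap(text):
    """Join wrapped continuation lines so each log entry is one string."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[:1].isspace() and entries:
            # Indented line: continuation of the previous entry.
            entries[-1] += " " + line.strip()
        else:
            entries.append(line.strip())
    return entries


def count_rejects(text):
    """Count rejected replies per (reply code, sender) pair."""
    counts = Counter()
    for entry in unwrap(text):
        match = REJECT_RE.search(entry)
        if match:
            key = (int(match.group("code")), match.group("sender"))
            counts[key] += 1
    return counts
```

Feeding it the saved tracejob output or the server log text for one day would give a count per reply code and sending node, which may show whether the rejections are tied to particular nodes.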

We are running torque version 2.3.0-snap.200805071513 in a virtual environment.

Any suggestions on what may cause this?

Krishna
 



      