[torqueusers] All jobs hitting queue in "Q" state and staying that way (redux)

Jack Wilkinson jwilkinson at stoneeagle.com
Thu Nov 7 13:24:49 MST 2013

This is a follow-up to a message I submitted to the list a few months back.  Today I submitted a test run on our dev cluster to make sure changes made to the script would work.  I encountered an old problem...  everything went into and stayed in the "Q" state.

[root@srvDevHead01 ~]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3827.srvdevhead01          ...66130004.2100 eob_merge              0 Q batch
3828.srvdevhead01          ...66130004.2101 eob_merge              0 Q batch
3829.srvdevhead01          ...66130004.2102 eob_merge              0 Q batch
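
To put a number on how many jobs are stuck, the listing above can be piped through a small filter. This is only a sketch: the function name count_queued is hypothetical, and it assumes the default qstat layout shown above (a two-line header, with the state letter in the fifth column).

```shell
# Hypothetical helper: count jobs in the "Q" state from default qstat
# output read on stdin. Assumes the two-line header and the column order
# shown above (job id, name, user, time use, state, queue).
count_queued() {
    awk 'NR > 2 && $5 == "Q" { n++ } END { print n + 0 }'
}

# Usage on a host with TORQUE installed:
#   qstat | count_queued
```

A count that stays above zero across repeated runs while nodes are free would match the symptom described here.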

When I last brought this up, Brian Andrus suggested that I try...

[root@srvDevHead01 ~]# showq
ERROR:    lost connection to server
ERROR:    cannot request service (status)

This led me to try...

# service maui stop
Shutting down MAUI Scheduler: ERROR:    lost connection to server
ERROR:    cannot request service (status)
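
Since both commands fail with the same message, one way to catch this state early might be to scan showq output for that text. The following is a minimal sketch under stated assumptions: the function name scheduler_unreachable is hypothetical, showq is assumed to be on the PATH, and the error wording is assumed to match the output above.

```shell
# Hypothetical helper: exits 0 if the text on stdin contains the
# "lost connection to server" error shown above, non-zero otherwise.
scheduler_unreachable() {
    grep -q 'lost connection to server'
}

# Usage (assumes showq is installed and on the PATH):
#   if showq 2>&1 | scheduler_unreachable; then
#       echo "maui is not answering"
#   fi
```

Matching on the message text rather than the exit status is a deliberate choice in this sketch, because the exit status showq returns in this failure is not shown above.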

After some further playing around, I eventually rebooted the headbox, after which showq came back with the expected result and subsequent submissions worked correctly.  We've seen this in both our dev cluster and our production cluster.  I don't mind rebooting part or all of the dev cluster, but having the production cluster hang isn't a good thing.

I will offer that this -seems- to happen only after the cluster has been sitting idle for a while, a week or more.

Does anyone have ideas about what I should look into?


Jack Wilkinson, Programmer
Services | VPay(r)
P: 972.367-6622
jwilkinson at stoneeagle.com

111 W. Spring Valley Rd., #100
Richardson, TX 75081
