[torqueusers] intermittent qsub failures with 4.2.7
cartley at hotmail.com
Fri Mar 21 15:33:49 MDT 2014
Hello, last November I inquired here about intermittent qsub failures that we see several times a day on our cluster. We were using 4.1.6, and a reply here indicated that this was a known problem that should be fixed in the (then-forthcoming) 4.2.7 release.
Today I had a chance to build the new packages and apply them to all of the nodes as well as the server. (That went very well, by the way. I stopped the queues, let them drain out, refreshed and restarted everything, and the jobs started releasing again. Very nice!)
However, I still got a couple of these qsub failures in a batch of 700+ jobs.
Exit code = 196
Error: qsub: submit error (Invalid request MSG=cannot locate new job 1320429.h2 (0 - Success))
So, do others see this? Am I missing some other configuration detail?
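
Until the cause is found, a batch submission script could retry only when it sees this specific message. The following is a minimal sketch, not a fix and not part of TORQUE: the function name, the retry count, the delay, and the script name job.sh are all assumptions for illustration. Only the matched message text comes from the error above.

```python
import subprocess
import time

# Message fragment copied from the qsub error quoted above.
TRANSIENT_MARKER = "cannot locate new job"


def submit_with_retry(cmd, max_attempts=3, delay_seconds=5.0, run=subprocess.run):
    """Run a submit command, retrying only on the transient error above.

    cmd is an argument list such as ["qsub", "job.sh"] (hypothetical script).
    run is injectable so the logic can be exercised without a real cluster.
    Returns the command's stdout (for qsub, the new job id) on success.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        # Retry only the known intermittent failure; anything else is fatal.
        transient = TRANSIENT_MARKER in result.stderr
        if not transient or attempt == max_attempts:
            raise RuntimeError(
                f"submit failed (exit {result.returncode}): {result.stderr.strip()}"
            )
        time.sleep(delay_seconds)
```

A caller would use something like submit_with_retry(["qsub", "job.sh"]). One caveat: the error names a job id, so it is unclear whether the server created the job before failing. If it did, a blind retry would submit a duplicate, so it may be worth checking the queue for the reported id before resubmitting.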