[torqueusers] intermittent qsub failures
cartley at hotmail.com
Tue Nov 19 19:20:46 MST 2013
I am seeing intermittent qsub failures. They seem to be related to load (several hundred jobs submitted). Every once in a while, qsub fails with "Unknown Job Id Error" or "can not locate new job":
Exit code = 153
Error: qsub: submit error (Unknown Job Id Error)
Exit code = 196
Error: qsub: submit error (Invalid request MSG=can not locate new job 630254.h2 (0 - Success))
In the server log, I find messages like these:
11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
type=DeleteJob, from joeuser at g4
11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
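To see how often each reject code shows up under load, entries like these can be tallied with a short script. This is only a minimal sketch: it assumes each entry sits on one line with six semicolon-separated fields, as in the 14:41:44 lines above, and the field names (`timestamp`, `event`, `source`, `kind`, `obj`, `message`) are labels chosen here for illustration, not Torque terminology.

```python
import re
from collections import Counter

# Sample entries copied from the server log excerpt above.
SAMPLE_LOG = """\
11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
"""

# Field names are illustrative labels, not official Torque names.
FIELDS = ("timestamp", "event", "source", "kind", "obj", "message")


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def count_reject_codes(text):
    """Count occurrences of 'code=NNNNN' in the message field of each entry."""
    counts = Counter()
    for line in text.splitlines():
        record = parse_line(line)
        if record is None:
            continue  # wrapped or malformed line, skipped in this sketch
        match = re.search(r"code=(\d+)", record["message"])
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for code, n in sorted(count_reject_codes(SAMPLE_LOG).items()):
        print(code, n)
```

Entries that are wrapped across several lines, like the 01:16:42 one above, would need to be rejoined before this sketch could count them.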
So far I haven't found anything helpful. Please let me know if you have any idea what's going on.
By the way, we were having lots of problems with Torque and NFS, but after configuring Torque as recommended in http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html, those problems went away and our reliability improved dramatically. Now all that remain are the two occasional problems above.