[torqueusers] intermittent qsub failures

Craig Artley cartley at hotmail.com
Tue Nov 19 19:20:46 MST 2013


I am seeing intermittent qsub failures. They seem to be related to load (several hundred jobs submitted). Every once in a while, qsub fails with "Unknown Job Id Error" or "can not locate new job":

    Exit code = 153
    Error: qsub: submit error (Unknown Job Id Error)

    Exit code = 196
    Error: qsub: submit error (Invalid request MSG=can not locate new job 630254.h2 (0 - Success))

In the server log, I find messages like these:

11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error

11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4


11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
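To see how often each kind of reject occurs, the server log can be tallied with a short script. The following is only a minimal sketch: it assumes every entry is a single line of six semicolon-separated fields, as in the excerpts above. The field names and the helper names (parse_server_log_line, count_rejects) are my own labels for illustration, not Torque terminology.

```python
import re
from collections import Counter

# Field names are assumed labels for the six semicolon-separated columns
# seen in the log excerpts; they are not official Torque names.
FIELDS = ("timestamp", "event", "daemon", "obj_type", "obj_name", "message")

# Matches e.g. "Reject reply code=15004(... ), aux=0, type=JobScript, from ..."
REJECT_RE = re.compile(r"code=(\d+)\b.*\btype=(\w+)")


def parse_server_log_line(line):
    """Split one log line into a dict, or return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def count_rejects(lines):
    """Count req_reject entries by (reply code, request type)."""
    counts = Counter()
    for line in lines:
        rec = parse_server_log_line(line)
        if rec is None or rec["obj_name"] != "req_reject":
            continue
        match = REJECT_RE.search(rec["message"])
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample entries copied from the excerpts above.
SAMPLE = """\
11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4
11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
"""

if __name__ == "__main__":
    for (code, req_type), n in sorted(count_rejects(SAMPLE.splitlines()).items()):
        print(code, req_type, n)
```

Entries that wrap across several lines would need to be rejoined before parsing; this sketch skips any line that does not split into six fields.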

So far I haven't found anything helpful. Please let me know if you have any idea what's going on.

By the way, we were having lots of problems with Torque and NFS, but after configuring Torque as recommended in http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html, those problems went away and our reliability improved dramatically. Now all that remains are the two occasional problems above.

  -craig
 		 	   		  