[torqueusers] intermittent qsub failures

Craig Artley cartley at hotmail.com
Wed Nov 20 19:56:42 MST 2013


This is with 4.1.6.

What version / branch do you recommend?

  -craig

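In the meantime, a retry wrapper around qsub may be a usable stopgap. The following is only a minimal sketch: treating exit codes 153 and 196 as transient is an assumption based on the failures quoted below, not documented Torque behavior, and the function name and its parameters are hypothetical. It also assumes that a failed submission left no job behind; if the server did queue the job, a retry would submit it twice.

```python
import subprocess
import time

# Exit codes seen in the quoted failures. Treating them as transient is an
# assumption, not documented Torque behavior.
TRANSIENT_EXIT_CODES = {153, 196}


def submit_with_retry(cmd, max_attempts=5, base_delay=1.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run cmd (e.g. ["qsub", "job.sh"]) and retry on transient exit codes.

    Returns the stripped stdout (the job id, for qsub) on success and
    raises RuntimeError on a non-transient failure or after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        giving_up = attempt == max_attempts
        if result.returncode not in TRANSIENT_EXIT_CODES or giving_up:
            raise RuntimeError(
                f"{cmd[0]} failed with exit code {result.returncode}: "
                f"{result.stderr.strip()}"
            )
        # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

A call would look like submit_with_retry(["qsub", "job.sh"]). The run and sleep parameters exist only so the wrapper can be exercised without a Torque server.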
________________________________
> Date: Wed, 20 Nov 2013 12:10:02 -0700 
> From: dbeer at adaptivecomputing.com 
> To: torqueusers at supercluster.org 
> Subject: Re: [torqueusers] intermittent qsub failures 
> 
> What version are you getting this error on? We had a related fix recently. 
> 
> 
> On Tue, Nov 19, 2013 at 7:20 PM, Craig Artley 
> <cartley at hotmail.com> wrote: 
> I am seeing intermittent qsub failures. It seems to be related to load 
> --- several hundred jobs submitted. Every once in a while, qsub fails 
> with "Unknown Job Id Error" or "can not locate new job": 
> 
> Exit code = 153 
> Error: qsub: submit error (Unknown Job Id Error) 
> 
> Exit code = 196 
> Error: qsub: submit error (Invalid request MSG=can not locate new job 630254.h2 (0 - Success)) 
> 
> In the server log, I find messages like these: 
> 
> 11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error 
> 
> 11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4 
> 
> 
> 11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success) 
> 11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1 
> 11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4 
> 
> So far I haven't found anything helpful. Please let me know if you have 
> any idea what's going on. 
> 
> By the way, we were having lots of problems with Torque and NFS, but 
> after configuring torque as recommended in 
> http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html, 
> those problems went away and our reliability improved dramatically. Now 
> all that remains are the two occasional problems above. 
> 
> -craig 
> 
> _______________________________________________ 
> torqueusers mailing list 
> torqueusers at supercluster.org 
> http://www.supercluster.org/mailman/listinfo/torqueusers 
> 
> 
> 
> 
> -- 
> David Beer | Senior Software Engineer 
> Adaptive Computing 
> 

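To see how often each rejection occurs under load, the req_reject entries in the server log can be tallied by reply code and request type. This is a rough sketch that assumes each entry sits on one line and has the six semicolon-separated fields seen in the quoted excerpts. The field names and the function name are my own labels, not taken from the Torque documentation.

```python
import re
from collections import Counter

# Pulls the reply code and request type out of a req_reject message, e.g.
# "Reject reply code=15001(...), aux=0, type=DeleteJob, from ...".
REJECT_RE = re.compile(r"code=(\d+)\(.*type=(\w+)")


def count_rejects(lines):
    """Count (reply code, request type) pairs among req_reject log entries."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp;event;server;class;name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # skip wrapped or malformed lines
        name, message = fields[4], fields[5]
        if name != "req_reject":
            continue
        match = REJECT_RE.search(message)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing an open log file to count_rejects() returns a Counter, so most_common() lists the most frequent rejections first.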
