[torqueusers] intermittent qsub failures

Craig Artley cartley at hotmail.com
Thu Nov 21 12:08:04 MST 2013


That's great, thanks very much for the information! We'll look for this soon, and make the move from 4.1.x to 4.2.x.
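
Until we can upgrade, one possible stopgap is to wrap the submission and retry when qsub exits with one of the two codes quoted below in this thread (153 and 196). The following is only a minimal sketch under stated assumptions, not something verified against Torque: the function name submit_with_retry, the attempt count, and the delay are hypothetical, and treating these exit codes as transient is based solely on what we have observed here. Note that after a "can not locate new job" failure it is not certain whether the server kept a partial job, so blind retries could in principle produce duplicates.

```python
import subprocess
import time

# Exit codes seen in this thread for the two intermittent failures.
# Treating them as retryable is an assumption, not documented behavior.
TRANSIENT_EXIT_CODES = {153, 196}


def submit_with_retry(cmd, max_attempts=5, delay=2.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run a submit command, retrying only on the transient exit codes.

    Returns the command's stdout (for qsub, the job id) on success and
    raises RuntimeError on any other failure or when attempts run out.
    `run` and `sleep` are injectable so the logic can be tested without
    a real qsub.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        transient = result.returncode in TRANSIENT_EXIT_CODES
        if not transient or attempt == max_attempts:
            raise RuntimeError(
                "submit failed with exit code %d after %d attempt(s): %s"
                % (result.returncode, attempt, result.stderr.strip())
            )
        # Linear backoff: wait a little longer after each failed attempt.
        sleep(delay * attempt)
```

It would be called as submit_with_retry(["qsub", "job.sh"]), where job.sh stands in for whatever script is being submitted.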

  -craig

Date: Thu, 21 Nov 2013 08:17:22 -0700
From: dbeer at adaptivecomputing.com
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] intermittent qsub failures

I thought that we had this fixed in 4.2.6, but it looks like the fix is currently only in Jarvik. We can get this released with 4.2.7.


On Wed, Nov 20, 2013 at 7:56 PM, Craig Artley <cartley at hotmail.com> wrote:

This is with 4.1.6.

What version / branch do you recommend?

  -craig

________________________________
> Date: Wed, 20 Nov 2013 12:10:02 -0700
> From: dbeer at adaptivecomputing.com
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] intermittent qsub failures
>
> What version are you getting this error on? We had a related fix recently.
>
> On Tue, Nov 19, 2013 at 7:20 PM, Craig Artley <cartley at hotmail.com> wrote:
> I am seeing intermittent qsub failures. They seem to be related to load
> (several hundred jobs submitted). Every once in a while, qsub fails
> with "Unknown Job Id Error" or "can not locate new job":
>
> Exit code = 153
> Error: qsub: submit error (Unknown Job Id Error)
>
> Exit code = 196
> Error: qsub: submit error (Invalid request MSG=can not locate new job 630254.h2 (0 - Success))
>
> In the server log, I find messages like these:
>
> 11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
> 11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4
>
> 11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
> 11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
> 11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
>
> So far I haven't found anything helpful. Please let me know if you have
> any idea what's going on.
>
> By the way, we were having lots of problems with Torque and NFS, but
> after configuring Torque as recommended in
> http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html,
> those problems went away and our reliability improved dramatically. Now
> all that remains are the two occasional problems above.
>
> -craig
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing

-- 
David Beer | Senior Software Engineer
Adaptive Computing


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers