[torqueusers] intermittent qsub failures

Jagga Soorma jagga13 at gmail.com
Wed Nov 20 13:12:25 MST 2013


I am using version torque-server-2.5.13-1.  Is there an updated version that
I should be using or a fix that I could apply and test?

Thanks,
-J


On Wed, Nov 20, 2013 at 11:10 AM, David Beer <dbeer at adaptivecomputing.com> wrote:

> What version are you getting this error on? We had a related fix recently.
>
>
> On Tue, Nov 19, 2013 at 7:20 PM, Craig Artley <cartley at hotmail.com> wrote:
>
>> I am seeing intermittent qsub failures. It seems to be related to load
>> --- several hundred jobs submitted. Every once in a while, qsub fails with
>> "Unknown Job Id Error" or "can not locate new job":
>>
>>     Exit code = 153
>>     Error: qsub: submit error (Unknown Job Id Error)
>>
>>     Exit code = 196
>>     Error: qsub: submit error (Invalid request MSG=can not locate new job 630254.h2 (0 - Success))
>>
>> In the server log, I find messages like these:
>>
>> 11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
>>
>> 11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4
>>
>> 11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
>> 11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
>> 11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
>>
>> So far I haven't found anything helpful. Please let me know if you have
>> any idea what's going on.
>>
>> By the way, we were having lots of problems with Torque and NFS, but
>> after configuring Torque as recommended in
>> http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html,
>> those problems went away and our reliability improved dramatically. Now all
>> that remains are the two occasional problems above.
>>
>>   -craig
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
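To see how often each kind of rejection occurs under load, the server log entries quoted above can be tallied with a small script. This is only a rough sketch in Python, assuming every entry sits on a single line and has six semicolon-separated fields as in the quoted samples. The dictionary keys (when, event, source, kind, name, message) and the function names are labels made up for this example, not Torque terminology, and the regular expression is fitted to the two "Reject reply" messages shown in this thread rather than to any documented format.

```python
import re
from collections import Counter
from datetime import datetime

# Sample entries copied from the log excerpt quoted in this thread,
# with the mail client's line wrapping undone.
SAMPLE_LOG = """\
11/19/2013 01:16:42;0080;PBS_Server.27108;Job;625027.h2;Unknown Job Id Error
11/19/2013 01:16:42;0080;PBS_Server.27108;Req;req_reject;Reject reply code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0, type=DeleteJob, from joeuser at g4
11/19/2013 14:41:44;0001;PBS_Server.29564;Svr;PBS_Server;LOG_ERROR::Invalid request (15004) in req_jobscript, can not locate new job 630254.h2 (0 - Success)
11/19/2013 14:41:44;0100;PBS_Server.27141;Job;630253.h2;enqueuing into parallel, state 1 hop 1
11/19/2013 14:41:44;0080;PBS_Server.29564;Req;req_reject;Reject reply code=15004(Invalid request MSG=can not locate new job 630254.h2 (0 - Success)), aux=0, type=JobScript, from joeuser at g4
"""

# Assumption: reject messages look like the two samples above.
REJECT_RE = re.compile(r"Reject reply code=(\d+)\(.*type=(\w+)")


def parse_line(line):
    """Split one log entry into its six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    stamp, event, source, kind, name, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"when": when, "event": event, "source": source,
            "kind": kind, "name": name, "message": message}


def count_rejects(lines):
    """Count req_reject entries per (reply code, request type)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["name"] != "req_reject":
            continue
        match = REJECT_RE.search(rec["message"])
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (code, req_type), n in sorted(count_rejects(SAMPLE_LOG.splitlines()).items()):
        print(f"{code} {req_type}: {n}")
```

Run against a full server log instead of SAMPLE_LOG, the counts per reply code and request type, together with the timestamps from parse_line, may help show whether the rejections cluster around bursts of submissions.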