[torqueusers] qrerun fails due to Unauthorized Request

Robert Oostenveld r.oostenveld at donders.ru.nl
Mon Nov 21 14:57:41 MST 2011


Dear David,

No, my regular account is not a manager, but root is. I had expected qrerun to work for a user's own jobs, and I had expected it to work from a submit client, not only on the torque server. I cannot make all ~200 regular users managers (imagine one of them doing "qdel all" on his runaway jobs), and I don't want to give them access to the server. My conclusion at the moment is that qrerun cannot be used to restart a MATLAB job at a later time in case the licences run out.
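
For concreteness, here is a minimal sketch of the kind of wrapper I described in my original message. The function names and the LICENSE_PATTERN default are placeholders chosen for illustration, not taken from an actual script, and the pattern has to be adapted to whatever text your MATLAB installation prints on a license failure. As discussed above, the qrerun step only succeeds for an account that pbs_server accepts for that request.

```shell
#!/bin/sh
# Sketch only: LICENSE_PATTERN and all function names are illustrative
# assumptions; adapt the pattern to the error text MATLAB prints at your site.
LICENSE_PATTERN="${LICENSE_PATTERN:-License Manager Error}"

# Return 0 if the given output file contains the license failure pattern.
is_license_failure() {
    grep -q "$LICENSE_PATTERN" "$1"
}

# Place a user hold on the job, then ask the server to requeue it.
# qrerun is rejected with "Unauthorized Request" for unprivileged accounts.
hold_and_rerun() {
    qalter -h u "$1" && qrerun "$1"
}

# Run MATLAB with "-r <filename>" and requeue the job on a license failure.
# PBS_JOBID is set by torque in the environment of a running job.
run_matlab_job() {
    log=$(mktemp) || return 1
    matlab -r "$1" > "$log" 2>&1
    status=$?
    cat "$log"
    if is_license_failure "$log"; then
        hold_and_rerun "$PBS_JOBID"
        status=$?
    fi
    rm -f "$log"
    return $status
}
```

Inside a job script this would be called as "run_matlab_job myscript" after sourcing the functions above. The user hold keeps the requeued job from starting again immediately; it would have to be released later with qrls.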

Is there an overview somewhere of which commands a regular user can use? The man pages don't provide this information, and the error message is not very informative.

The qrerun command and many others are installed by the torque-clients package in the bin directory; would it not be more appropriate to install it in sbin and only install it with the torque-server package?

best regards,
Robert



On 18 Nov 2011, at 17:59, David Beer wrote:

> Are the super user and/or your user at that box managers on pbs_server? You would need manager privileges to qrerun a job.
> 
> David
> 
> ----- Original Message -----
>> Dear torque users,
>> 
>> I am trying to use qrerun in a shell script to deal with the
>> (potential) limit in available MATLAB licenses. Let me briefly
>> outline the idea before explaining the problem.
>> 
>> I have a shell script that starts MATLAB with the "-r <filename>"
>> option for a MATLAB script. In case there is no license available,
>> MATLAB returns immediately with a descriptive error about the
>> license failure. I would like to catch that error and if it happens,
>> issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job
>> for execution at a later time. Note that I am aware of the ability
>> to configure floating resources in moab, but I am using maui.
>> Furthermore, the floating resources for the MATLAB license don't
>> optimally represent the license requirements for scheduling multiple
>> jobs by the same user on a multicore machine. Hence I prefer to use
>> qrerun instead of making the license a managed resource.
>> 
>> The problem I run into can be summarized in the following snippet
>> from the command line. I schedule a simple job that subsequently
>> starts running on one of the execution hosts:
>> 
>> roboos at mentat001> echo sleep 1000 | qsub
>> 45254.dccn-l014.dccn.nl
>> 
>> Then I try to use qrerun, first as regular user then as super user
>> (which I normally would not do of course):
>> 
>> roboos at mentat001> qrerun 45254
>> qrerun: Unauthorized Request  45254.dccn-l014.dccn.nl
>> roboos at mentat001> sudo qrerun 45254
>> qrerun: Unauthorized Request  MSG=operation not permitted
>> 45254.dccn-l014.dccn.nl
>> 
>> So as root/administrative user I am also not allowed to do it from
>> the client machine. I am able to log in directly on the torque
>> server, where as regular user I am also not allowed to qrerun. It is
>> not a general failure of qrerun, since the root user on the
>> torque server is allowed to use it:
>> 
>> roboos at mentat001> ssh torque
>> roboos at torque> qrerun 45254
>> qrerun: Unauthorized Request  45254.dccn-l014.dccn.nl
>> roboos at torque> sudo qrerun 45254
>> 
>> after which the job is correctly requeued and starts over again.
>> 
>> To provide some info from the log files: as regular user I get the
>> following message in /var/spool/torque/server_logs
>> 
>> 11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply
>> code=15018(Request invalid for state of job), aux=0, type=RerunJob,
>> from roboos at mentat001.dccn.nl
>> 
>> and as root on the torque server I get
>> 
>> 11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply
>> code=15018(Request invalid for state of job), aux=0, type=RerunJob,
>> from root at dccn-l014.dccn.nl
>> 
>> The log message is basically the same. In the log message on the
>> execution host I cannot find anything that pertains to the failed
>> qrerun request.
>> 
>> Does anyone have an idea on what might be the problem for the regular
>> user not being allowed to restart the job? I tried the same thing on
>> a different torque cluster (not managed by me) that I have access
>> to, and also there it failed.
>> 
>> 
>> best regards,
>> Robert
>> 
>> 
>> 
>> -----------------------------------------------------------
>> Robert Oostenveld, PhD
>> Senior Researcher & MEG Physicist
>> Donders Institute for Brain, Cognition and Behaviour
>> Centre for Cognitive Neuroimaging
>> Radboud University Nijmegen
>> tel.: +31 (0)24 3619695
>> e-mail: r.oostenveld at donders.ru.nl
>> web: http://www.ru.nl/neuroimaging
>> skype: r.oostenveld
>> -----------------------------------------------------------
>> 
>> 
>> 
>> 
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> 
> 
> -- 
> David Beer 
> Direct Line: 801-717-3386 | Fax: 801-717-3738
>     Adaptive Computing
>     1712 S East Bay Blvd, Suite 300
>     Provo, UT 84606
> 


