[torqueusers] qrerun fails due to Unauthorized Request

Robert Oostenveld r.oostenveld at donders.ru.nl
Wed Nov 16 01:52:20 MST 2011


Dear torque users,

I am trying to use qrerun in a shell script to deal with a (potential) shortage of available MATLAB licenses. Let me briefly outline the idea before explaining the problem.

I have a shell script that starts MATLAB with the "-r <filename>" option for a MATLAB script. If no license is available, MATLAB returns immediately with a descriptive error about the license failure. I would like to catch that error and, if it happens, issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job for execution at a later time. Note that I am aware of the ability to configure floating resources in Moab, but I am using Maui. Furthermore, floating resources for the MATLAB license do not optimally represent the license requirements when scheduling multiple jobs by the same user on a multicore machine. Hence I prefer to use qrerun instead of making the license a managed resource.
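The idea can be sketched as a small wrapper function. This is only an illustration under stated assumptions: the pattern "License checkout failed" stands in for whatever text your MATLAB installation prints on a license failure, the function names and the return status 75 are made up for this example, and the sketch presumes that the calling user is actually permitted to run qrerun, which is exactly what fails in the transcript further down.

```shell
#!/bin/sh
# Sketch: run a MATLAB script and requeue the Torque job on a license failure.

# ASSUMPTION: text printed by MATLAB when no license is available.
LICENSE_PATTERN="License checkout failed"

# is_license_failure FILE: succeed if the captured output shows a license error.
is_license_failure() {
    grep -q "$LICENSE_PATTERN" "$1"
}

# run_or_requeue SCRIPT: run MATLAB on SCRIPT; on a license failure,
# place a user hold on the current job and ask the server to rerun it.
run_or_requeue() {
    output=$(mktemp) || return 1
    matlab -r "$1" > "$output" 2>&1
    status=$?
    cat "$output"
    if is_license_failure "$output"; then
        rm -f "$output"
        # PBS_JOBID is set by Torque in the environment of a running job.
        qalter -h u "$PBS_JOBID"
        qrerun "$PBS_JOBID"
        return 75   # hypothetical "try again later" status
    fi
    rm -f "$output"
    return "$status"
}

# Usage inside the job script (myscript is a placeholder name):
# run_or_requeue myscript
```

The user hold keeps the requeued job from starting again right away; it would have to be released later (for example with qrls) once a license is likely to be free.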

The problem I run into can be summarized in the following snippet from the command line. I schedule a simple job that subsequently starts running on one of the execution hosts:

roboos@mentat001> echo sleep 1000 | qsub
45254.dccn-l014.dccn.nl

Then I try to use qrerun, first as a regular user, then as superuser (which I normally would not do, of course):

roboos@mentat001> qrerun 45254
qrerun: Unauthorized Request  45254.dccn-l014.dccn.nl
roboos@mentat001> sudo qrerun 45254
qrerun: Unauthorized Request  MSG=operation not permitted 45254.dccn-l014.dccn.nl

So even as root/administrative user I am not allowed to do it from the client machine. I am able to log in directly on the torque server, where as a regular user I am also not allowed to qrerun. It is not a general failure of qrerun, since the root user on the torque server is allowed to use it:

roboos@mentat001> ssh torque
roboos@torque> qrerun 45254
qrerun: Unauthorized Request  45254.dccn-l014.dccn.nl
roboos@torque> sudo qrerun 45254

after which the job is correctly requeued and starts over again.

To provide some info from the log files: as a regular user I get the following message in /var/spool/torque/server_logs:

11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply code=15018(Request invalid for state of job), aux=0, type=RerunJob, from roboos@mentat001.dccn.nl

and as root on the torque server I get

11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply code=15018(Request invalid for state of job), aux=0, type=RerunJob, from root@dccn-l014.dccn.nl

The log message is basically the same. In the log on the execution host I cannot find anything that pertains to the failed qrerun request.
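Since the rejection in the server log refers to the state of the job, it may help to record the job state at the moment of each qrerun attempt. A minimal sketch, assuming that "qstat -f" prints a line of the form "job_state = R" for the job; the helper name job_state is made up for this example:

```shell
# job_state: read "qstat -f" output on stdin and print the value of job_state.
job_state() {
    sed -n 's/^[[:space:]]*job_state = //p'
}

# Usage:
# qstat -f 45254 | job_state
```

Running this just before each qrerun attempt would show whether the job is in the same state when the request is rejected and when it succeeds.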

Does anyone have an idea why a regular user is not allowed to rerun the job? I tried the same thing on a different torque cluster (not managed by me) that I have access to, and it failed there as well.


best regards,
Robert



-----------------------------------------------------------
Robert Oostenveld, PhD
Senior Researcher & MEG Physicist
Donders Institute for Brain, Cognition and Behaviour
Centre for Cognitive Neuroimaging
Radboud University Nijmegen
tel.: +31 (0)24 3619695
e-mail: r.oostenveld at donders.ru.nl
web: http://www.ru.nl/neuroimaging
skype: r.oostenveld
-----------------------------------------------------------





