[torqueusers] all jobs get stuck in Q

Guy Tsafnat guyt at unsw.edu.au
Sat Mar 1 13:45:25 MST 2014

We had a hard drive crash on the machine that runs torque/maui. No data was lost, but the machine had to be rebooted. After the reboot, pbs_server and maui start without warnings or errors, but all jobs remain in the queue with the error 'Unauthorized Request  MSG=operation not permitted'. Jobs like these used to run before the crash. Older jobs (i.e. ones started before the crash) don't seem to be affected. Any help appreciated.

Some output:

[root@red2 jobs]# qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1762.red2                  ...tterDataSVMIn xujuan          297:17:1 R batch          
1763.red2                  ...terDataSVMOut xujuan          294:49:4 R batch          
1768.red2                  ...erDataSVMInV6 xujuan          76:01:14 R batch          
1770.red2                  ...9F0D2FEE7894B tomcat                 0 Q batch          
1771.red2                  ...DE87821F4A264 tomcat                 0 Q batch          
1772.red2                  run-attacca.q    apache                 0 Q batch          
1773.red2                  e.make_run       guyt                   0 Q batch

[root@red2 jobs]# checkjob 1773

checking job 1773

State: Idle  EState: Deferred
Creds:  user:guyt  group:[DEFAULT]  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 20:00:00:00
SubmitTime: Sat Mar  1 18:38:33
  (Time Queued  Total: 12:34:25  Eligible: 00:00:00)

StartDate: -00:28:11  Sun Mar  2 06:44:47
Total Tasks: 24

Req[0]  TaskCount: 24  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 13
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15007, msg: 'Unauthorized Request  MSG=operation not permitted')
Holds:    Defer  (hold reason:  RMFailure)
PE:  24.00  StartPriority:  28
cannot select job 1773 for partition DEFAULT (job hold active)
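
In case it helps anyone compare the deferral reason across several queued jobs, here is a minimal Python sketch that pulls the reason, return code, and message out of checkjob text. The function name parse_deferral and the regular expression are my own assumptions, based only on the "job is deferred" line shown above; this is not a Maui API and other Maui versions may format the line differently.

```python
import re

# Assumed format, taken from the single "job is deferred" line in the
# checkjob output above; other Maui versions may print it differently.
DEFER_RE = re.compile(
    r"job is deferred\.\s+Reason:\s+(?P<reason>\S+)\s+"
    r"\(.*?rc:\s*(?P<rc>\d+),\s*msg:\s*'(?P<msg>.*)'\)"
)


def parse_deferral(checkjob_output):
    """Return (reason, rc, msg) from checkjob text, or None if no deferral line is found."""
    match = DEFER_RE.search(checkjob_output)
    if match is None:
        return None
    return match.group("reason"), int(match.group("rc")), match.group("msg")


# The deferral line copied from the checkjob output for job 1773.
sample = (
    "job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, "
    "rc: 15007, msg: 'Unauthorized Request  MSG=operation not permitted')"
)

if __name__ == "__main__":
    print(parse_deferral(sample))
```

Feeding it the saved checkjob output of each queued job would show whether they all fail with the same rc and message.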

