[torqueusers] blocked jobs in maui, waiting jobs in torque

Jan Svec jeniks at gmail.com
Wed Feb 3 15:21:35 MST 2010


Hi all,

I'm having a problem with a newly installed torque/maui system: a lot of jobs
fail to run. They get assigned to a node but then go into the Waiting state in
torque.

I searched through the maui, torque server and mom logs and found the following
(this is one of many failing jobs):

showq:
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

1159111             samgrid       Hold     1  3:00:00:00  Wed Feb  3 22:29:25

[root@torque ~]# tracejob 1159111
/var/spool/torque/mom_logs/20100203: No such file or directory
/var/spool/torque/sched_logs/20100203: No such file or directory

Job: 1159111.torque.farm.particle.cz

02/03/2010 22:29:25  S    enqueuing into d0prod, state 1 hop 1
02/03/2010 22:29:25  S    Job Queued at request of samgrid@sam2.farm.particle.cz, owner = samgrid@sam2.farm.particle.cz, job name = Z063015370, queue = d0prod
02/03/2010 22:29:25  A    queue=d0prod
02/03/2010 22:30:00  S    Job Modified at request of root@torque.farm.particle.cz
02/03/2010 22:30:00  S    Job Run at request of root@torque.farm.particle.cz
02/03/2010 22:30:00  S    Job Modified at request of root@torque.farm.particle.cz
02/03/2010 22:30:00  S    post_modify_req: PBSE_UNKJOBID for job 1159111.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = salix37
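
With many failing jobs, it may help to pull the PBSE_* errors out of the tracejob output automatically. The following is only a minimal sketch: the function name find_pbs_errors and the regular expressions are illustrative assumptions based on the record layout shown above (timestamp, one-letter source such as S or A, then the message), and the sample uses three of the records above with wrapped lines rejoined.

```python
import re

# Three tracejob records copied from the output above (wrapped lines rejoined).
SAMPLE = """\
02/03/2010 22:29:25  S    enqueuing into d0prod, state 1 hop 1
02/03/2010 22:29:25  A    queue=d0prod
02/03/2010 22:30:00  S    post_modify_req: PBSE_UNKJOBID for job 1159111.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = salix37
"""

# Assumed layout: "MM/DD/YYYY HH:MM:SS  <letter>  <message>".
RECORD = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Z])\s+(.*)$")
PBS_ERROR = re.compile(r"\bPBSE_[A-Z0-9_]+\b")


def find_pbs_errors(text):
    """Return (timestamp, source letter, error name) for each PBSE_* mention."""
    hits = []
    for line in text.splitlines():
        match = RECORD.match(line)
        if match is None:
            continue  # skip headers, blank lines and anything unrecognised
        stamp, source, message = match.groups()
        for name in PBS_ERROR.findall(message):
            hits.append((stamp, source, name))
    return hits


errors = find_pbs_errors(SAMPLE)
for stamp, source, name in errors:
    print(stamp, source, name)
```

In practice the text would come from the saved output of tracejob for each job id rather than from an inline sample.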

[root@torque ~]# grep 1159111 /usr/local/maui/log/maui.log
02/03 22:57:30 MJobFind('1159111',J,0)
02/03 22:57:30 MRMJobPreUpdate(1159111)
02/03 22:57:30 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz,TaskList,0)
02/03 22:57:30 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:57:30 INFO:     job 1159111 starttime: 1265232592 (00:27:31) presenttime: 1265234243  wclimit: 259200  mtime: 1265232600  etime: 0 walltime: 0  state: Hold
02/03 22:57:30 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:57:30 INFO:     job '1159111' Priority:        1
02/03 22:57:30 INFO:     job '1159111'  priority:     1.00
02/03 22:57:31 INFO:     job '1159111' Priority:        1
02/03 22:57:31 INFO:     job '1159111'  priority:     1.00
02/03 22:58:02 INFO:     line: '         1159111  samgrid 1265232592 1265232565    1 259200 -  6   1
02/03 22:58:39 MJobFind('1159111',J,0)
02/03 22:58:39 MRMJobPreUpdate(1159111)
02/03 22:58:39 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz,TaskList,0)
02/03 22:58:39 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:58:39 INFO:     job 1159111 starttime: 1265232592 (00:28:40) presenttime: 1265234312  wclimit: 259200  mtime: 1265232600  etime: 0 walltime: 0  state: Hold
02/03 22:58:39 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:58:40 INFO:     job '1159111' Priority:        1
02/03 22:58:40 INFO:     job '1159111'  priority:     1.00
02/03 22:58:40 INFO:     job '1159111' Priority:        1
02/03 22:58:40 INFO:     job '1159111'  priority:     1.00
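
As a sanity check on the epoch values in the maui INFO lines, the arithmetic can be reproduced with a few lines of Python. This is just a sketch: the numbers are copied from the 22:57:30 record above, the helper as_utc is a hypothetical name, and the times are printed in UTC because the logs do not state the server's timezone (the log timestamps look consistent with UTC+1, but that is an assumption).

```python
from datetime import datetime, timedelta, timezone

# Values copied from the 22:57:30 maui.log INFO record above.
starttime = 1265232592
presenttime = 1265234243
mtime = 1265232600
wclimit = 259200  # seconds


def as_utc(epoch):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# Maui prints the time since starttime in parentheses, e.g. (00:27:31).
elapsed = timedelta(seconds=presenttime - starttime)
limit = timedelta(seconds=wclimit)

print("starttime  :", as_utc(starttime).isoformat())
print("mtime      :", as_utc(mtime).isoformat())
print("presenttime:", as_utc(presenttime).isoformat())
print("elapsed    :", elapsed)
print("wclimit    :", limit)
```

The elapsed value works out to 0:27:31, matching the (00:27:31) in the log, and wclimit is exactly 3 days, matching the 3:00:00:00 shown by showq. The mtime converts to 21:30:00 UTC, which lines up with the 22:30:00 "Job Modified" records in tracejob if the server clock is on UTC+1.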

[root@torque ~]# ssh salix37 "grep 1159111 /var/spool/torque/mom_logs/*"
/var/spool/torque/mom_logs/20100203:02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=salix37.farm.particle.cz MSG=modify job failed, unknown job 1159111.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server@torque.farm.particle.cz
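
To compare rejects across many jobs, the pbs_mom record above can be split into its semicolon-separated fields and the reject details extracted. This is a rough sketch only: the field labels (stamp, event, daemon, and so on), the function name parse_reject and the regular expression are my own assumptions inferred from this one record, not a documented format, and the trailing "from ..." part of the record is left out of the sample.

```python
import re

# The pbs_mom record from the grep output above, rejoined onto one line.
# The trailing ", from ..." part is omitted from this sample.
RECORD = (
    "02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;"
    "Reject reply code=15001(Unknown Job Id REJHOST=salix37.farm.particle.cz "
    "MSG=modify job failed, unknown job 1159111.torque.farm.particle.cz), "
    "aux=0, type=ModifyJob"
)

# Assumed shape of the reject message, inferred from the single record above.
REJECT = re.compile(r"Reject reply code=(\d+)\((.*)\), aux=(\d+), type=(\w+)")


def parse_reject(record):
    """Split one mom log record and pull out the reject code and request type."""
    fields = [field.strip() for field in record.split(";", 5)]
    if len(fields) != 6:
        return None  # not a six-field record
    stamp, event, daemon, klass, name, message = fields
    match = REJECT.search(message)
    if match is None:
        return None  # some other kind of record
    code, text, aux, req_type = match.groups()
    return {
        "stamp": stamp,
        "daemon": daemon,
        "handler": name,
        "code": int(code),
        "text": text,
        "aux": int(aux),
        "type": req_type,
    }


info = parse_reject(RECORD)
print(info["stamp"], info["code"], info["type"])
```

For this record the request type is ModifyJob and the code is 15001 with the text "Unknown Job Id", logged at the same second (22:30:00) as the PBSE_UNKJOBID entry in the server's tracejob output.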

I think the problem is somehow connected with the PBSE_UNKJOBID error, but I
haven't found any solution. It seems strange to me that pbs_mom is
staging in files but doesn't know the job...

Thank you for any help.

Best regards,
Jan Svec
Institute of Physics AS CR