[torqueusers] blocked jobs in maui, waiting jobs in torque
Jan Svec
jeniks at gmail.com
Wed Feb 3 15:21:35 MST 2010
Hi all,
I'm having a problem with a newly installed torque/maui system: a lot of jobs
fail to run. They get assigned to a node but then go into the Waiting state in
torque.
I searched through the maui, torque server and mom logs and found the following
(this is one of many failing jobs):
showq:
BLOCKED JOBS----------------
JOBNAME   USERNAME   STATE   PROC   WCLIMIT      QUEUETIME
1159111   samgrid    Hold    1      3:00:00:00   Wed Feb 3 22:29:25
[root@torque ~]# tracejob 1159111
/var/spool/torque/mom_logs/20100203: No such file or directory
/var/spool/torque/sched_logs/20100203: No such file or directory
Job: 1159111.torque.farm.particle.cz
02/03/2010 22:29:25 S enqueuing into d0prod, state 1 hop 1
02/03/2010 22:29:25 S Job Queued at request of samgrid@sam2.farm.particle.cz, owner = samgrid@sam2.farm.particle.cz, job name = Z063015370, queue = d0prod
02/03/2010 22:29:25 A queue=d0prod
02/03/2010 22:30:00 S Job Modified at request of root@torque.farm.particle.cz
02/03/2010 22:30:00 S Job Run at request of root@torque.farm.particle.cz
02/03/2010 22:30:00 S Job Modified at request of root@torque.farm.particle.cz
02/03/2010 22:30:00 S post_modify_req: PBSE_UNKJOBID for job 1159111.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = salix37
[root@torque ~]# grep 1159111 /usr/local/maui/log/maui.log
02/03 22:57:30 MJobFind('1159111',J,0)
02/03 22:57:30 MRMJobPreUpdate(1159111)
02/03 22:57:30 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz,TaskList,0)
02/03 22:57:30 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:57:30 INFO: job 1159111 starttime: 1265232592 (00:27:31) presenttime: 1265234243 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold
02/03 22:57:30 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:57:30 INFO: job '1159111' Priority: 1
02/03 22:57:30 INFO: job '1159111' priority: 1.00
02/03 22:57:31 INFO: job '1159111' Priority: 1
02/03 22:57:31 INFO: job '1159111' priority: 1.00
02/03 22:58:02 INFO: line: ' 1159111 samgrid 1265232592 1265232565 1 259200 - 6 1
02/03 22:58:39 MJobFind('1159111',J,0)
02/03 22:58:39 MRMJobPreUpdate(1159111)
02/03 22:58:39 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz,TaskList,0)
02/03 22:58:39 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:58:39 INFO: job 1159111 starttime: 1265232592 (00:28:40) presenttime: 1265234312 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold
02/03 22:58:39 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:58:40 INFO: job '1159111' Priority: 1
02/03 22:58:40 INFO: job '1159111' priority: 1.00
02/03 22:58:40 INFO: job '1159111' Priority: 1
02/03 22:58:40 INFO: job '1159111' priority: 1.00
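
As a cross-check of the Maui entries, the epoch values in the two "INFO: job 1159111" lines can be turned back into durations. The sketch below is only an illustration: the helper names (hms, elapsed) are made up for this example, and the numbers are copied from the log. It suggests that the value in parentheses is simply presenttime minus starttime, and that wclimit corresponds to the three days shown by showq.

```python
from datetime import datetime, timezone

# Values copied from the "INFO: job 1159111" lines in maui.log.
START_TIME = 1265232592
PRESENT_TIMES = [1265234243, 1265234312]
WCLIMIT = 259200  # seconds


def hms(seconds: int) -> str:
    """Format a non-negative number of seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def elapsed(start: int, present: int) -> str:
    """Time between two epoch timestamps, formatted as HH:MM:SS."""
    return hms(present - start)


# Start time as Maui recorded it, shown in UTC (the cluster's local time zone
# is not stated in the logs, so no local conversion is attempted here).
print(datetime.fromtimestamp(START_TIME, tz=timezone.utc).isoformat())

for present in PRESENT_TIMES:
    print(elapsed(START_TIME, present))  # 00:27:31, then 00:28:40

print(WCLIMIT / 86400, "days")  # 3.0 days
```

So from Maui's point of view the job has a start time and keeps accumulating elapsed time, while walltime stays at 0 and the state stays Hold.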
[root@torque ~]# ssh salix37 "grep 1159111 /var/spool/torque/mom_logs/*"
/var/spool/torque/mom_logs/20100203:02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=salix37.farm.particle.cz MSG=modify job failed, unknown job 1159111.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server@torque.farm.particle.cz
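
Since many jobs fail the same way, it may help to pull the reject code, request type and job id out of such mom log records automatically. The following is a rough sketch under assumptions: the record is treated as six semicolon-separated fields, and the field labels (timestamp, event, daemon, obj_class, obj_name, message) as well as the function name parse_mom_record are my own, not official Torque names.

```python
import re

# The pbs_mom record from salix37, joined onto a single line.
RECORD = (
    "02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;"
    "Reject reply code=15001(Unknown Job Id REJHOST=salix37.farm.particle.cz "
    "MSG=modify job failed, unknown job 1159111.torque.farm.particle.cz), "
    "aux=0, type=ModifyJob, from PBS_Server@torque.farm.particle.cz"
)


def parse_mom_record(line: str) -> dict:
    """Split one mom log record and extract the reject details, if present."""
    # Assumed layout: six fields separated by ';', the last being free text.
    timestamp, event, daemon, obj_class, obj_name, message = (
        field.strip() for field in line.split(";", 5)
    )
    record = {
        "timestamp": timestamp,
        "event": event,
        "daemon": daemon,
        "obj_class": obj_class,
        "obj_name": obj_name,
        "message": message,
    }
    code = re.search(r"\bcode=(\d+)", message)
    req_type = re.search(r"\btype=(\w+)", message)
    job = re.search(r"unknown job ([^\s)]+)", message)
    record["reply_code"] = int(code.group(1)) if code else None
    record["request_type"] = req_type.group(1) if req_type else None
    record["job_id"] = job.group(1) if job else None
    return record


rec = parse_mom_record(RECORD)
print(rec["reply_code"], rec["request_type"], rec["job_id"])
# 15001 ModifyJob 1159111.torque.farm.particle.cz
```

Run over all mom logs, this would show whether every failing job is rejected on a ModifyJob request with the same code, or whether other request types are affected too.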
I think the problem is somehow connected with the PBSE_UNKJOBID error, but I
haven't found any solution. It seems strange to me that pbs_mom is
staging in files but doesn't know the job...
Thank you for any help.
Best regards,
Jan Svec
Institute of Physics AS CR