[torqueusers] Jobs in Waiting state with post_modify_req: PBSE_UNKJOBID

Tomas Kouba koubat at fzu.cz
Thu Jan 28 08:48:47 MST 2010


Hello all,

We have recently obtained new worker nodes with many new job slots, and
we are seeing quite strange behaviour.

The problems are:
- A set of nodes (which are not special in any way) is always empty.
- Some jobs get assigned to these nodes but end up in the Waiting state in Torque.

I have dug through the Maui, Torque and MOM logs. As an example, take job
1072470:

[root at torque ~]# tracejob 1072470
/var/spool/torque/mom_logs/20100128: No such file or directory
/var/spool/torque/sched_logs/20100128: No such file or directory

Job: 1072470.torque.farm.particle.cz

01/28/2010 13:15:36  S    enqueuing into d0prod, state 1 hop 1
01/28/2010 13:15:36  S    Job Queued at request of samgrid at sam2.farm.particle.cz, owner = samgrid at sam2.farm.particle.cz, job name = Z254407765, queue = d0prod
01/28/2010 13:15:36  A    queue=d0prod
01/28/2010 13:36:51  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 13:36:51  S    Job Run at request of root at torque.farm.particle.cz
01/28/2010 13:36:51  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 13:36:51  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix14
01/28/2010 14:07:11  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 14:07:11  S    Job Run at request of root at torque.farm.particle.cz
01/28/2010 14:07:11  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 14:07:11  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix14
01/28/2010 14:31:28  S    Job Run at request of root at torque.farm.particle.cz
01/28/2010 14:54:28  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 14:54:28  S    Job Run at request of root at torque.farm.particle.cz
01/28/2010 14:54:28  S    Job Modified at request of root at torque.farm.particle.cz
01/28/2010 14:54:28  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix15
01/28/2010 15:10:13  S    Job Run at request of root at torque.farm.particle.cz
01/28/2010 15:10:17  A    user=samgrid group=samgrid jobname=Z254407765 queue=d0prod ctime=1264680936 qtime=1264680936 etime=0 start=1264687817 owner=samgrid at sam2.farm.particle.cz exec_host=salix02/5 Resource_List.neednodes=1
                          Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00


This shows that the job was tried on saltix14, then on saltix14 again, then on saltix15, and was finally started successfully on salix02.
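
To make that sequence easier to pull out of longer traces, here is a small Python sketch of how I read the tracejob output. It is only a helper of my own, not part of Torque: the function name attempted_hosts and the two regular expressions are assumptions based solely on the line formats shown above.

```python
import re

# Server-side failure line as printed by tracejob above, e.g.
# "post_modify_req: PBSE_UNKJOBID for job <id> in state RUNNING-STAGEGO, dest = saltix14"
FAILED = re.compile(
    r"post_modify_req: (\w+) for job (\S+) in state (\S+), dest = (\S+)"
)
# Accounting record of the run that finally started, e.g. "exec_host=salix02/5"
STARTED = re.compile(r"exec_host=([^/\s]+)")


def attempted_hosts(tracejob_text):
    """Return (list of failed destinations in order, final exec host or None)."""
    failed = []
    final_host = None
    for line in tracejob_text.splitlines():
        m = FAILED.search(line)
        if m:
            failed.append(m.group(4))  # group 4 is the "dest = ..." host
            continue
        m = STARTED.search(line)
        if m:
            final_host = m.group(1)
    return failed, final_host


# Shortened copy of the relevant lines from the trace above.
sample = """\
01/28/2010 13:36:51  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix14
01/28/2010 14:07:11  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix14
01/28/2010 14:54:28  S    post_modify_req: PBSE_UNKJOBID for job 1072470.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = saltix15
01/28/2010 15:10:17  A    user=samgrid group=samgrid exec_host=salix02/5 Resource_List.neednodes=1
"""

print(attempted_hosts(sample))
# -> (['saltix14', 'saltix14', 'saltix15'], 'salix02')
```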

Maui shows similar records:

maui.log.3:01/28 13:36:51 MPBSJobModify(1072470,Resource_List,Resource,saltix14)
maui.log.2:01/28 14:07:11 MPBSJobModify(1072470,Resource_List,Resource,saltix14)
maui.log.1:01/28 14:54:28 MPBSJobModify(1072470,Resource_List,Resource,saltix15)
maui.log.1:01/28 15:36:43 INFO:     cannot locate PBS job '1072470.torque.farm.particle.cz' (running on node salix02)

The MOM log on saltix14 shows this:

01/28/2010 13:36:51;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=saltix14.farm.particle.cz MSG=modify job failed, unknown job 1072470.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server at torque.farm.particle.cz
01/28/2010 14:07:11;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=saltix14.farm.particle.cz MSG=modify job failed, unknown job 1072470.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server at torque.farm.particle.cz

My theory is that the job is somehow scheduled on saltix14 but never actually delivered to it.
Torque then tries to start it, but the MOM does not know about the job and rejects the request.
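
To check this theory against a complete MOM log, the rejects could be counted per host, job and request type. The Python sketch below is only a rough helper of my own: the name count_rejects and the regular expression are assumptions derived from the two req_reject lines quoted above, and other reject messages may look different.

```python
import re
from collections import Counter

# Pattern for the pbs_mom reject lines quoted above, e.g.
# "...;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=<host>
#  MSG=modify job failed, unknown job <jobid>), aux=0, type=ModifyJob, ..."
REJECT = re.compile(
    r"req_reject;Reject reply code=(?P<code>\d+)\(.*?REJHOST=(?P<host>\S+) "
    r"MSG=.*?unknown job (?P<job>[^\s)]+)\).*?type=(?P<type>\w+)"
)


def count_rejects(mom_log_lines):
    """Count 'unknown job' rejects keyed by (host, job id, request type, code)."""
    counts = Counter()
    for line in mom_log_lines:
        m = REJECT.search(line)
        if m:
            key = (m["host"], m["job"], m["type"], int(m["code"]))
            counts[key] += 1
    return counts


# The two lines from the saltix14 MOM log shown above.
mom_lines = [
    "01/28/2010 13:36:51;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=saltix14.farm.particle.cz MSG=modify job failed, unknown job 1072470.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server at torque.farm.particle.cz",
    "01/28/2010 14:07:11;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=saltix14.farm.particle.cz MSG=modify job failed, unknown job 1072470.torque.farm.particle.cz), aux=0, type=ModifyJob, from PBS_Server at torque.farm.particle.cz",
]

for key, n in count_rejects(mom_lines).items():
    print(n, key)
```

If the counts only ever show ModifyJob rejects for jobs that the MOM never received, that would be consistent with the theory, although it would not prove it.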

My test jobs run just fine (and I cannot identify any difference between the rejected jobs and my own jobs).

What other ways are there to find out why a job is in the 'Waiting' state?

Thank you for any help.

Best regards,

-- 
Tomas Kouba
Institute of Physics, Academy of Sciences of the Czech Republic

