[torqueusers] Multi node jobs fail on startup and pbs_mom cannot clear the stale job. (torque 4.2.6)

Roy Dragseth roy.dragseth at cc.uit.no
Sat Dec 28 17:01:02 MST 2013


We have an issue on two clusters running torque 4.2.6 where multi-node 
jobs fail on startup with an RMfailure and simply disappear from the queue. 
Afterwards it is not possible to clear the stale jobs from the sister 
nodes, because the MS (mother superior) denies any knowledge of the job.

Transcripts are truncated for clarity.
Complete transcripts of the logs can be found on pastebin: 
http://pastebin.com/XNcK7kWq

First, a multi-node job fails on initialization:

# tracejob -n 10 1407069 | cut -c -$COLUMNS

Job: 1407069.gardar-adm.nhpc.hi.is

12/25/2013 12:39:07  S    enqueuing into default, state 1 hop 1
12/25/2013 12:39:07  A    queue=default
12/25/2013 12:39:08  S    Job Run at request of root at gardar-adm.nhpc.hi.is
12/25/2013 12:39:08  S    Not sending email: User does not want mail of 
this type.
12/25/2013 12:39:08  A    user=userx group=userx account=snic025-12-36 
jobname=start_run.csh queue=default ctime=1387975147 qtime=1387975147 
etime=1387975147
12/25/2013 12:47:26  S    Job Run at request of root at gardar-adm.nhpc.hi.is
12/25/2013 12:47:26  S    Not sending email: User does not want mail of 
this type.
12/25/2013 12:47:26  A    user=userx group=userx account=snic025-12-36 
jobname=start_run.csh queue=default ctime=1387975147 qtime=1387975147 
etime=1387975147
12/25/2013 12:52:25  S    Exit_status=0 resources_used.cput=00:00:00 
resources_used.mem=0kb resources_used.vmem=0kb 
resources_used.walltime=00:00:00
12/25/2013 12:52:25  S    Not sending email: User does not want mail of 
this type.
12/25/2013 12:52:25  S    on_job_exit valid pjob: 
1407069.gardar-adm.nhpc.hi.is (substate=50)
12/25/2013 12:52:25  A    user=userx group=userx account=snic025-12-36 
jobname=start_run.csh queue=default ctime=1387975147 qtime=1387975147 
etime=1387975147
12/25/2013 12:54:45  S    dequeuing from default, state COMPLETE


The mom log on the MS contains:

# grep -A5 -B5 1407069 20131225 | head -n 50
12/25/2013 12:29:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:34:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:39:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:44:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:44:09;0008; 
pbs_mom.2473;Job;resend_waiting_joins;Successfully re-sent join job 
request to compute-3-11
12/25/2013 12:47:24;0001;   pbs_mom.2473;Job;exec_bail;bailing on job 
1407069.gardar-adm.nhpc.hi.is code -3
12/25/2013 12:47:24;0008;   pbs_mom.2473;Req;send_sisters;sending ABORT 
to sisters for job 1407069.gardar-adm.nhpc.hi.is
12/25/2013 12:47:25;0001; pbs_mom.2473;Svr;pbs_mom;LOG_ERROR::exec_bail, 
exec_bail: sent 46 ABORT requests, should be 47
12/25/2013 12:47:25;0001; pbs_mom.2473;Job;im_request;10.128.1.44:785 
sent an abort. Killing job 1407069.gardar-adm.nhpc.hi.is
12/25/2013 12:47:25;0080;   pbs_mom.2473;Svr;preobit_preparation;top
12/25/2013 12:47:25;0080; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;obit sent to server
12/25/2013 12:47:25;0080; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;removed job script
12/25/2013 12:49:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:52:25;0080; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;obit sent to server
12/25/2013 12:52:25;0001;   pbs_mom.2473;Req;obit_reply;Job not found 
for obit reply
12/25/2013 12:52:27;0008; 
pbs_mom.2473;Job;resend_waiting_joins;Successfully re-sent join job 
request to compute-3-11
12/25/2013 12:52:35;0080; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;removed job script
12/25/2013 12:54:07;0002;   pbs_mom.2473;Svr;pbs_mom;Torque Mom Version 
= 4.2.6, loglevel = 0
12/25/2013 12:55:43;0008; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;ERROR: received request 
'ABORT_JOB' from 10.128.1.96:15003 for job 
'1407069.gardar-adm.nhpc.hi.is' cookie 
'9227726684FAC8F4CB91CFAB615ADA59' event '0' (job does not exist locally).
12/25/2013 12:55:43;0008; 
pbs_mom.2473;Job;1407069.gardar-adm.nhpc.hi.is;ERROR: received request 
'ABORT_JOB' from 10.128.1.46:15003 for job 
'1407069.gardar-adm.nhpc.hi.is' cookie '9227726

The log then reports a failed ABORT_JOB every second for the rest of the 
day, and this continues until we reinstall the node.
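
To see which sisters keep sending the stale aborts, the mom log can be 
tallied per sender. The following is only a minimal Python sketch. It 
assumes the on-disk mom log holds one record per line (the excerpts in 
this mail are wrapped) and that the message text matches the ABORT_JOB 
lines quoted above. The function name count_abort_senders is made up for 
illustration and is not part of torque.

```python
import re
from collections import Counter

# Matches the message quoted in the mom log above:
#   received request 'ABORT_JOB' from <ip>:<port> for job '<jobid>'
# Assumption: each log record sits on a single line in the real log file.
ABORT_RE = re.compile(
    r"received request 'ABORT_JOB' from (?P<ip>[\d.]+):\d+ "
    r"for job '(?P<job>[^']+)'"
)


def count_abort_senders(lines, job_id):
    """Return a Counter mapping sender IP -> number of ABORT_JOB requests
    seen for job_id in the given iterable of log lines."""
    senders = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match and match.group("job") == job_id:
            senders[match.group("ip")] += 1
    return senders
```

Passing an open log file and the job id, for example 
count_abort_senders(open("20131225"), "1407069.gardar-adm.nhpc.hi.is"), 
and then calling most_common() on the result lists the noisiest sisters 
first.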

The users just get an RMfailure and no stdout/stderr report for the job.

The problem seems to escalate: as more and more jobs are affected, the 
syslog is flooded with messages like these:

Dec 27 22:07:59 compute-7-8 pbs_mom: LOG_ERROR::im_request, KILL_JOB 
ERROR and I'm not MS
Dec 27 22:07:59 compute-7-8 pbs_mom: LOG_ERROR::im_request, error 
processing command 99 event_com 2 for job 1407069.gardar-adm.nhpc.hi.is 
from 10.128.1.6:268:(15003)
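
To quantify how far this has spread, the im_request errors in syslog can 
be counted per job and per node. Again, this is only a sketch in Python. 
It assumes the classic syslog line layout shown above with each message 
on a single line, and the helper name tally_im_request_errors is 
hypothetical.

```python
import re
from collections import Counter

# Matches syslog lines of the form quoted above, e.g.
#   Dec 27 22:07:59 compute-7-8 pbs_mom: LOG_ERROR::im_request, error
#   processing command 99 event_com 2 for job <jobid> from <addr>
# Assumption: the message is on one line in the real syslog file.
IM_ERR_RE = re.compile(
    r"^\w{3}\s+\d+ [\d:]+ (?P<host>\S+) pbs_mom: LOG_ERROR::im_request, "
    r"error processing command \d+ event_com \d+ for job (?P<job>\S+) from"
)


def tally_im_request_errors(lines):
    """Return two Counters: errors per job id and errors per reporting host."""
    per_job, per_host = Counter(), Counter()
    for line in lines:
        match = IM_ERR_RE.search(line)
        if match:
            per_job[match.group("job")] += 1
            per_host[match.group("host")] += 1
    return per_job, per_host
```

The number of keys in the first Counter is the number of distinct jobs 
that have hit the error, which makes it easier to track whether the 
problem is growing over time.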


Any clues on how to fix this?

Regards,
r.


