[torqueusers] Troubleshooting qsub -I interactive queue jobs

Atwood, Robert C r.atwood at imperial.ac.uk
Mon Sep 3 06:56:59 MDT 2007


Hi,
For some reason, jobs submitted with qsub -I are exiting immediately.
Until very recently this did not happen; -I jobs worked correctly. The
main change is that the outside network has changed, requiring us to run
a DHCP client for the outside network. I am not sure why this should
affect the cluster network, but it is the only difference I can think of
from a few days ago, when this was working.
I have looked at the MOM log on the node and the server log on the
master (both appended below), and I don't see what is wrong. The MOM log
just says 'Failure job exec failure'. What does this mean, and how can I
find out what is causing it?
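
One thing that may be worth ruling out, given the DHCP change: for an interactive job the execution node has to connect back to the qsub process on the submit host, so the name the node resolves for the submit host matters more than it does for batch jobs. The sketch below is only a diagnostic aid, not a known fix. It resolves a host name and reports any addresses that fall outside the cluster network. The network 192.168.0.0/24 is a placeholder assumption, not the real cluster range, and the function names are made up for illustration.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def outside_network(addresses, cluster_net):
    """Return the addresses that do NOT belong to the given network (CIDR string)."""
    net = ipaddress.ip_network(cluster_net)
    return [a for a in addresses if ipaddress.ip_address(a) not in net]


def report(hostnames, cluster_net):
    """Print, for each name, what it resolves to and which addresses are off-cluster."""
    for name in hostnames:
        addrs = resolve_ipv4(name)
        if not addrs:
            print(f"{name}: does not resolve")
            continue
        stray = outside_network(addrs, cluster_net)
        print(f"{name}: {addrs} (outside {cluster_net}: {stray or 'none'})")
```

Run on node04, something like report(["mt-hive2.mt.ic.ac.uk", "master.beowulf.cluster"], "192.168.0.0/24"), with the real cluster range substituted, would show whether the submit host name now resolves to an outside address that the node cannot reach.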


Thanks
Robert



>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<
 ~> qsub -I -l nodes=node04.beowulf.cluster
qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted

>>>>>>>>>>>> mom_log on node04 <<<<<<<<<<<<<<<<<<<<
09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at master.beowulf.cluster, sock=10
09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at master.beowulf.cluster, sock=10
09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at master.beowulf.cluster, sock=10
09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master.beowulf.cluster, sock=11
09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry
09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at master.beowulf.cluster, sock=10
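
Most of these entries are routine request traffic, so it can help to filter them out mechanically. The following is a minimal sketch, assuming each record has the six semicolon-separated fields seen in the excerpt (timestamp, event code, daemon, class, object, message) and that a record pasted into mail may have been wrapped over several lines. Treating code 0100 as routine is an assumption based only on the entries shown here, and the function names are illustrative.

```python
import re

# A record begins with a date such as 09/03/2007; anything else is a continuation line.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4}")


def rejoin(lines):
    """Merge wrapped continuation lines back into one string per log record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            records[-1] += " " + line
    return records


def parse(record):
    """Split a record into its six fields; return None if the layout is unexpected."""
    fields = record.split(";", 5)
    if len(fields) != 6:
        return None
    keys = ("stamp", "code", "daemon", "cls", "obj", "msg")
    return dict(zip(keys, (f.strip() for f in fields)))


def non_routine(lines, routine_codes=("0100",)):
    """Return parsed records whose event code is not in routine_codes."""
    parsed = (parse(r) for r in rejoin(lines))
    return [p for p in parsed if p and p["code"] not in routine_codes]
```

Applied to the MOM log above, this leaves only the TMomFinalizeJob3 failure and the send_sisters abort, which are the two entries that say anything about why the job ended.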

>>>>>>>>>>>>>>>> server log on master <<<<<<<<<<<<<<<<<<<<
09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ReadyToCommit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into default, state 1 hop 1
09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from default, state QUEUED
09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into short, state 1 hop 1
09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command new
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command recyc
09/03/2007 13:44:50;0100;PBS_Server;Req;;Type AuthenticateUser request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusServer request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at node04, sock=9
09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=50)
09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=52)
09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=53)
09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from short, state COMPLETE
09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command term
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
09/03/2007 13:45:02;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from rcatwood at mt-hive2.mt.ic.ac.uk
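
The one server entry that summarises the outcome is the 0010 record carrying Exit_status and the resources_used values. A small sketch for pulling those key=value pairs out of such a message is given below. The helper names are made up for illustration, and reading a negative Exit_status as "the MOM could not start the job" is my understanding of TORQUE's convention rather than something this log states, although it fits the MOM's "before files staged, no retry" message logged at the same second.

```python
import re

# key=value pairs separated by whitespace, e.g. resources_used.walltime=00:00:20
ATTR = re.compile(r"(\S+?)=(\S+)")


def job_exit_info(message):
    """Return the key=value attributes found in a job accounting message as a dict."""
    return dict(ATTR.findall(message))


def walltime_seconds(hhmmss):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def failed_before_start(info):
    """True if Exit_status is negative (assumed to mean the job never ran user code)."""
    return int(info.get("Exit_status", "0")) < 0
```

For the record above this gives Exit_status -1 and a walltime of 20 seconds, which matches the gap between the RunJob request at 13:44:32 and the JobObituary at 13:44:52. A delay of that length might point to something timing out on the node rather than failing at once.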

