[torqueusers] Troubleshooting qsub -I interactive queue jobs

Aaron Knister aaron at iges.org
Mon Sep 3 08:47:22 MDT 2007


How is the cluster doing name resolution? Is everything listed in
/etc/hosts, or is there a local DNS server running? If you're running a
local DNS server, check that the DHCP client hasn't overwritten its entry
in /etc/resolv.conf. DHCP clients normally rewrite that file with the
nameservers supplied by the outside network, and those won't know the
cluster's internal names.
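
If it helps, here is a rough sketch of that check in Python. It is only an
illustration, not something I have run on your cluster: the function names
are made up for this example, and the hostnames in the usage comment are
simply the ones that appear in your logs. It lists the nameservers that
resolv.conf currently holds and reports which names fail to resolve. I
believe the MOM on the node has to reach the submit host by name for an
interactive job, so it is probably worth running on both the master and
node04.

```python
import socket


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (resolv.conf uses '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def unresolved_hosts(hostnames, resolver=socket.gethostbyname):
    """Return the names for which the resolver raises a lookup error."""
    failed = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            failed.append(name)
    return failed


def report(resolv_conf_text, hostnames, resolver=socket.gethostbyname):
    """Build a short text report of nameservers and failed lookups."""
    servers = parse_nameservers(resolv_conf_text)
    lines = ["nameservers: " + (", ".join(servers) or "none")]
    for name in unresolved_hosts(hostnames, resolver):
        lines.append("cannot resolve: " + name)
    return "\n".join(lines)


# Example use (hostnames taken from the logs in this thread):
#   with open("/etc/resolv.conf") as f:
#       print(report(f.read(), ["master.beowulf.cluster",
#                               "node04.beowulf.cluster",
#                               "mt-hive2.mt.ic.ac.uk"]))
```

If the nameserver line no longer points at your local DNS server, or any of
the cluster names show up as unresolvable, that would fit the DHCP client
having rewritten the file.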

-Aaron

On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:

> Hi,
> For some reason jobs using qsub -I are immediately exiting. Until very
> recently this was not happening; -I jobs worked correctly. The main
> change is that the outside network has changed, requiring us to run a
> DHCP client for the outside network. I am not sure why this should
> affect the cluster network, but that's all I can think of that's
> different from just a few days ago when this was working.
> I have looked at the MOM log on the node and the server log on the
> master (appended below). I don't see what is wrong; it just says
> 'Failure job exec failure'. What does this mean, and how may I find out
> what is causing it?
>
>
> Thanks
> Robert
>
>
>
>>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<
>  ~> qsub -I -l nodes=node04.beowulf.cluster
> qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
>
>>>>>>>>>>>>> mom_log on node04 <<<<<<<<<<<<<<<<<<<<
> 09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request
> received from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request
> received from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received
> from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request
> received from PBS_Server at master.beowulf.cluster, sock=11
> 09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, before files staged, no retry
> 09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request
> received from PBS_Server at master.beowulf.cluster, sock=10
>
>>>>>>>>>>>>>>>>> server log on master <<<<<<<<<<<<<<<<<<<<
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ReadyToCommit request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit request received
> from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> default, state 1 hop 1
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> default, state QUEUED
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> short, state 1 hop 1
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner =
> rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> 09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command new
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusServer request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusNode request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusQueue request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ResourceQuery request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command recyc
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusServer request
> received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob request
> received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type JobObituary request
> received from pbs_mom at node04, sock=9
> 09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1
> resources_used.cput=00:00:00 resources_used.mem=0kb
> resources_used.vmem=0kb resources_used.walltime=00:00:20
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=50)
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=52)
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=53)
> 09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> short, state COMPLETE
> 09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command term
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusServer request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusNode request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusQueue request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> rcatwood at mt-hive2.mt.ic.ac.uk
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org


