[torqueusers] Troubleshooting qsub -I interactive queue jobs

Atwood, Robert C r.atwood at imperial.ac.uk
Mon Sep 3 09:17:30 MDT 2007


Hi:

Yes, the dhcp client was overwriting /etc/resolv.conf. I tried
stuffing everything in /etc/hosts, but that is not working. Perhaps I
have done it incorrectly. I am able to log in to the nodes by name, and
to log in from a node to the master either as 'master' or by its
hostname. However, I cannot log in to an outside machine from a node
using its hostname. I also need to solve that (not for Torque, though).
If you can recommend a document to read I would be grateful; I have not
had to deal with this before, since the initial configuration JUST
WORKED until now, as it did on the previous cluster.


With this setup I still get the behaviour I described in the previous
message: qsub -I jobs exit immediately.
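
A minimal sketch of how the lookups described above could be checked
from a node. The function name check_resolution and its resolve
parameter are illustrative, not part of Torque. The list of names in the
usage comment is taken from this thread; mt-hive2.mt.ic.ac.uk is the
host that appears in the job id and as the job owner's host in the
server log. Whether the node's inability to resolve outside names is
what breaks qsub -I here is an assumption, not something the logs
confirm.

```python
import socket


def check_resolution(names, resolve=socket.gethostbyname):
    """Map each name to its resolved address, or None if the lookup fails.

    `resolve` defaults to the system resolver (which consults /etc/hosts
    and DNS according to the host's configuration); it is a parameter only
    so the function can be exercised without a network.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror and socket.herror derive from OSError
            results[name] = None
    return results


# Example use on a compute node (names copied from this thread):
#
#   names = ["master", "master.beowulf.cluster", "node01",
#            "mt-hive2.mt.ic.ac.uk"]
#   for name, addr in check_resolution(names).items():
#       print(name, addr if addr else "DOES NOT RESOLVE")
```

Running the same check on the master and on a node, and comparing the
two outputs, would show which names only one side can resolve.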

Thanks again
Robert


I have set up the files like so:

/etc/hosts:


127.0.0.1               localhost.beowulf.cluster localhost
10.141.255.254          master.beowulf.cluster master

10.141.0.1             node01 node01.beowulf.cluster
(etc) 
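
In the hosts file format the first name after the address is the
canonical name and the rest are aliases. In the listing above the master
entry puts the fully qualified name first, while the node01 entry puts
the short name first. The sketch below, with the illustrative names
parse_hosts and short_name_first, flags entries of the second kind.
Whether that ordering matters for this problem is not established here;
it is only a cheap consistency check.

```python
def parse_hosts(text):
    """Return (address, canonical_name, aliases) tuples from hosts-file text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address with no name carries no information
        entries.append((fields[0], fields[1], fields[2:]))
    return entries


def short_name_first(entries):
    """Entries whose canonical name has no dot although an alias has one."""
    return [e for e in entries
            if "." not in e[1] and any("." in alias for alias in e[2])]


# The three entries quoted in this message.
SAMPLE = """\
127.0.0.1               localhost.beowulf.cluster localhost
10.141.255.254          master.beowulf.cluster master
10.141.0.1              node01 node01.beowulf.cluster
"""

for address, canonical, aliases in short_name_first(parse_hosts(SAMPLE)):
    print(address, canonical, aliases)
```

For the sample, only the node01 entry is reported.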

/var/spool/torque/server_priv/nodes:

node01 np=2 x11
(etc)
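
A small sketch for cross-checking the nodes file against the names
known to /etc/hosts. It assumes the simple line layout shown above (a
node name, an optional np=N, then free-form properties) and treats a
missing np as 1; the helper names parse_nodes and missing_from_hosts are
hypothetical.

```python
def parse_nodes(text):
    """Parse nodes-file text into dicts with name, np and properties."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, rest = fields[0], fields[1:]
        np = 1  # assumption: one processor when np= is not given
        properties = []
        for token in rest:
            if token.startswith("np="):
                np = int(token[len("np="):])
            else:
                # anything else is kept verbatim as a property
                properties.append(token)
        nodes.append({"name": name, "np": np, "properties": properties})
    return nodes


def missing_from_hosts(nodes, known_names):
    """Names listed in the nodes file that are absent from known_names."""
    return [n["name"] for n in nodes if n["name"] not in known_names]


nodes = parse_nodes("node01 np=2 x11\n")
# known_names would normally be every name and alias found in /etc/hosts.
print(missing_from_hosts(nodes, {"node01", "node01.beowulf.cluster"}))
```

An empty list means every node named in the nodes file also has a hosts
entry under exactly that name.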



> -----Original Message-----
> From: Aaron Knister [mailto:aaron at iges.org] 
> Sent: 03 September 2007 15:47
> To: Atwood, Robert C
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Troubleshooting qsub -I interactive queue jobs
> 
> How is the cluster doing name resolution? Have you stuffed 
> everything in /etc/hosts or is there a local dns server 
> running? If you're running a local dns server check to make 
> sure the dhcp client hasn't overwritten your local dns server 
> in /etc/resolv.conf.
> 
> -Aaron
> 
> On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
> 
> 
> 	Hi,
> 	For some reason jobs using qsub -I are immediately exiting. Until very
> 	recently this was not happening; -I jobs worked correctly. The main
> 	change is that the outside network has changed, requiring us to run a
> 	dhcp client for the outside network. I am not sure why this should
> 	affect the cluster network, but that's all I can think of that's
> 	different from just a few days ago when this was working.
> 	I have looked at the MOM log on the node and the server log on the
> 	master (appended below). I don't see what is wrong; it just says
> 	'Failure job exec failure'. What does this mean, and how may I find
> 	out what is causing it?
> 
> 
> 	Thanks
> 	Robert
> 
> 
> 
> 
> 	command line capture <<<<<<<<<<<<<<<<
> 
> 	 ~> qsub -I -l nodes=node04.beowulf.cluster
> 	qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> 	qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
> 
> 
> 	mom_log on node04 <<<<<<<<<<<<<<<<<<<<
> 
> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at master.beowulf.cluster, sock=10
> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at master.beowulf.cluster, sock=10
> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at master.beowulf.cluster, sock=10
> 	09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master.beowulf.cluster, sock=11
> 	09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry
> 	09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 	09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at master.beowulf.cluster, sock=10
> 
> 
> 	server log on master <<<<<<<<<<<<<<<<<<<<
> 
> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ReadyToCommit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into default, state 1 hop 1
> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from default, state QUEUED
> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into short, state 1 hop 1
> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> 	09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command new
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 	09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command recyc
> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type AuthenticateUser request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusServer request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at node04, sock=9
> 	09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=50)
> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=52)
> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=53)
> 	09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from short, state COMPLETE
> 	09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command term
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 	09/03/2007 13:45:02;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 	09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 	09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from rcatwood at mt-hive2.mt.ic.ac.uk
> 
> 
> Aaron Knister
> Associate Systems Administrator/Web Designer
> Center for Research on Environment and Water
> 
> (301) 595-7001
> aaron at iges.org
> 
> 
> 
> 

