[torqueusers] Troubleshooting qsub -I interactive queue jobs

Atwood, Robert C r.atwood at imperial.ac.uk
Wed Sep 5 07:35:47 MDT 2007


 Yes, batch jobs appear to work correctly.
tracejob does not return anything that explains to me what happened:
the job ran and exited with status -1. The console showed nothing beyond
the messages captured below, and no command prompt appeared between
them.

I believe you are correct that it's a nameserver issue, but I don't see
why the node loses track of the address of the master just on the
internal network. After all, the master is stuffed into each node's
/etc/hosts as well.

So I am going to put questions to those who can hopefully help sort
out the DHCP. Any Torque-specific information about how name
resolution affects this kind of job would still be helpful. I think the
diagnostic output contains insufficient information to find out what the
problem actually is, however! Happy to be shown otherwise.
Robert





>>>>>>>>>>> node /etc/hosts file <<<<<<<<<<<<<<<<<
127.0.0.1       localhost.localdomain localhost
10.141.255.254  master.beowulf.cluster master hive2 mt-hive2
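
For anyone wanting to repeat the check, here is a minimal Python sketch
that tests whether given names are covered by a hosts file like the one
above. The function names (names_in_hosts, missing_names) and the list
of names to check are made up for illustration; the names themselves are
taken from the qsub output and logs in this thread. Whether the node
really needs the fully qualified outside name in /etc/hosts for qsub -I
to work is an assumption, not something these logs confirm.

```python
# Hosts-file text copied from the node's /etc/hosts shown above.
NODE_HOSTS = """\
127.0.0.1       localhost.localdomain localhost
10.141.255.254  master.beowulf.cluster master hive2 mt-hive2
"""


def names_in_hosts(hosts_text):
    """Return a dict mapping every name and alias to its IP address."""
    table = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first entry for a name
    return table


def missing_names(hosts_text, wanted):
    """Return the names in `wanted` that have no entry in the hosts text."""
    table = names_in_hosts(hosts_text)
    return [name for name in wanted if name not in table]


# Names under which the master appears in the qsub output and logs
# (the choice of names to check is an assumption for illustration).
wanted = ["master.beowulf.cluster", "mt-hive2", "mt-hive2.mt.ic.ac.uk"]
print(missing_names(NODE_HOSTS, wanted))
```

With the two lines above, the only name reported as missing is the
fully qualified outside name that appears in the job id.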

>>>>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<<<<<<< 

~> qsub -I -l nodes=node04.beowulf.cluster
qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted

>>>>>>>>> tracejob -n 5 results <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Job: 12892.mt-hive2.mt.ic.ac.uk

09/03/2007 13:44:32  S    enqueuing into default, state 1 hop 1
09/03/2007 13:44:32  S    dequeuing from default, state QUEUED
09/03/2007 13:44:32  S    enqueuing into short, state 1 hop 1
09/03/2007 13:44:32  S    Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk,
                          owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN,
                          queue = short
09/03/2007 13:44:32  S    Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32  S    Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32  A    queue=default
09/03/2007 13:44:32  A    queue=short
09/03/2007 13:44:37  L    Job Run
09/03/2007 13:44:37  A    user=rcatwood group=pg jobname=STDIN queue=short
                          ctime=1188823472 qtime=1188823472 etime=1188823472
                          start=1188823477 exec_host=node04.beowulf.cluster/0
                          Resource_List.neednodes=node04.beowulf.cluster
                          Resource_List.nice=16 Resource_List.nodect=1
                          Resource_List.nodes=node04.beowulf.cluster
                          Resource_List.walltime=04:00:00
09/03/2007 13:44:52  S    Exit_status=-1 resources_used.cput=00:00:00
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:20
09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=50)
09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=52)
09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=53)
09/03/2007 13:44:52  S    dequeuing from short, state COMPLETE
09/03/2007 13:44:52  A    user=rcatwood group=pg jobname=STDIN queue=short
                          ctime=1188823472 qtime=1188823472 etime=1188823472
                          start=1188823477 exec_host=node04.beowulf.cluster/0
                          Resource_List.neednodes=node04.beowulf.cluster
                          Resource_List.nice=16 Resource_List.nodect=1
                          Resource_List.nodes=node04.beowulf.cluster
                          Resource_List.walltime=04:00:00 session=0
                          end=1188823492 Exit_status=-1
                          resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:20
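
The epoch fields in the accounting record can be turned into intervals
to see how long the job sat in each stage. This is only a rough sketch:
RECORD is a condensed copy of the fields shown above, and the helper
name epoch_fields is invented for illustration.

```python
import re

# Condensed copy of the epoch fields from the final accounting record above.
RECORD = ("ctime=1188823472 qtime=1188823472 etime=1188823472 "
          "start=1188823477 end=1188823492 Exit_status=-1")


def epoch_fields(record):
    """Extract the epoch-second fields of an accounting record as integers."""
    pairs = re.findall(r"\b(ctime|qtime|etime|start|end)=(\d+)", record)
    return {key: int(value) for key, value in pairs}


t = epoch_fields(RECORD)
queued = t["start"] - t["qtime"]   # seconds between queueing and start
running = t["end"] - t["start"]    # seconds between start and end
total = t["end"] - t["ctime"]      # seconds between creation and end
print(queued, running, total)
```

The total comes to 20 seconds, which agrees with the
resources_used.walltime=00:00:20 reported in the record.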

> -----Original Message-----
> From: Aaron Knister [mailto:aaron at iges.org] 
> Sent: 04 September 2007 01:45
> To: Atwood, Robert C
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Troubleshooting qsub -I 
> interactive queue jobs
> 
> Batch jobs run fine, you say?
> 
> Also can you run tracejob jobid on an interactive job you 
> tried (that failed).
> 
> -Aaron
> 
> On Sep 3, 2007, at 11:17 AM, Atwood, Robert C wrote:
> 
> 
> 	Hi:
> 
> 	Yes, the dhcp client was overwriting /etc/resolv.conf. I tried
> 	stuffing everything in /etc/hosts, but that is not working. Perhaps I
> 	have done it incorrectly. But I am able to log in to nodes by name, and
> 	to log in from a node to the master either as 'master' or by its
> 	hostname. However, I cannot log in to an outside machine from the node
> 	using the hostname. I also need to solve that (not for Torque though,
> 	but if you can recommend a document to read I would be grateful; I have
> 	not actually had to deal with this before, since the initial
> 	configuration JUST WORKED until now, and on the previous cluster).
> 
> 
> 	With this situation I have the behaviour I described in the previous
> 	message. 
> 
> 	Thanks again
> 	Robert
> 
> 
> 	I have done the files like so. 
> 
> 	/etc/hosts:
> 
> 
> 	127.0.0.1               localhost.beowulf.cluster localhost
> 	10.141.255.254          master.beowulf.cluster master
> 
> 	10.141.0.1             node01 node01.beowulf.cluster
> 	(etc) 
> 
> 	/var/spool/torque/server_priv/nodes:
> 
> 	node01 np=2 x11
> 	(etc)
> 
> 
> 
> 
> 		-----Original Message-----
> 		From: Aaron Knister [mailto:aaron at iges.org] 
> 		Sent: 03 September 2007 15:47
> 		To: Atwood, Robert C
> 		Cc: torqueusers at supercluster.org
> 		Subject: Re: [torqueusers] Troubleshooting qsub -I 
> 		interactive queue jobs
> 
> 		How is the cluster doing name resolution? Have you stuffed 
> 		everything in /etc/hosts or is there a local dns server 
> 		running? If you're running a local dns server check to make 
> 		sure the dhcp client hasn't overwritten your local dns server 
> 		in /etc/resolv.conf.
> 
> 		-Aaron
> 
> 		On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
> 
> 
> 		Hi,
> 		For some reason jobs using qsub -I are immediately exiting. Until
> 		very recently this was not happening; -I jobs worked correctly. The
> 		main change is that the outside network has changed, requiring us to
> 		run a dhcp client for the outside network. I am not sure why this
> 		should affect the cluster network, but that's all I can think of
> 		that's different from just a few days ago when this was working.
> 		I have looked at the MOM log on the node and the server log on the
> 		master (appended below). I don't see what is wrong; it just says
> 		'Failure job exec failure'. What does this mean, and how may I find
> 		out what is causing it?
> 
> 
> 		Thanks
> 		Robert
> 
> 
> 
> 
> 
> 		
> 		
> 
> 		command line capture <<<<<<<<<<<<<<<<
> 
> 		~> qsub -I -l nodes=node04.beowulf.cluster
> 		qsub: waiting for job 
> 12892.mt-hive2.mt.ic.ac.uk to start
> 		qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
> 
> 
> 
> 		
> 		
> 
> 		mom_log on node04 
> 		<<<<<<<<<<<<<<<<<<<<
> 
> 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request received
> 		from PBS_Server at master.beowulf.cluster, sock=10
> 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request
> 		received from PBS_Server at master.beowulf.cluster, sock=10
> 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received
> 		from PBS_Server at master.beowulf.cluster, sock=10
> 		09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request received
> 		from PBS_Server at master.beowulf.cluster, sock=11
> 		09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> 		started, Failure job exec failure, before files staged, no retry
> 		09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> 		sisters
> 		09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request received
> 		from PBS_Server at master.beowulf.cluster, sock=10
> 
> 
> 
> 		
> 		
> 
> 		
> 		
> 
> 		server log on master <<<<<<<<<<<<<<<<<<<<
> 
> 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> 		AuthenticateUser request
> 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob 
> 		request received
> 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> 		ReadyToCommit request
> 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit 
> 		request received
> 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> 		default, state 1 hop 1
> 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> 		default, state QUEUED
> 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> 		short, state 1 hop 1
> 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> 		Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner =
> 		rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> 		09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> 		sent command new
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> 		StatusServer request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> StatusNode request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> 		StatusQueue request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> 		ResourceQuery request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> 		Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> 		Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 		09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> 		sent command recyc
> 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type 
> 		AuthenticateUser request
> 		received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type 
> 		StatusServer request
> 		received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob 
> 		request received
> 		from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> 		JobObituary request
> 		received from pbs_mom at node04, sock=9
> 		09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1
> 		resources_used.cput=00:00:00 resources_used.mem=0kb
> 		resources_used.vmem=0kb resources_used.walltime=00:00:20
> 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> 		valid pjob: 0x594790 (substate=50)
> 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> 		valid pjob: 0x594790 (substate=52)
> 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> 		valid pjob: 0x594790 (substate=53)
> 		09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> 		short, state COMPLETE
> 		09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> 		sent command term
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> 		StatusServer request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> StatusNode request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> 		StatusQueue request
> 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> 		request received
> 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 		09/03/2007 13:45:02;0100;PBS_Server;Req;;Type 
> 		AuthenticateUser request
> 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 		09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob 
> 		request received
> 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 		09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply
> 		code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> 		rcatwood at mt-hive2.mt.ic.ac.uk
> 		_______________________________________________
> 		torqueusers mailing list
> 		torqueusers at supercluster.org
> 		http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> 		Aaron Knister
> 		Associate Systems Administrator/Web Designer
> 		Center for Research on Environment and Water
> 
> 		(301) 595-7001
> 		aaron at iges.org
> 
> 
> 
> 
> 
> 
> Aaron Knister
> Associate Systems Administrator/Web Designer
> Center for Research on Environment and Water
> 
> (301) 595-7001
> aaron at iges.org
> 
> 
> 
> 

