[torqueusers] Troubleshooting qsub -I interactive queue jobs

James J Coyle jjc at iastate.edu
Wed Sep 5 10:14:58 MDT 2007


To get /etc/hosts to take precedence over DNS you'll need to change the file
/etc/nsswitch.conf.

  Try changing the entry for hosts in the file /etc/nsswitch.conf to:

hosts:      files dns

  This will read your /etc/hosts file first when resolving names.
You may need to reboot for this to take effect.
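
  If you want to confirm which source wins after the change, getent follows
the nsswitch.conf ordering, so it makes a quick check (the host name below is
only an example, taken from later in this thread):

    getent hosts master.beowulf.cluster
    # with "hosts: files dns" this should print the /etc/hosts entry,
    # e.g.  10.141.255.254  master.beowulf.cluster master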


  A better solution would be to have correct settings in /etc/resolv.conf
by either
1) Stopping dhcp from over-writing /etc/resolv.conf in the first place,
  or
2) Over-writing /etc/resolv.conf on reboots yourself by:
    A)  Making a correct /etc/resolv.conf
    B)  Copying this to /etc/resolv.conf.correct on each node.
    C)  On each node editing the file /etc/rc3.d/S99local to append
              the command /bin/cp -f /etc/resolv.conf.correct /etc/resolv.conf
       (I am assuming that you come up in run level 3 and S99local is the
          last rc script run.)  A sketch of these steps follows below.
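
  For concreteness, a minimal sketch of option 2 run from the master,
assuming you can ssh to every node as root; the search domain, nameserver
address and node names are only placeholders, not values from your cluster:

    # Write the resolv.conf you actually want (example values only)
    printf 'search beowulf.cluster\nnameserver 10.141.255.254\n' > /etc/resolv.conf.correct

    # Push it to each node and have the last rc script restore it on every boot
    for n in node01 node02 node03 node04; do
        scp /etc/resolv.conf.correct root@$n:/etc/resolv.conf.correct
        ssh root@$n 'grep -q resolv.conf.correct /etc/rc3.d/S99local ||
            echo "/bin/cp -f /etc/resolv.conf.correct /etc/resolv.conf" >> /etc/rc3.d/S99local'
    done

  The grep guard just keeps the cp line from being appended twice if you run
this more than once.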

  It is possible that you may need to restart your network after getting a 
correct resolv.conf in place. That is likely done by
 
/sbin/service network restart

or by invoking the correct rc3.d script with an argument of restart.
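
  If you go with option 1 instead, ISC dhclient can be told to ignore the DNS
settings handed out by the outside network's DHCP server; a minimal sketch,
where the domain name and nameserver address are again only placeholders:

    # /etc/dhclient.conf (or /etc/dhcp/dhclient.conf on some distributions)
    # "supersede" makes dhclient use these values instead of whatever the
    # outside DHCP server offers, so /etc/resolv.conf keeps your settings.
    supersede domain-name "beowulf.cluster";
    supersede domain-name-servers 10.141.255.254;

After changing this, release and renew the lease (or just restart the network
as above) so dhclient rewrites /etc/resolv.conf with the superseded values.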

> 
>  Yes, batch jobs appear to work correctly. 
> Tracejob does not return anything that explains to me what happened:
> the job ran and exited with -1 status. From the console there were no
> symptoms at all, just the messages as captured, with no interactive
> command line appearing between them.
> 
> I believe you are correct that it's a nameserver issue, but I don't see
> why the node loses track of the address of the master just on the
> internal network. After all, the master is stuffed into each node's
> /etc/hosts as well.
> 
> So, I am going to ask questions of those who can hopefully help sort
> out the dhcp issue; any torque-specific information about how name
> resolution affects this kind of job would still be helpful. I think the
> diagnostic output contains insufficient information to find out what the
> problem actually is, however! Happy to be shown otherwise.
> Robert
> 
> 
> 
> 
> 
> >>>>>>>>>>> node /etc/hosts file <<<<<<<<<<<<<<<<<
> 127.0.0.1       localhost.localdomain localhost
> 10.141.255.254  master.beowulf.cluster master hive2 mt-hive2
> 
> >>>>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<<<<<<<
> 
> > 		~> qsub -I -l nodes=node04.beowulf.cluster
> > 		qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> > 		qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
> 
> >>>>>>>>> tracejob -n 5 results <<<<<<<<<<<<<<<<<<<<<<<<<<<<
> Job: 12892.mt-hive2.mt.ic.ac.uk
> 
> 09/03/2007 13:44:32  S    enqueuing into default, state 1 hop 1
> 09/03/2007 13:44:32  S    dequeuing from default, state QUEUED
> 09/03/2007 13:44:32  S    enqueuing into short, state 1 hop 1
> 09/03/2007 13:44:32  S    Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> 09/03/2007 13:44:32  S    Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32  S    Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32  A    queue=default
> 09/03/2007 13:44:32  A    queue=short
> 09/03/2007 13:44:37  L    Job Run
> 09/03/2007 13:44:37  A    user=rcatwood group=pg jobname=STDIN queue=short ctime=1188823472 qtime=1188823472 etime=1188823472 start=1188823477 exec_host=node04.beowulf.cluster/0 Resource_List.neednodes=node04.beowulf.cluster Resource_List.nice=16 Resource_List.nodect=1 Resource_List.nodes=node04.beowulf.cluster Resource_List.walltime=04:00:00
> 09/03/2007 13:44:52  S    Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
> 09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=50)
> 09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=52)
> 09/03/2007 13:44:52  S    on_job_exit valid pjob: 0x594790 (substate=53)
> 09/03/2007 13:44:52  S    dequeuing from short, state COMPLETE
> 09/03/2007 13:44:52  A    user=rcatwood group=pg jobname=STDIN queue=short ctime=1188823472 qtime=1188823472 etime=1188823472 start=1188823477 exec_host=node04.beowulf.cluster/0 Resource_List.neednodes=node04.beowulf.cluster Resource_List.nice=16 Resource_List.nodect=1 Resource_List.nodes=node04.beowulf.cluster Resource_List.walltime=04:00:00 session=0 end=1188823492 Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
> 
> > -----Original Message-----
> > From: Aaron Knister [mailto:aaron at iges.org] 
> > Sent: 04 September 2007 01:45
> > To: Atwood, Robert C
> > Cc: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] Troubleshooting qsub -I 
> > interactive queue jobs
> > 
> > Batch jobs run fine, you say?
> > 
> > Also can you run tracejob jobid on an interactive job you 
> > tried (that failed).
> > 
> > -Aaron
> > 
> > On Sep 3, 2007, at 11:17 AM, Atwood, Robert C wrote:
> > 
> > 
> > 	Hi:
> > 
> > 	Yes, the dhcp client was overwriting /etc/resolv.conf, so I tried
> > 	stuffing everything in /etc/hosts. But that is not working. Perhaps
> > 	I have done it incorrectly. I am able to log in to nodes by name,
> > 	and to log in from a node to the master either as 'master' or as
> > 	its hostname. However, I cannot log in to an outside machine from a
> > 	node using its hostname; I also need to solve that (not for Torque
> > 	though, but if you recommend a document to read I would be grateful,
> > 	I have not actually had to deal with this before since the initial
> > 	configuration JUST WORKED until now, and on the previous cluster)
> > 
> > 
> > 	With this situation I have the behaviour I described in 
> > the previous
> > 	message. 
> > 
> > 	Thanks again
> > 	Robert
> > 
> > 
> > 	I have done the files like so. 
> > 
> > 	/etc/hosts:
> > 
> > 
> > 	127.0.0.1               localhost.beowulf.cluster localhost
> > 	10.141.255.254          master.beowulf.cluster master
> > 
> > 	10.141.0.1             node01 node01.beowulf.cluster
> > 	(etc) 
> > 
> > 	/var/spool/torque/server_priv/nodes:
> > 
> > 	node01 np=2 x11
> > 	(etc)
> > 
> > 
> > 
> > 
> > 		-----Original Message-----
> > 		From: Aaron Knister [mailto:aaron at iges.org] 
> > 		Sent: 03 September 2007 15:47
> > 		To: Atwood, Robert C
> > 		Cc: torqueusers at supercluster.org
> > 		Subject: Re: [torqueusers] Troubleshooting qsub -I 
> > 		interactive queue jobs
> > 
> > 		How is the cluster doing name resolution? Have 
> > you stuffed 
> > 		everything in /etc/hosts or is there a local dns server 
> > 		running? If you're running a local dns server 
> > check to make 
> > 		sure the dhcp client hasn't overwritten your 
> > local dns server 
> > 		in /etc/resolv.conf.
> > 
> > 		-Aaron
> > 
> > 		On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
> > 
> > 
> > 		Hi,
> > 		For some reason jobs using qsub -I are immediately exiting. Until
> > 		very recently this was not happening; -I jobs worked correctly.
> > 		The main change is that the outside network has changed, requiring
> > 		us to run a dhcp client for the outside network. I am not sure why
> > 		this should affect the cluster network, but that's all I can think
> > 		of that's different from just a few days ago when this was working.
> > 		I have looked at the MOM log on the node and the server log on the
> > 		master (appended below); I don't see what is wrong, it just says
> > 		'Failure job exec failure'. What does this mean and how may I find
> > 		out what is causing it?
> > 
> > 
> > 		Thanks
> > 		Robert
> > 
> > 
> > 
> > 
> > 
> > 		
> > 		
> > 
> > 		command line capture <<<<<<<<<<<<<<<<
> > 
> > 		~> qsub -I -l nodes=node04.beowulf.cluster
> > 		qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> > 		qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
> > 
> > 
> > 
> > 		
> > 		
> > 
> > 		mom_log on node04 <<<<<<<<<<<<<<<<<<<<
> > 
> > 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at master.beowulf.cluster, sock=10
> > 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at master.beowulf.cluster, sock=10
> > 		09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at master.beowulf.cluster, sock=10
> > 		09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master.beowulf.cluster, sock=11
> > 		09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry
> > 		09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 		09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at master.beowulf.cluster, sock=10
> > 
> > 
> > 
> > 		
> > 		
> > 
> > 		
> > 		
> > 
> > 		server log on master <<<<<<<<<<<<<<<<<<<<
> > 
> > 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > 		AuthenticateUser request
> > 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob 
> > 		request received
> > 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > 		ReadyToCommit request
> > 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit 
> > 		request received
> > 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into default, state 1 hop 1
> > 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from default, state QUEUED
> > 		09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into short, state 1 hop 1
> > 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> > 		09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command new
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > 		StatusServer request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > StatusNode request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > 		StatusQueue request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type 
> > 		ResourceQuery request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> > 		09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> > 		09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command recyc
> > 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type 
> > 		AuthenticateUser request
> > 		received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type 
> > 		StatusServer request
> > 		received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob 
> > 		request received
> > 		from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> > 		JobObituary request
> > 		received from pbs_mom at node04, sock=9
> > 		09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
> > 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=50)
> > 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=52)
> > 		09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=53)
> > 		09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from short, state COMPLETE
> > 		09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command term
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> > 		StatusServer request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> > StatusNode request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type 
> > 		StatusQueue request
> > 		received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat 
> > 		request received
> > 		from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 		09/03/2007 13:45:02;0100;PBS_Server;Req;;Type 
> > 		AuthenticateUser request
> > 		received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> > 		09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob 
> > 		request received
> > 		from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 		09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from rcatwood at mt-hive2.mt.ic.ac.uk
> > 		_______________________________________________
> > 		torqueusers mailing list
> > 		torqueusers at supercluster.org
> > 		http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > 
> > 		Aaron Knister
> > 		Associate Systems Administrator/Web Designer
> > 		Center for Research on Environment and Water
> > 
> > 		(301) 595-7001
> > 		aaron at iges.org
> > 
> > 
> > 
> > 
> > 
> > 
> > Aaron Knister
> > Associate Systems Administrator/Web Designer
> > Center for Research on Environment and Water
> > 
> > (301) 595-7001
> > aaron at iges.org
> > 
> > 
> > 
> > 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 





