[torqueusers] Troubleshooting qsub -I interactive queue jobs

Aaron Knister aaron at iges.org
Mon Sep 3 18:45:29 MDT 2007


Batch jobs run fine, you say?
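
If batch jobs do run, one thing that may be worth ruling out is name
resolution on the compute node. As far as I understand it, an interactive
job needs pbs_mom on the node to connect back to the waiting qsub on the
submit host, which a batch job does not. Below is a minimal Python sketch
of such a check, to be run on the node. The function name unresolved()
and its resolver argument are illustrative and not part of Torque.

```python
import socket


def unresolved(names, resolver=socket.gethostbyname):
    """Return the subset of names that the resolver cannot look up."""
    failed = []
    for name in names:
        try:
            resolver(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            failed.append(name)
    return failed
```

Calling unresolved(["master", "node04", "mt-hive2.mt.ic.ac.uk"]) on node04
returns an empty list only if all three names from this thread resolve
there.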

Also, can you run tracejob <jobid> on one of the interactive jobs that
failed?
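
tracejob collects the entries for one job from the Torque log files. If
it is not at hand, the "Job" class records can be pulled out of a single
log file by hand. Below is a rough Python sketch, assuming the
semicolon-separated layout seen in the logs quoted further down
(timestamp;code;daemon;class;name;message). LogRecord, parse_line() and
job_records() are made-up names.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    code: str
    daemon: str
    kind: str      # record class, e.g. "Req", "Job", "Svr"
    name: str      # job id or function name; may be empty
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its six fields, or return None if malformed."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def job_records(lines: Iterable[str]) -> List[LogRecord]:
    """Keep only the records whose class field is 'Job'."""
    parsed = (parse_line(line) for line in lines)
    return [r for r in parsed if r is not None and r.kind == "Job"]
```

Passing an open log file to job_records() and printing each record's
timestamp and message gives a shorter view of what happened to jobs than
the full log, which is mostly "Req" traffic.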

-Aaron

On Sep 3, 2007, at 11:17 AM, Atwood, Robert C wrote:

> Hi:
>
> Yes, the dhcp client was overwriting /etc/resolv.conf. I tried
> stuffing everything in /etc/hosts, but that is not working. Perhaps I
> have done it incorrectly. But I am able to log in to nodes by name, and
> to log in from a node to the master either as 'master' or by its
> hostname. However, I cannot log in to an outside machine from the node
> using the hostname. I also need to solve that (not for Torque though,
> but if you can recommend a document to read I would be grateful; I have
> not actually had to deal with this before, since the initial
> configuration JUST WORKED until now, and on the previous cluster).
>
>
> With this situation I have the behaviour I described in the previous
> message.
>
> Thanks again
> Robert
>
>
> I have set up the files like so.
>
> /etc/hosts:
>
>
> 127.0.0.1               localhost.beowulf.cluster localhost
> 10.141.255.254          master.beowulf.cluster master
>
> 10.141.0.1             node01 node01.beowulf.cluster
> (etc)
>
> /var/spool/torque/server_priv/nodes:
>
> node01 np=2 x11
> (etc)
>
>
>
>> -----Original Message-----
>> From: Aaron Knister [mailto:aaron at iges.org]
>> Sent: 03 September 2007 15:47
>> To: Atwood, Robert C
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] Troubleshooting qsub -I interactive queue jobs
>>
>> How is the cluster doing name resolution? Have you stuffed
>> everything in /etc/hosts or is there a local dns server
>> running? If you're running a local dns server check to make
>> sure the dhcp client hasn't overwritten your local dns server
>> in /etc/resolv.conf.
>>
>> -Aaron
>>
>> On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
>>
>>
>> 	Hi,
>> 	For some reason jobs using qsub -I are immediately exiting. Until
>> 	very recently this was not happening; -I jobs worked correctly. The
>> 	main change is that the outside network has changed, requiring us to
>> 	run a dhcp client for the outside network. I am not sure why this
>> 	should affect the cluster network, but that's all I can think of
>> 	that's different from just a few days ago when this was working.
>> 	I have looked at the MOM log on the node and the server log on the
>> 	master (appended below). I don't see what is wrong; it just says
>> 	'Failure job exec failure'. What does this mean, and how may I find
>> 	out what is causing it?
>>
>>
>> 	Thanks
>> 	Robert
>>
>>
>>
>>
>> 	command line capture <<<<<<<<<<<<<<<<
>>
>> 	 ~> qsub -I -l nodes=node04.beowulf.cluster
>> 	qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
>> 	qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
>>
>>
>> 	mom_log on node04 <<<<<<<<<<<<<<<<<<<<
>>
>> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at master.beowulf.cluster, sock=10
>> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at master.beowulf.cluster, sock=10
>> 	09/03/2007 13:44:32;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at master.beowulf.cluster, sock=10
>> 	09/03/2007 13:44:39;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master.beowulf.cluster, sock=11
>> 	09/03/2007 13:44:52;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry
>> 	09/03/2007 13:44:52;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
>> 	09/03/2007 13:44:52;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at master.beowulf.cluster, sock=10
>>
>>
>> 	server log on master <<<<<<<<<<<<<<<<<<<<
>>
>> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ReadyToCommit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into default, state 1 hop 1
>> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from default, state QUEUED
>> 	09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into short, state 1 hop 1
>> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner = rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
>> 	09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command new
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
>> 	09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
>> 	09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command recyc
>> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type AuthenticateUser request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusServer request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob request received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at node04, sock=9
>> 	09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:20
>> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=50)
>> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=52)
>> 	09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit valid pjob: 0x594790 (substate=53)
>> 	09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from short, state COMPLETE
>> 	09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler sent command term
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
>> 	09/03/2007 13:45:02;0100;PBS_Server;Req;;Type AuthenticateUser request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
>> 	09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob request received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
>> 	09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from rcatwood at mt-hive2.mt.ic.ac.uk

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org


