[torqueusers] Troubleshooting qsub -I interactive queue jobs
James J Coyle
jjc at iastate.edu
Wed Sep 5 10:14:58 MDT 2007
To get /etc/hosts to take precedence over the DNS servers listed in
/etc/resolv.conf, you'll need to change the file /etc/nsswitch.conf.
Try changing the entry for hosts in /etc/nsswitch.conf to:
hosts: files dns
This will read your /etc/hosts file first when resolving names.
You may need to reboot for this to take effect.
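
As a quick check of the lookup order on a node, something like the
following sketch may help. It assumes a glibc-style /etc/nsswitch.conf;
the function name hosts_order is made up for illustration, and "master"
is the name from the node's /etc/hosts file quoted below.

```shell
#!/bin/sh
# Print the lookup sources on the "hosts:" line of an nsswitch.conf-style file.
# hosts_order is a hypothetical helper name, not part of any package.
hosts_order() {
    # $1: path to the file (defaults to /etc/nsswitch.conf)
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' "${1:-/etc/nsswitch.conf}"
}

# After the change, the printed list should start with "files".
hosts_order

# getent resolves names through nsswitch.conf, so it follows the order above.
getent hosts master || echo "master did not resolve"
```

If getent returns the 10.141.x.x address from /etc/hosts, the node is
at least reading the hosts file before asking a nameserver.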
A better solution would be to have correct settings in /etc/resolv.conf
by either
1) Stopping dhcp from overwriting /etc/resolv.conf in the first place
or
2) Overwriting /etc/resolv.conf on reboots yourself by:
A) Making a correct /etc/resolv.conf
B) Copying this to /etc/resolv.conf.correct on each node.
C) On each node, editing the file /etc/rc3.d/S99local to append
the command /bin/cp -f /etc/resolv.conf.correct /etc/resolv.conf
(I am assuming that you come up in run level 3 and S99local is the
last rc script run.)
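
Step C can be scripted so that running it twice does not add the
command twice. This is only a sketch: append_once is a hypothetical
helper name, and it assumes the rc script ends with a newline.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper name used for illustration.
append_once() {
    # $1: file to edit, $2: line to append
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage on each node, as root (paths are the ones given above):
# append_once /etc/rc3.d/S99local '/bin/cp -f /etc/resolv.conf.correct /etc/resolv.conf'
```

The grep options -x and -F match the whole line as a fixed string, so
the slashes and dots in the command are not treated as a pattern.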
It is possible that you may need to restart your network after getting a
correct resolv.conf in place. That is likely done by
/sbin/service network restart
or by invoking the correct rc3.d script with an argument of restart.
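
Since the dhcp client has been rewriting the file, it may be worth
checking after the restart that /etc/resolv.conf still matches the
saved copy. A minimal sketch, where resolv_ok is a hypothetical helper
name:

```shell
#!/bin/sh
# Report whether a live resolv.conf still matches the saved known-good copy.
# resolv_ok is a hypothetical helper name used for illustration.
resolv_ok() {
    # $1: live file, $2: known-good copy
    if cmp -s "$1" "$2"; then
        echo "match"
    else
        echo "differs"
    fi
}

# Usage on a node after restarting the network:
# resolv_ok /etc/resolv.conf /etc/resolv.conf.correct
```

A result of "differs" would suggest that something rewrote the file
again and that option 1 above needs attention.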
>
> Yes, batch jobs appear to work correctly.
> Trace job does not return anything that explains to me what happened,
> the job ran and exited with -1 status. Symptoms from the console running
> were simply nothing, just the messages as captured with no input command
> line available between messages.
>
> I believe you are correct that it's a nameserver issue, but I don't see
> why the node loses track of the address of the master just on the
> internal network. After all, the master is stuffed into each node's
> /etc/hosts as well.
>
> So, I am going to ask questions to those who can hopefully help sort
> out the dhcp; any torque-specific information about how the name
> resolution affects this kind of job would still be helpful. I think the
> diagnostic output contains insufficient information to find out what the
> problem actually is, however! Happy to be shown otherwise.
> Robert
>
> >>>>>>>>>>> node /etc/hosts file <<<<<<<<<<<<<<<<<
> 127.0.0.1 localhost.localdomain localhost
> 10.141.255.254 master.beowulf.cluster master hive2 mt-hive2
>
> >>>>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<<<<<<<
>
> > ~> qsub -I -l nodes=node04.beowulf.cluster
> > qsub: waiting for job
> > 12892.mt-hive2.mt.ic.ac.uk to start
> > qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
>
> >>>>>>>>> tracejob -n 5 results <<<<<<<<<<<<<<<<<<<<<<<<<<<<
> Job: 12892.mt-hive2.mt.ic.ac.uk
>
> 09/03/2007 13:44:32 S enqueuing into default, state 1 hop 1
> 09/03/2007 13:44:32 S dequeuing from default, state QUEUED
> 09/03/2007 13:44:32 S enqueuing into short, state 1 hop 1
> 09/03/2007 13:44:32 S Job Queued at request of
> rcatwood at mt-hive2.mt.ic.ac.uk, owner =
> rcatwood at mt-hive2.mt.ic.ac.uk, job name =
> STDIN, queue = short
> 09/03/2007 13:44:32 S Job Modified at request of
> Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32 S Job Run at request of
> Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32 A queue=default
> 09/03/2007 13:44:32 A queue=short
> 09/03/2007 13:44:37 L Job Run
> 09/03/2007 13:44:37 A user=rcatwood group=pg jobname=STDIN
> queue=short ctime=1188823472 qtime=1188823472
> etime=1188823472 start=1188823477
> exec_host=node04.beowulf.cluster/0
> Resource_List.neednodes=node04.beowulf.cluster
> Resource_List.nice=16
> Resource_List.nodect=1
> Resource_List.nodes=node04.beowulf.cluster
> Resource_List.walltime=04:00:00
> 09/03/2007 13:44:52 S Exit_status=-1 resources_used.cput=00:00:00
> resources_used.mem=0kb
> resources_used.vmem=0kb
> resources_used.walltime=00:00:20
> 09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=50)
> 09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=52)
> 09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=53)
> 09/03/2007 13:44:52 S dequeuing from short, state COMPLETE
> 09/03/2007 13:44:52 A user=rcatwood group=pg jobname=STDIN
> queue=short ctime=1188823472 qtime=1188823472
> etime=1188823472 start=1188823477
> exec_host=node04.beowulf.cluster/0
> Resource_List.neednodes=node04.beowulf.cluster
> Resource_List.nice=16
> Resource_List.nodect=1
> Resource_List.nodes=node04.beowulf.cluster
> Resource_List.walltime=04:00:00 session=0
> end=1188823492 Exit_status=-1
> resources_used.cput=00:00:00
> resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:20
>
> > -----Original Message-----
> > From: Aaron Knister [mailto:aaron at iges.org]
> > Sent: 04 September 2007 01:45
> > To: Atwood, Robert C
> > Cc: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] Troubleshooting qsub -I
> > interactive queue jobs
> >
> > Batch jobs run fine, you say?
> >
> > Also can you run tracejob jobid on an interactive job you
> > tried (that failed).
> >
> > -Aaron
> >
> > On Sep 3, 2007, at 11:17 AM, Atwood, Robert C wrote:
> >
> >
> > Hi:
> >
> > Yes, the dhcp client was overwriting /etc/resolv.conf. I tried
> > stuffing everything in /etc/hosts, but that is not working. Perhaps I
> > have done it incorrectly. But I am able to log in to nodes by name, and
> > log in from node to master either as 'master' or as its hostname.
> > However, I cannot log in to an outside machine from the node using the
> > hostname. I also need to solve that (not for Torque though, but if you
> > can recommend a document to read I would be grateful; I have not
> > actually had to deal with this before since the initial configuration
> > JUST WORKED until now, and on the previous cluster).
> >
> >
> > With this situation I have the behaviour I described in
> > the previous
> > message.
> >
> > Thanks again
> > Robert
> >
> >
> > I have done the files like so.
> >
> > /etc/hosts:
> >
> >
> > 127.0.0.1 localhost.beowulf.cluster localhost
> > 10.141.255.254 master.beowulf.cluster master
> >
> > 10.141.0.1 node01 node01.beowulf.cluster
> > (etc)
> >
> > /var/spool/torque/server_priv/nodes:
> >
> > node01 np=2 x11
> > (etc)
> >
> >
> >
> >
> > -----Original Message-----
> > From: Aaron Knister [mailto:aaron at iges.org]
> > Sent: 03 September 2007 15:47
> > To: Atwood, Robert C
> > Cc: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] Troubleshooting qsub -I
> > interactive queue jobs
> >
> > How is the cluster doing name resolution? Have
> > you stuffed
> > everything in /etc/hosts or is there a local dns server
> > running? If you're running a local dns server
> > check to make
> > sure the dhcp client hasn't overwritten your
> > local dns server
> > in /etc/resolv.conf.
> >
> > -Aaron
> >
> > On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
> >
> >
> > Hi,
> > For some reason jobs using qsub -I are immediately
> > exiting. Until very
> > recently this was not happening, -I jobs worked
> > correctly. The main
> > change is that the outside network has changed requiring
> > us to run dhcp
> > client for the outside network. I am not sure why this
> > should affect the
> > cluster network but that's all I can think of that's
> > different from just
> > a few days ago when this was working.
> > I have looked at MOM log on the node and server log on
> > the master
> > (appended below), I don't see what is wrong, it just
> > says 'Failure job
> > exec failure'? What does this mean and how may I find
> > out what is
> > causing it?
> >
> >
> > Thanks
> > Robert
> >
> > command line capture <<<<<<<<<<<<<<<<
> >
> > ~> qsub -I -l nodes=node04.beowulf.cluster
> > qsub: waiting for job
> > 12892.mt-hive2.mt.ic.ac.uk to start
> > qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
> >
> > mom_log on node04
> > <<<<<<<<<<<<<<<<<<<<
> >
> > 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type QueueJob
> > request received
> > from PBS_Server at master.beowulf.cluster, sock=10
> > 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type
> > ReadyToCommit request
> > received from PBS_Server at master.beowulf.cluster, sock=10
> > 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type Commit
> > request received
> > from PBS_Server at master.beowulf.cluster, sock=10
> > 09/03/2007 13:44:39;0100; pbs_mom;Req;;Type StatusJob
> > request received
> > from PBS_Server at master.beowulf.cluster, sock=11
> > 09/03/2007 13:44:52;0001;
> > pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, before files
> > staged, no retry
> > 09/03/2007 13:44:52;0008;
> > pbs_mom;Req;send_sisters;sending ABORT to
> > sisters
> > 09/03/2007 13:44:52;0100; pbs_mom;Req;;Type DeleteJob
> > request received
> > from PBS_Server at master.beowulf.cluster, sock=10
> >
> > server log on master <<<<<<<<<<<<<<<<<<<<
> >
> > 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > AuthenticateUser request
> > received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob
> > request received
> > from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > ReadyToCommit request
> > received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit
> > request received
> > from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> > default, state 1 hop 1
> > 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> > default, state QUEUED
> > 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> > short, state 1 hop 1
> > 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> > Queued at request of
> > rcatwood at mt-hive2.mt.ic.ac.uk, owner =
> > rcatwood at mt-hive2.mt.ic.ac.uk, job name =
> > STDIN, queue = short
> > 09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> > sent command new
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > StatusServer request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > StatusNode request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > StatusQueue request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> > ResourceQuery request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> > Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> > 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> > Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> > 09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> > sent command recyc
> > 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type
> > AuthenticateUser request
> > received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type
> > StatusServer request
> > received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob
> > request received
> > from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> > JobObituary request
> > received from pbs_mom at node04, sock=9
> > 09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1
> > resources_used.cput=00:00:00 resources_used.mem=0kb
> > resources_used.vmem=0kb resources_used.walltime=00:00:20
> > 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> > valid pjob: 0x594790 (substate=50)
> > 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> > valid pjob: 0x594790 (substate=52)
> > 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> > valid pjob: 0x594790 (substate=53)
> > 09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> > short, state COMPLETE
> > 09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> > sent command term
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> > StatusServer request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> > StatusNode request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> > StatusQueue request
> > received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> > request received
> > from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> > 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type
> > AuthenticateUser request
> > received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> > 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob
> > request received
> > from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> > 09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> > rcatwood at mt-hive2.mt.ic.ac.uk
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> > Aaron Knister
> > Associate Systems Administrator/Web Designer
> > Center for Research on Environment and Water
> >
> > (301) 595-7001
> > aaron at iges.org