[torqueusers] Troubleshooting qsub -I interactive queue jobs
Atwood, Robert C
r.atwood at imperial.ac.uk
Wed Sep 5 07:35:47 MDT 2007
Yes, batch jobs appear to work correctly.
tracejob does not return anything that explains to me what happened:
the job ran and exited with status -1. The console showed no symptoms
beyond the messages captured below, and no input prompt appeared
between them.
I believe you are correct that it's a nameserver issue, but I don't see
why the node loses track of the address of the master just on the
internal network. After all, the master is stuffed into each node's
/etc/hosts as well.
So I am going to ask those who can hopefully help sort out the DHCP
setup; any Torque-specific information about how name resolution
affects this kind of job would still be helpful. I think the
diagnostic output contains insufficient information to find out what the
problem actually is, however! Happy to be shown otherwise.
Robert
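
A minimal sketch of how the name-resolution question above could be checked from a node, assuming Python is available there. The function name check_names and the injectable resolve parameter are illustrative choices, not part of Torque; the default resolver is the standard library's socket.gethostbyname, which follows the system's normal lookup order.

```python
import socket


def check_names(names, resolve=socket.gethostbyname):
    """Return {name: address or None} for each name, using the given resolver."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            results[name] = None
    return results
```

Run on node04 with the names that appear in this thread, for example check_names(["master", "master.beowulf.cluster", "mt-hive2", "mt-hive2.mt.ic.ac.uk"]); any name reported as None does not resolve from that node.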
>>>>>>>>>>> node /etc/hosts file <<<<<<<<<<<<<<<<<
127.0.0.1 localhost.localdomain localhost
10.141.255.254 master.beowulf.cluster master hive2 mt-hive2
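
The hosts file above can also be checked offline against the names used elsewhere in this message. The sketch below parses the two lines shown and reports which wanted names have no entry. It is an assumption, not something the logs confirm, that the node needs to resolve the fully qualified name that appears in the job ID (mt-hive2.mt.ic.ac.uk); the helper names parse_hosts and missing_names are made up for illustration.

```python
# The two hosts-file lines quoted above, verbatim.
HOSTS_TEXT = """\
127.0.0.1 localhost.localdomain localhost
10.141.255.254 master.beowulf.cluster master hive2 mt-hive2
"""


def parse_hosts(text):
    """Map every name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)  # keep the first mapping seen
    return table


def missing_names(text, wanted):
    """Return the wanted names that have no entry in the hosts text."""
    table = parse_hosts(text)
    return [name for name in wanted if name not in table]


print(missing_names(HOSTS_TEXT, ["master", "mt-hive2", "mt-hive2.mt.ic.ac.uk"]))
```

With the file as shown, the short alias mt-hive2 is present but the fully qualified mt-hive2.mt.ic.ac.uk is not, so that name would have to come from DNS on the node.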
>>>>>>>>>>>>> command line capture <<<<<<<<<<<<<<<<<<<<<<
> ~> qsub -I -l nodes=node04.beowulf.cluster
> qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
>>>>>>>>> tracejob -n 5 results <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Job: 12892.mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32 S enqueuing into default, state 1 hop 1
09/03/2007 13:44:32 S dequeuing from default, state QUEUED
09/03/2007 13:44:32 S enqueuing into short, state 1 hop 1
09/03/2007 13:44:32 S Job Queued at request of
rcatwood at mt-hive2.mt.ic.ac.uk, owner =
rcatwood at mt-hive2.mt.ic.ac.uk, job name =
STDIN, queue = short
09/03/2007 13:44:32 S Job Modified at request of
Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32 S Job Run at request of
Scheduler at mt-hive2.mt.ic.ac.uk
09/03/2007 13:44:32 A queue=default
09/03/2007 13:44:32 A queue=short
09/03/2007 13:44:37 L Job Run
09/03/2007 13:44:37 A user=rcatwood group=pg jobname=STDIN
queue=short ctime=1188823472 qtime=1188823472
etime=1188823472 start=1188823477
exec_host=node04.beowulf.cluster/0
Resource_List.neednodes=node04.beowulf.cluster
Resource_List.nice=16
Resource_List.nodect=1
Resource_List.nodes=node04.beowulf.cluster
Resource_List.walltime=04:00:00
09/03/2007 13:44:52 S Exit_status=-1 resources_used.cput=00:00:00
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:20
09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=50)
09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=52)
09/03/2007 13:44:52 S on_job_exit valid pjob: 0x594790 (substate=53)
09/03/2007 13:44:52 S dequeuing from short, state COMPLETE
09/03/2007 13:44:52 A user=rcatwood group=pg jobname=STDIN
queue=short ctime=1188823472 qtime=1188823472
etime=1188823472 start=1188823477
exec_host=node04.beowulf.cluster/0
Resource_List.neednodes=node04.beowulf.cluster
Resource_List.nice=16
Resource_List.nodect=1
Resource_List.nodes=node04.beowulf.cluster
Resource_List.walltime=04:00:00 session=0
end=1188823492 Exit_status=-1
resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:20
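
To make the final accounting record above easier to read, here is a small sketch that splits it into key=value fields and compares the timestamps. The record string is an abbreviated copy of the fields shown in the tracejob output; parse_record is a hypothetical helper, not a Torque tool.

```python
# Abbreviated copy of the final accounting ("A") record from the tracejob output.
RECORD = (
    "user=rcatwood group=pg jobname=STDIN queue=short "
    "ctime=1188823472 qtime=1188823472 etime=1188823472 start=1188823477 "
    "exec_host=node04.beowulf.cluster/0 "
    "Resource_List.walltime=04:00:00 session=0 "
    "end=1188823492 Exit_status=-1 "
    "resources_used.cput=00:00:00 resources_used.walltime=00:00:20"
)


def parse_record(text):
    """Split a whitespace-separated key=value record into a dict of strings."""
    return dict(field.split("=", 1) for field in text.split() if "=" in field)


fields = parse_record(RECORD)
elapsed = int(fields["end"]) - int(fields["start"])         # seconds from start to end
since_creation = int(fields["end"]) - int(fields["ctime"])  # seconds from creation to end

print("exit status:", fields["Exit_status"], "session:", fields["session"])
print("end - start:", elapsed, "s; end - ctime:", since_creation, "s")
```

The record shows session=0 and Exit_status=-1, which is at least consistent with the MOM log message "job not started". The job ended 15 seconds after its recorded start, and the reported walltime of 20 seconds equals end minus ctime here.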
> -----Original Message-----
> From: Aaron Knister [mailto:aaron at iges.org]
> Sent: 04 September 2007 01:45
> To: Atwood, Robert C
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Troubleshooting qsub -I
> interactive queue jobs
>
> Batch jobs run fine, you say?
>
> Also can you run tracejob jobid on an interactive job you
> tried (that failed).
>
> -Aaron
>
> On Sep 3, 2007, at 11:17 AM, Atwood, Robert C wrote:
>
>
> Hi:
>
> Yes, the dhcp client was overwriting /etc/resolv.conf. I tried
> stuffing everything in /etc/hosts, but that is not working. Perhaps I
> have done it incorrectly. But I am able to log in to nodes by name, and
> to log in from a node to the master either as 'master' or by its
> hostname. However, I cannot log in to an outside machine from the node
> using its hostname. I also need to solve that (not for Torque, though;
> if you can recommend a document to read I would be grateful. I have not
> actually had to deal with this before, since the initial configuration
> JUST WORKED until now, as it did on the previous cluster).
>
>
> With this situation I have the behaviour I described in the previous
> message.
>
> Thanks again
> Robert
>
>
> I have done the files like so.
>
> /etc/hosts:
>
>
> 127.0.0.1 localhost.beowulf.cluster localhost
> 10.141.255.254 master.beowulf.cluster master
>
> 10.141.0.1 node01 node01.beowulf.cluster
> (etc)
>
> /var/spool/torque/server_priv/nodes:
>
> node01 np=2 x11
> (etc)
>
>
>
>
> -----Original Message-----
> From: Aaron Knister [mailto:aaron at iges.org]
> Sent: 03 September 2007 15:47
> To: Atwood, Robert C
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Troubleshooting qsub -I
> interactive queue jobs
>
> How is the cluster doing name resolution? Have you stuffed
> everything in /etc/hosts or is there a local dns server
> running? If you're running a local dns server, check to make
> sure the dhcp client hasn't overwritten your local dns server
> in /etc/resolv.conf.
>
> -Aaron
>
> On Sep 3, 2007, at 8:56 AM, Atwood, Robert C wrote:
>
>
> Hi,
> For some reason jobs using qsub -I are immediately exiting. Until very
> recently this was not happening; -I jobs worked correctly. The main
> change is that the outside network has changed, requiring us to run a
> dhcp client for the outside network. I am not sure why this should
> affect the cluster network, but that's all I can think of that's
> different from just a few days ago when this was working.
> I have looked at the MOM log on the node and the server log on the
> master (appended below). I don't see what is wrong; it just says
> 'Failure job exec failure'. What does this mean and how may I find
> out what is causing it?
>
>
> Thanks
> Robert
>
> command line capture <<<<<<<<<<<<<<<<
>
> ~> qsub -I -l nodes=node04.beowulf.cluster
> qsub: waiting for job 12892.mt-hive2.mt.ic.ac.uk to start
> qsub: job 12892.mt-hive2.mt.ic.ac.uk apparently deleted
>
> mom_log on node04 <<<<<<<<<<<<<<<<<<<<
>
> 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type QueueJob
> request received
> from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type
> ReadyToCommit request
> received from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:32;0100; pbs_mom;Req;;Type Commit
> request received
> from PBS_Server at master.beowulf.cluster, sock=10
> 09/03/2007 13:44:39;0100; pbs_mom;Req;;Type StatusJob
> request received
> from PBS_Server at master.beowulf.cluster, sock=11
> 09/03/2007 13:44:52;0001; pbs_mom;Job;TMomFinalizeJob3;job not started,
> Failure job exec failure, before files staged, no retry
> 09/03/2007 13:44:52;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
> 09/03/2007 13:44:52;0100; pbs_mom;Req;;Type DeleteJob
> request received
> from PBS_Server at master.beowulf.cluster, sock=10
>
> server log on master <<<<<<<<<<<<<<<<<<<<
>
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:26;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> AuthenticateUser request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type QueueJob
> request received
> from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> ReadyToCommit request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type Commit
> request received
> from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> default, state 1 hop 1
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> default, state QUEUED
> 09/03/2007 13:44:32;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;enqueuing into
> short, state 1 hop 1
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Queued at request of rcatwood at mt-hive2.mt.ic.ac.uk, owner =
> rcatwood at mt-hive2.mt.ic.ac.uk, job name = STDIN, queue = short
> 09/03/2007 13:44:32;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command new
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> StatusServer request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> StatusNode request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> StatusQueue request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type
> ResourceQuery request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type ModifyJob
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Modified at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:32;0100;PBS_Server;Req;;Type RunJob
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:32;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Job
> Run at request of Scheduler at mt-hive2.mt.ic.ac.uk
> 09/03/2007 13:44:37;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command recyc
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type
> AuthenticateUser request
> received from rsingh at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type
> StatusServer request
> received from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:50;0100;PBS_Server;Req;;Type StatusJob
> request received
> from rsingh at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> JobObituary request
> received from pbs_mom at node04, sock=9
> 09/03/2007 13:44:52;0010;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;Exit_status=-1
> resources_used.cput=00:00:00 resources_used.mem=0kb
> resources_used.vmem=0kb resources_used.walltime=00:00:20
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=50)
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=52)
> 09/03/2007 13:44:52;0008;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;on_job_exit
> valid pjob: 0x594790 (substate=53)
> 09/03/2007 13:44:52;0100;PBS_Server;Job;12892.mt-hive2.mt.ic.ac.uk;dequeuing from
> short, state COMPLETE
> 09/03/2007 13:44:52;0040;PBS_Server;Svr;mt-hive2.mt.ic.ac.uk;Scheduler
> sent command term
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> StatusServer request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> StatusNode request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type
> StatusQueue request
> received from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:44:52;0100;PBS_Server;Req;;Type SelStat
> request received
> from Scheduler at mt-hive2.mt.ic.ac.uk, sock=11
> 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type
> AuthenticateUser request
> received from rcatwood at mt-hive2.mt.ic.ac.uk, sock=10
> 09/03/2007 13:45:02;0100;PBS_Server;Req;;Type LocateJob
> request received
> from rcatwood at mt-hive2.mt.ic.ac.uk, sock=9
> 09/03/2007 13:45:02;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> rcatwood at mt-hive2.mt.ic.ac.uk
>
>
> Aaron Knister
> Associate Systems Administrator/Web Designer
> Center for Research on Environment and Water
>
> (301) 595-7001
> aaron at iges.org