[torqueusers] Interactive jobs not working
Jason Bacon
jwbacon at tds.net
Fri Nov 11 11:39:51 MST 2011
Hi all,
I'm having some trouble with interactive jobs on both torque 2.5.8 and
3.0.2.
If I submit a script such as this one:
==========
#!/bin/sh
#PBS -N hostname
#PBS -I
hostname
==========
using "qsub hostname.pbs", I get the following email:
Message 1:
From adm at hpc1.cs.uwm.edu Fri Nov 11 12:17:27 2011
Date: Fri, 11 Nov 2011 12:17:26 -0600 (CST)
From: adm at hpc1.cs.uwm.edu
To: bacon at hpc1.cs.uwm.edu
Subject: PBS JOB 47.hpc1.cs.uwm.edu
PBS Job Id: 47.hpc1.cs.uwm.edu
Job Name: hostname
Exec host: hpc1-1/0
Aborted by PBS Server
Job cannot be executed
See Administrator for help
and the following appears in the mom log on the compute node:
11/11/2011 12:19:20;0001; pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of
job launch successfully completed
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;Job
47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, before files staged, no retry (see
syslog for more information)
Syslog:
Nov 11 12:19:20 hpc1-1 pbs_mom: LOG_ERROR::Connection refused (61) in
TMomFinalizeChild, cannot open interactive qsub socket to host
hpc1.cs.uwm.edu:28169 - 'cannot connect to port 777 in client_to_svr -
connection refused' - check routing tables/multi-homed host issues
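The host and port named in the syslog entry above can be pulled out and probed by hand from the compute node. This is only a sketch: the function names are hypothetical, it assumes the syslog entry sits on one unwrapped line, and it assumes nc (netcat) is installed on the node. The port appears to be chosen per qsub session, so a probe probably only means something while an interactive qsub is still waiting on the submit host.

```shell
#!/bin/sh
# Hypothetical helpers, not part of TORQUE.

# Read pbs_mom syslog lines on stdin and print "host port" for each
# "cannot open interactive qsub socket to host HOST:PORT" message.
qsub_socket_target() {
    sed -n 's/.*interactive qsub socket to host \([^ :]*\):\([0-9][0-9]*\).*/\1 \2/p'
}

# Try a plain TCP connect to HOST PORT with a 5 second timeout.
# Assumes nc supports -z (connect without sending data) and -w (timeout).
probe_target() {
    if nc -z -w 5 "$1" "$2"; then
        echo "connect to $1:$2 succeeded"
    else
        echo "connect to $1:$2 failed"
    fi
}
```

Piping the relevant syslog line into qsub_socket_target gives the two arguments that probe_target expects.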
The full log for this job (loglevel=3) is included below.
If I use -I on the command line instead of in the script, it drops me
into an interactive shell on the scheduled node instead of running the
script:
==============
FreeBSD hpc1 bacon ~ 454: qsub -I hostname.pbs
qsub: waiting for job 50.hpc1.cs.uwm.edu to start
qsub: job 50.hpc1.cs.uwm.edu ready
12:25PM up 49 days, 1:50, 0 users, load averages: 0.00, 0.00, 0.00
FreeBSD hpc1-1 bacon ~ 401:
==============
Batch, array, and MPI jobs all work fine.
The only other issue I've seen is that job output does not appear in
/var/spool/torque/spool during job execution. Output files are created
when the job finishes, so TORQUE must be storing the output somewhere
else temporarily. I'm not sure whether this is a software,
configuration, or documentation issue, or whether it is related to the
interactive jobs issue. I can work around it by configuring with
--disable-spool. In that case, partial output shows up in the user's
home directory as expected.
I've been Googling for a while on this, and found several others with
similar issues, but no solution that works for me.
I'm testing this on a small cluster with no multi-homed hosts and no
firewall.
Any hints about what the root cause might be would be appreciated.
Thanks,
Jason
Full mom log:
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type QueueJob request received
from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
QueueJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type JobScript request received
from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
JobScript from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type ReadyToCommit request
received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
ReadyToCommit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type Commit request received
from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
Commit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0001; pbs_mom;Job;job_nodes;job:
47.hpc1.cs.uwm.edu numnodes=1 numvnod=1
11/11/2011 12:19:20;0001; pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of
job launch successfully completed
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;Job
47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, before files staged, no retry (see
syslog for more information)
11/11/2011 12:19:20;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080; pbs_mom;Svr;scan_for_exiting;searching for
exiting jobs
11/11/2011 12:19:20;0008; pbs_mom;Job;kill_job;scan_for_exiting:
sending signal 9, "KILL" to job 47.hpc1.cs.uwm.edu, reason: local task
termination detected
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;sending
preobit jobstat
11/11/2011 12:19:20;0008; pbs_mom;Job;scan_for_terminated;pid 53916
not tracked, exitcode=254
11/11/2011 12:19:20;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/11/2011 12:19:20;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
of while loop
11/11/2011 12:19:20;0080; pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;performing
job clean-up in preobit_reply()
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;epilog
subtask created with pid 53917 - substate set to JOB_SUBSTATE_OBIT -
registered post_epilogue
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
StatusJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008; pbs_mom;Job;47.hpc1.cs.uwm.edu;checking job
post-processing routine
11/11/2011 12:19:20;0080; pbs_mom;Req;post_epilogue;preparing obit
message for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;obit sent to
server
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type DeleteJob request received
from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type
DeleteJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008; pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job
47.hpc1.cs.uwm.edu in state EXITED
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;removed job
script
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in
rm_request
11/11/2011 12:19:57;0002; pbs_mom;n/a;setmax;setmax: dev /dev/ttyu0
access 1316795691 replaces max 0
11/11/2011 12:19:57;0002; pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to hpc1
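Since the log above was wrapped by the mail client, the records for one job are easier to read once the wrapped lines are rejoined. A minimal sketch, assuming every record starts with an MM/DD/YYYY HH:MM:SS; timestamp and any other line continues the previous record; mom_log_for_job is a hypothetical name, not a TORQUE tool:

```shell
#!/bin/sh
# Hypothetical helper: rejoin wrapped pbs_mom log lines and print only
# the records that mention the given job ID.
# Usage: mom_log_for_job JOBID [FILE...]   (reads stdin if no FILE)
mom_log_for_job() {
    jobid=$1
    shift
    awk -v job="$jobid" '
        # Print the buffered record if it mentions the job ID.
        function emit() { if (rec != "" && index(rec, job) > 0) print rec }

        # A timestamp at the start of a line begins a new record.
        /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9];/ {
            emit()
            rec = $0
            next
        }

        # Anything else is treated as a wrapped continuation line.
        { if (rec != "") rec = rec " " $0 }

        END { emit() }
    ' "$@"
}
```

With the log saved to a file (mom.log is a made-up name here), "mom_log_for_job 47.hpc1.cs.uwm.edu mom.log" would print one line per record for job 47 and skip the rm_request noise.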
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~