[torqueusers] Interactive jobs not working

Jason Bacon jwbacon at tds.net
Fri Nov 11 11:39:51 MST 2011


Hi all,

I'm having some trouble with interactive jobs on both torque 2.5.8 and 
3.0.2.

If I submit a script such as this one:

==========
#!/bin/sh

#PBS -N hostname
#PBS -I

hostname
==========

using "qsub hostname.pbs", I get the following email:

Message 1:
 From adm at hpc1.cs.uwm.edu Fri Nov 11 12:17:27 2011
Date: Fri, 11 Nov 2011 12:17:26 -0600 (CST)
From: adm at hpc1.cs.uwm.edu
To: bacon at hpc1.cs.uwm.edu
Subject: PBS JOB 47.hpc1.cs.uwm.edu

PBS Job Id: 47.hpc1.cs.uwm.edu
Job Name:   hostname
Exec host:  hpc1-1/0
Aborted by PBS Server
Job cannot be executed
See Administrator for help

and the following appears in the mom log on the compute node:

11/11/2011 12:19:20;0001;   pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of job launch successfully completed
11/11/2011 12:19:20;0001;   pbs_mom;Job;TMomFinalizeJob3;Job 47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry (see syslog for more information)

Syslog:

Nov 11 12:19:20 hpc1-1 pbs_mom: LOG_ERROR::Connection refused (61) in TMomFinalizeChild, cannot open interactive qsub socket to host hpc1.cs.uwm.edu:28169 - 'cannot connect to port 777 in client_to_svr - connection refused' - check routing tables/multi-homed host issues
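
The syslog message suggests that pbs_mom tries to connect back to a
socket on the submit host (hpc1.cs.uwm.edu:28169 here) and is refused.
A minimal sketch for pulling that endpoint out of a syslog line and
probing it from the compute node follows.  The helper names
qsub_endpoint and probe_endpoint are hypothetical, and the probe
assumes that nc(1) is installed and that qsub is still waiting on the
submit host when it runs; if qsub has already exited, a refusal tells
you nothing new.

```shell
#!/bin/sh

# Print the host:port that pbs_mom reported it could not reach.
# $1 = one complete (unwrapped) syslog line.
qsub_endpoint() {
    printf '%s\n' "$1" |
        sed -n 's/.*interactive qsub socket to host \([^ ]*\) .*/\1/p'
}

# Try a TCP connection to host:port with a 5 second timeout.
# Assumption: nc(1) is available on the compute node.
# $1 = endpoint in host:port form.
probe_endpoint() {
    host=${1%:*}    # everything before the last colon
    port=${1##*:}   # everything after the last colon
    nc -z -w 5 "$host" "$port"
}
```

On the compute node, something like
probe_endpoint "$(qsub_endpoint "$line")" returns 0 if the connection
is accepted and non-zero if it is refused or times out.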

The full log for this job (loglevel=3) is included below.

If I use -I on the command line instead of in the script, it drops me 
into an interactive shell on the scheduled node instead of running the 
script:

==============
FreeBSD hpc1 bacon ~ 454: qsub -I hostname.pbs
qsub: waiting for job 50.hpc1.cs.uwm.edu to start
qsub: job 50.hpc1.cs.uwm.edu ready

12:25PM  up 49 days,  1:50, 0 users, load averages: 0.00, 0.00, 0.00
FreeBSD hpc1-1 bacon ~ 401:
==============

Batch, array, and MPI jobs all work fine.

The only other issue I've seen is that job output does not appear in 
/var/spool/torque/spool during job execution.  Output files are created 
when the job finishes, so the output must be stored somewhere else 
temporarily.  I'm not sure whether this is a software, configuration, or 
documentation issue, or whether it is related to the interactive jobs 
issue.  I can work around it by configuring with --disable-spool.  In 
that case, partial output shows up in the user's home directory as 
expected.
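
One way to look for the temporary location while a job is running is
to search the TORQUE home on the compute node for files named after
the job.  This is only a sketch: find_job_files is a hypothetical
helper, and it assumes that the in-progress output files have names
beginning with the plain job ID, which I have not confirmed.

```shell
#!/bin/sh

# List regular files under a directory whose names start with a job ID.
# $1 = directory to search (e.g. /var/spool/torque)
# $2 = plain job ID (e.g. 50.hpc1.cs.uwm.edu); shell glob characters
#      in the ID are not escaped.
find_job_files() {
    find "$1" -type f -name "$2*" 2>/dev/null
}
```

For example, find_job_files /var/spool/torque 50.hpc1.cs.uwm.edu run
as root on the node while job 50 is executing.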

I've been Googling this for a while and have found several others with 
similar issues, but no solution that works for me.

I'm testing this on a small cluster with no multi-homed hosts and no 
firewall.

Any hints about what the root cause might be would be appreciated.

Thanks,

     Jason

Full mom log:

11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type QueueJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type JobScript from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type ReadyToCommit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type Commit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0001;   pbs_mom;Job;job_nodes;job: 47.hpc1.cs.uwm.edu numnodes=1 numvnod=1
11/11/2011 12:19:20;0001;   pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of job launch successfully completed
11/11/2011 12:19:20;0001;   pbs_mom;Job;TMomFinalizeJob3;Job 47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry (see syslog for more information)
11/11/2011 12:19:20;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080;   pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
11/11/2011 12:19:20;0008;   pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 47.hpc1.cs.uwm.edu, reason: local task termination detected
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;sending preobit jobstat
11/11/2011 12:19:20;0008;   pbs_mom;Job;scan_for_terminated;pid 53916 not tracked, exitcode=254
11/11/2011 12:19:20;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
11/11/2011 12:19:20;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/11/2011 12:19:20;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;performing job clean-up in preobit_reply()
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;epilog subtask created with pid 53917 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type StatusJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008;   pbs_mom;Job;47.hpc1.cs.uwm.edu;checking job post-processing routine
11/11/2011 12:19:20;0080;   pbs_mom;Req;post_epilogue;preparing obit message for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;obit sent to server
11/11/2011 12:19:20;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server@hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008;   pbs_mom;Job;process_request;request type DeleteJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008;   pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job 47.hpc1.cs.uwm.edu in state EXITED
11/11/2011 12:19:20;0080;   pbs_mom;Job;47.hpc1.cs.uwm.edu;removed job script
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002;   pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:57;0002;   pbs_mom;n/a;setmax;setmax: dev /dev/ttyu0 access 1316795691 replaces max 0
11/11/2011 12:19:57;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to hpc1
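
Most of the log above is unrelated to job 47.  A small sketch for
reducing a mom log to the entries that mention one job is shown below.
It assumes the semicolon-separated layout seen in the excerpt
(timestamp; event code; daemon; class; object; message), with one
entry per line in the log file itself.  job_entries is a hypothetical
helper name.

```shell
#!/bin/sh

# Print "timestamp [object] message" for every log entry that mentions
# a job ID anywhere on the line.
# $1 = job ID (e.g. 47.hpc1.cs.uwm.edu), $2 = path to the mom log.
# Note: this is a plain substring match, so "47.hpc1" would also match
# an entry for job "147.hpc1".
job_entries() {
    awk -F';' -v job="$1" '
        index($0, job) {
            # Rejoin the message in case it contains semicolons itself.
            msg = $6
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " [" $5 "] " msg
        }' "$2"
}
```

For example, job_entries 47.hpc1.cs.uwm.edu followed by the path to the
current mom log file on the compute node.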


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



