[torqueusers] qsub -I from compute node?

Garrick Staples garrick at clusterresources.com
Wed Oct 18 15:27:49 MDT 2006


On Wed, Oct 18, 2006 at 05:20:07PM -0400, Troy Baer alleged:
> On Wed, 2006-10-18 at 16:36 -0400, Aquarijen wrote:
> > I have users who like to debug using an interactive pbs job.  We have
> > 8 nodes designated for job submission and these all work fine when
> > submitting batch jobs, but give an error when submitting an
> > interactive job...  The only node that will not give an error when
> > submitting an interactive job is the node that runs the pbs_server.
> > 
> > This is what my users (and I) get when submitting on a node that is
> > not running the pbs_server:
> > 
> > [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> > qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> > qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> > [2vt at b07l01 hello-parallel-worlds]$
> > 
> > Doing a qstat, I briefly see this:
> > 
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02        STDIN            2vt                     0 Q interq
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02        STDIN            2vt                     0 R interq
> > 
> > But then it is gone and I never actually get a node.  The job seems to
> > wait for another 30 seconds or so and then give the "apparently
> > deleted" message.
> > 
> > 
> > This is what is in the pbs_server log:
> > 
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> > from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> > into interq, state 1 hop 1
> > 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> > Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> > 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> > <snip>
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> > 2vt at b07l01.oic.ornl.gov
> > 
> > Does anyone have any ideas for this?  I'd really appreciate the help -
> > the users are getting restless. :P
> 
> We ran into this a few months ago, but we didn't have a chance to fix or
> report the problem at the time.  It's definitely a TORQUE-specific bug,
> as OpenPBS on our older systems does *not* behave the same way.
> 
> Here's some analysis on the problem that Doug Johnson did when we first
> ran into it:
> -----
> We've run into a problem with interactive jobs and torque-2.0.0p8.
> When submitting interactive jobs from hosts other than the node
> running pbs_server the interactive job fails.  The end of the line for
> the non-interactive job seems to be in,

My users have 3 dozen interactive jobs running from non-pbs_server hosts
right now.  Though I haven't tested from a compute node, we have
multiple user login nodes that work fine.
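
For anyone comparing a working submit host against a failing one, it may
help to pull just the reject records out of the pbs_server log. What
follows is only a rough sketch, not part of TORQUE: it assumes each record
has the six semicolon-separated fields seen in the excerpt quoted above
(timestamp, event code, server name, object class, object name, message),
and the name find_rejects is made up for illustration. The sample records
are copied from the quoted log with the mail line wrapping undone.

```python
import re

# Sample records taken from the pbs_server log quoted in this thread,
# with the mail client's line wrapping undone.
SAMPLE_LOG = """\
10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received from 2vt at b07l01.oic.ornl.gov, sock=9
10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request received from 2vt at b07l01.oic.ornl.gov, sock=9
10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from 2vt at b07l01.oic.ornl.gov
"""

# Matches the message text of a reject record as it appears in the excerpt.
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\(([^)]*)\).*type=(\w+), from (.+)$"
)


def find_rejects(text):
    """Return one dict per reject record found in the log text.

    Assumes (from the excerpt only) six semicolon-separated fields per
    record; lines that do not fit that layout are skipped.
    """
    rejects = []
    for line in text.splitlines():
        fields = line.split(";", 5)
        if len(fields) != 6:
            continue
        timestamp, _event, _server, _obj_class, _obj_name, message = fields
        match = REJECT_RE.search(message)
        if match:
            code, reason, req_type, origin = match.groups()
            rejects.append({
                "timestamp": timestamp,
                "code": int(code),
                "reason": reason,
                "request": req_type,
                "from": origin,
            })
    return rejects


if __name__ == "__main__":
    for rec in find_rejects(SAMPLE_LOG):
        print(rec["timestamp"], rec["code"], rec["reason"],
              rec["request"], rec["from"])
```

Running the same filter over the log for a submission from a login node
that works would show whether the LocateJob reject appears only for the
failing hosts.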


