[torqueusers] qsub -I from compute node?

Aquarijen aquarijen at gmail.com
Wed Oct 18 15:48:28 MDT 2006


Hi,
OK, I'll bite - if they are not "compute nodes", then how did you
designate them as login nodes?  Perhaps I am approaching this wrong -
I have my "login nodes" set up as compute nodes except that pbs_mom
doesn't run on them, so they don't get jobs scheduled to them. They
point to the node running pbs_server, and in torque.cfg on the server I
have:
SERVERHOST b08l02
ALLOWCOMPUTEHOSTSUBMIT true

Should I be using SUBMITHOSTS along with setting server acl_hosts and
acl_host_enable instead?  For some reason, I remember this not
working, but now I can't remember why. What I have commented out in
the torque.cfg is:
#SUBMITHOSTS b08l01.oic.ornl.gov,b08l02.oic.ornl.gov,b07l01.oic.ornl.gov,b07l02.oic.ornl.gov,b06l01.oic.ornl.gov,b06l02.oic.ornl.gov,b05l01.oic.ornl.gov,b05l02.oic.ornl.gov

and I assume this was the last thing I tried in that regard.
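
For concreteness, here is a rough sketch of what I understand the
acl_hosts route to involve. It only builds and prints qmgr command
lines; it does not run anything. The helper name acl_commands is made
up for illustration, and I am assuming the "+=" form of "set server
acl_hosts" is the right way to append hosts one at a time, so treat it
as a starting point rather than a tested recipe.

```python
# Sketch only: print the qmgr commands for a server host ACL.
# acl_commands is a hypothetical helper; nothing here talks to pbs_server.

SUBMIT_HOSTS = [
    "b08l01.oic.ornl.gov",
    "b08l02.oic.ornl.gov",
    "b07l01.oic.ornl.gov",
    "b07l02.oic.ornl.gov",
    "b06l01.oic.ornl.gov",
    "b06l02.oic.ornl.gov",
    "b05l01.oic.ornl.gov",
    "b05l02.oic.ornl.gov",
]


def acl_commands(hosts):
    """Return qmgr command lines: one acl_hosts append per host,
    followed by the command that switches the host ACL on."""
    cmds = ['qmgr -c "set server acl_hosts += {}"'.format(h) for h in hosts]
    # Enable the ACL last so the list is complete before it takes effect.
    cmds.append('qmgr -c "set server acl_host_enable = True"')
    return cmds


if __name__ == "__main__":
    for line in acl_commands(SUBMIT_HOSTS):
        print(line)
```

The printed lines would be run as a TORQUE manager on the pbs_server
host. Enabling the ACL comes last on purpose, so a partly filled list
never locks out a submit host.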

Thanks!!
Jen

On 10/18/06, Garrick Staples <garrick at clusterresources.com> wrote:
> On Wed, Oct 18, 2006 at 05:20:07PM -0400, Troy Baer alleged:
> > On Wed, 2006-10-18 at 16:36 -0400, Aquarijen wrote:
> > > I have users who like to debug using an interactive pbs job.  We have
> > > 8 nodes designated for job submission and these all work fine when
> > > submitting batch jobs, but give an error when submitting an
> > > interactive job...  The only node that will not give an error when
> > > submitting an interactive job is the node that runs the pbs_server.
> > >
> > > This is what my users (and I) get when submitting on a node that is
> > > not running the pbs_server:
> > >
> > > [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> > > qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> > > qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> > > [2vt at b07l01 hello-parallel-worlds]$
> > >
> > > Doing a qstat, I briefly see this:
> > >
> > > [2vt at b08l02 ~]$ qstat
> > > 16225.b08l02        STDIN            2vt                     0 Q interq
> > > [2vt at b08l02 ~]$ qstat
> > > 16225.b08l02        STDIN            2vt                     0 R interq
> > >
> > > But then it is gone and I never actually get a node.  The job seems to
> > > wait for another 30 seconds or so and then give the "apparently
> > > deleted" message.
> > >
> > >
> > > This is what is in the pbs_server log:
> > >
> > > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> > > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received from
> > > 2vt at b07l01.oic.ornl.gov, sock=9
> > > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> > > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> > > from 2vt at b07l01.oic.ornl.gov, sock=9
> > > 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> > > into interq, state 1 hop 1
> > > 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> > > Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> > > 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> > > <snip>
> > > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> > > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> > > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > > 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> > > 2vt at b07l01.oic.ornl.gov
> > >
> > > Does anyone have any ideas for this?  I'd really appreciate the help -
> > > the users are getting restless. :P
> >
> > We ran into this a few months ago, but we didn't have a chance to fix or
> > report the problem at the time.  It's definitely a TORQUE-specific bug,
> > as OpenPBS on our older systems does *not* behave the same way.
> >
> > Here's some analysis on the problem that Doug Johnson did when we first
> > ran into it:
> > -----
> > We've run into a problem with interactive jobs and torque-2.0.0p8.
> > When submitting interactive jobs from hosts other than the node
> > running pbs_server the interactive job fails.  The end of the line for
> > the non-interactive job seems to be in,
>
> My users have 3 dozen interactive jobs running from non-pbs_server hosts
> right now.  Though I haven't tested from a compute node, we have
> multiple user login nodes that work fine.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
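
In case it helps anyone digging through a pbs_server log like the one
quoted above, here is a small sketch for picking out the rejected
requests. It assumes each entry is a single line of six
semicolon-separated fields, as in the excerpt. The field names
(timestamp, event, server, kind, obj, message) are my own labels, not
official TORQUE terminology, and the sample lines are copied from the
excerpt with the mail wrapping undone.

```python
from collections import namedtuple

# Field names are illustrative labels, not official TORQUE names.
LogEntry = namedtuple("LogEntry", "timestamp event server kind obj message")


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogEntry(*parts)


def rejects(lines):
    """Return the entries whose fifth field is req_reject."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.obj == "req_reject"]


SAMPLE = [
    "10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request "
    "received from 2vt at b07l01.oic.ornl.gov, sock=9",
    "10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id), aux=0, type=LocateJob, from "
    "2vt at b07l01.oic.ornl.gov",
]

if __name__ == "__main__":
    for entry in rejects(SAMPLE):
        print(entry.timestamp, entry.message)
```

On a real log the same functions can be fed an open file object in
place of SAMPLE, since parse_line strips the trailing newline itself.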


-- 
The more compassionate you are, the more generous you can be. The more
generous you are, the more loving-friendliness you cultivate to help
the world.

-Thich Nhat Hanh, "Buddhist Peacework"

