[torqueusers] qsub -I from compute node?

Garrick Staples garrick at usc.edu
Wed Oct 18 15:55:08 MDT 2006


On Wed, Oct 18, 2006 at 05:48:28PM -0400, Aquarijen alleged:
> Hi,
> OK, I'll bite - if they are not "compute nodes", then how did you
> designate them as login nodes?  Perhaps I am approaching this wrong -
> I have my "login nodes" set up as compute nodes except that pbs_mom
> doesn't run on them, so they don't get jobs scheduled to them. They
> point to the node with pbs_server and on the server in torque.cfg, I
> have:
> SERVERHOST b08l02
> ALLOWCOMPUTEHOSTSUBMIT true
> 
> Should I be using SUBMITHOSTS along with setting server acl_hosts and
> acl_host_enable instead?  For some reason, I remember this not
> working, but now I can't remember why. What I have commented out in
> the torque.cfg is:
> #SUBMITHOSTS b08l01.oic.ornl.gov,b08l02.oic.ornl.gov,b07l01.oic.ornl.gov,b07l02.oic.ornl.gov,b06l01.oic.ornl.gov,b06l02.oic.ornl.gov,b05l01.oic.ornl.gov,b05l02.oic.ornl.gov
> 
> and I assume this was the last thing I tried in that regard.

If you are on 2.1.x, pbs_server no longer reads torque.cfg; everything
in it has been replaced by proper server attributes.

I just add the login nodes to /etc/hosts.equiv.
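
As a rough sketch of what that looks like (hostnames taken from this
thread; adjust to your site): on the machine running pbs_server,
/etc/hosts.equiv lists one trusted submit host per line.

```
b07l01.oic.ornl.gov
b07l02.oic.ornl.gov
```

If you would rather use server attributes, the old SUBMITHOSTS and
ALLOWCOMPUTEHOSTSUBMIT settings appear to map to submit_hosts and
allow_node_submit. That mapping and the availability of those attribute
names in your release are assumptions here, so check the
pbs_server_attributes man page for your version before relying on them:

```
qmgr -c "set server submit_hosts = b07l01.oic.ornl.gov"
qmgr -c "set server submit_hosts += b07l02.oic.ornl.gov"
qmgr -c "set server allow_node_submit = true"
```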


 
> Thanks!!
> Jen
> 
> On 10/18/06, Garrick Staples <garrick at clusterresources.com> wrote:
> >On Wed, Oct 18, 2006 at 05:20:07PM -0400, Troy Baer alleged:
> >> On Wed, 2006-10-18 at 16:36 -0400, Aquarijen wrote:
> >> > I have users who like to debug using an interactive pbs job.  We have
> >> > 8 nodes designated for job submission and these all work fine when
> >> > submitting batch jobs, but give an error when submitting an
> >> > interactive job...  The only node that will not give an error when
> >> > submitting an interactive job is the node that runs the pbs_server.
> >> >
> >> > This is what my users (and I) get when submitting on a node that is
> >> > not running the pbs_server:
> >> >
> >> > [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> >> > qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> >> > qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> >> > [2vt at b07l01 hello-parallel-worlds]$
> >> >
> >> > Doing a qstat, I briefly see this:
> >> >
> >> > [2vt at b08l02 ~]$ qstat
> >> > 16225.b08l02        STDIN            2vt                     0 Q interq
> >> > [2vt at b08l02 ~]$ qstat
> >> > 16225.b08l02        STDIN            2vt                     0 R interq
> >> >
> >> > But then it is gone and I never actually get a node.  The job seems to
> >> > wait for another 30 seconds or so and then give the "apparently
> >> > deleted" message.
> >> >
> >> >
> >> > This is what is in the pbs_server log:
> >> >
> >> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> >> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> >> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received
> >> > from 2vt at b07l01.oic.ornl.gov, sock=9
> >> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> >> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> >> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> >> > from 2vt at b07l01.oic.ornl.gov, sock=9
> >> > 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> >> > into interq, state 1 hop 1
> >> > 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> >> > Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> >> > 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> >> > <snip>
> >> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> >> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> >> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> >> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> >> > 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> >> > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> >> > 2vt at b07l01.oic.ornl.gov
> >> >
> >> > Does anyone have any ideas for this?  I'd really appreciate the help -
> >> > the users are getting restless. :P
> >>
> >> We ran into this a few months ago, but we didn't have a chance to fix or
> >> report the problem at the time.  It's definitely a TORQUE-specific bug,
> >> as OpenPBS on our older systems does *not* behave the same way.
> >>
> >> Here's some analysis on the problem that Doug Johnson did when we first
> >> ran into it:
> >> -----
> >> We've run into a problem with interactive jobs and torque-2.0.0p8.
> >> When submitting interactive jobs from hosts other than the node
> >> running pbs_server the interactive job fails.  The end of the line for
> >> the non-interactive job seems to be in,
> >
> >My users have 3 dozen interactive jobs running from non-pbs_server hosts
> >right now.  Though I haven't tested from a compute node, we have
> >multiple user login nodes that work fine.
> >
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> 
> 
> -- 
> The more compassionate you are, the more generous you can be. The more
> generous you are, the more loving-friendliness you cultivate to help
> the world.
> 
> -Thich Nhat Hanh, "Buddhist Peacework"

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California