[torqueusers] qsub -I from compute node?

Troy Baer troy at osc.edu
Wed Oct 18 15:20:07 MDT 2006


On Wed, 2006-10-18 at 16:36 -0400, Aquarijen wrote:
> I have users who like to debug using an interactive pbs job.  We have
> 8 nodes designated for job submission and these all work fine when
> submitting batch jobs, but give an error when submitting an
> interactive job...  The only node that will not give an error when
> submitting an interactive job is the node that runs the pbs_server.
> 
> This is what my users (and I) get when submitting on a node that is
> not running the pbs_server:
> 
> [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> [2vt at b07l01 hello-parallel-worlds]$
> 
> Doing a qstat, I briefly see this:
> 
> [2vt at b08l02 ~]$ qstat
> 16225.b08l02        STDIN            2vt                     0 Q interq
> [2vt at b08l02 ~]$ qstat
> 16225.b08l02        STDIN            2vt                     0 R interq
> 
> But then it is gone and I never actually get a node.  The job seems to
> wait for another 30 seconds or so and then give the "apparently
> deleted" message.
> 
> 
> This is what is in the pbs_server log:
> 
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from 2vt at b07l01.oic.ornl.gov, sock=11
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received
> from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> received from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> into interq, state 1 hop 1
> 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> <snip>
> 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from 2vt at b07l01.oic.ornl.gov, sock=11
> 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> received from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> 2vt at b07l01.oic.ornl.gov
> 
> Does anyone have any ideas for this?  I'd really appreciate the help -
> the users are getting restless. :P
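
For anyone lining the server log up against what qsub prints: each record
above appears to be six ';'-separated fields (timestamp, event code, server
name, object type, object name, message).  A minimal sketch for pulling them
apart, assuming that layout holds; the function and key names are made up for
illustration and are not part of TORQUE:

```python
from datetime import datetime


def parse_pbs_log_line(line):
    """Split one log record into its six ';'-separated fields.

    Assumes the layout seen in the excerpt; maxsplit=5 keeps any ';'
    inside the free-text message intact.
    """
    stamp, event, server, obj_type, obj_name, message = line.split(";", 5)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "event": event,          # kept as text, e.g. "0080"
        "server": server,
        "object_type": obj_type,
        "object": obj_name,
        "message": message,
    }


# Two records from the excerpt, with the mail line wrapping undone.
queued = parse_pbs_log_line(
    "10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;"
    "Job Queued at request of 2vt at b07l01.oic.ornl.gov, owner = "
    "2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq"
)
rejected = parse_pbs_log_line(
    "10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;"
    "Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, "
    "from 2vt at b07l01.oic.ornl.gov"
)

gap_seconds = (rejected["time"] - queued["time"]).total_seconds()
print(gap_seconds, rejected["object"])
```

The gap between the "Job Queued" record and the rejected LocateJob request is
the same 30 seconds or so that qsub sits waiting before it reports the job as
"apparently deleted".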

We ran into this a few months ago, but we didn't have a chance to fix or
report the problem at the time.  It's definitely a TORQUE-specific bug,
as OpenPBS on our older systems does *not* behave the same way.

Here's some analysis on the problem that Doug Johnson did when we first
ran into it:
-----
We've run into a problem with interactive jobs and torque-2.0.0p8.
When submitting interactive jobs from hosts other than the node
running pbs_server the interactive job fails.  The end of the line for
the non-interactive job seems to be in,

start_exec.c:TMomFinalizeChild -> conn_qsub -> client_to_svr

The two tests were 

 1.) interactive job from host gri.sf (where pbs server is running)
     host ip is 192.148.250.126
 2.) interactive job from kodos.sf,
     host ip is 192.148.250.127

The salient part of strace of the pbs_mom during job startup is,

1.)

4990  connect(3, {sa_family=AF_INET, sin_port=htons(34250), sin_addr=inet_addr("192.148.250.126")}, 16) = -1 EINPROGRESS (Operation now in progress)
4990  select(4, NULL, [3], NULL, {5, 0}) = 1 (out [3], left {5, 0})
4990  getsockopt(3, SOL_SOCKET, SO_ERROR, [4294967296], [4]) = 0
4990  fcntl(3, F_GETFL)                 = 0x802 (flags O_RDWR|O_NONBLOCK)
4990  fcntl(3, F_SETFL, O_RDWR)         = 0
4990  write(3, "940.gri.sf.osc.edu\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 80) = 80
4990  fcntl(3, F_GETFL)                 = 0x2 (flags O_RDWR)


2.)

4828  connect(3, {sa_family=AF_INET, sin_port=htons(52559), sin_addr=inet_addr("192.148.250.126")}, 16) = -1 EINPROGRESS (Operation now in progress)
4828  select(4, NULL, [3], NULL, {5, 0}) = 1 (out [3], left {5, 0})
4828  getsockopt(3, SOL_SOCKET, SO_ERROR, [4294967407], [4]) = 0
4828  close(3)                          = 0
4828  ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fbfff9e60) = -1 ENOTTY (Inappro
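
A note on reading the two getsockopt() lines.  The option length is reported
as [4], so SO_ERROR is a 4-byte int, but strace printed an 8-byte value.
Assuming the upper 32 bits are just leftover memory (my reading of the trace,
not something it states), the low 32 bits are the actual socket error.  A
small sketch of the decoding; the errno names are hard-coded Linux values so
the result does not depend on the machine running it:

```python
# Linux errno values relevant to a failed connect().
LINUX_ERRNO_NAMES = {
    0: "no error",
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
    113: "EHOSTUNREACH",
}


def decode_so_error(printed_value):
    """Return (low 32 bits, errno name) for a value printed by strace."""
    low32 = printed_value & 0xFFFFFFFF
    return low32, LINUX_ERRNO_NAMES.get(low32, "unknown")


case1 = decode_so_error(4294967296)   # value from trace 1.)
case2 = decode_so_error(4294967407)   # value from trace 2.)
print("case 1:", case1)
print("case 2:", case2)
```

Read that way, case 1 connected cleanly (SO_ERROR 0) and case 2 was refused
(111, ECONNREFUSED), which fits the close(3) that immediately follows it.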


Why is the second case still trying to connect to qsub on gri.sf
(192.148.250.126) when the job came from kodos (192.148.250.127)?  It
looks like bad attributes are being put into the job structure.
-----
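
To make the connect / select / getsockopt(SO_ERROR) sequence in the traces
easier to follow, here is a rough stand-alone sketch of the same pattern.  It
is not TORQUE code: probe_connect is a made-up name, the listener on 127.0.0.1
only stands in for a waiting "qsub -I", and a closed port stands in for "the
wrong host".  It assumes Linux socket semantics.

```python
import errno
import select
import socket


def probe_connect(host, port, timeout=5.0):
    """Non-blocking connect, then select() and SO_ERROR, as in the strace.

    Returns 0 on success or an errno value on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return 0                      # connected immediately
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc                     # failed immediately
        # Wait until the socket is writable, i.e. the connect has finished.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # The connect result is only available through SO_ERROR.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()


# Stand-in for the submitting side: listen on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
qsub_port = listener.getsockname()[1]

right_place = probe_connect("127.0.0.1", qsub_port)   # something is listening
listener.close()
wrong_place = probe_connect("127.0.0.1", qsub_port)   # nothing listening now

print(right_place, wrong_place)
```

The second probe is the analogue of trace 2.): the port is one that qsub on
the submitting host would be listening on, but the connection goes to an
address where nothing is listening on it, so it is refused.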

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701


