[torqueusers] qsub -I from compute node?

Jerry Smith jdsmit at sandia.gov
Wed Oct 18 14:53:05 MDT 2006


Jen,

Are you by chance running Maui or Moab as well?  We see this problem when
the ACLs are set up incorrectly.

Jerry



> 
> Hi All,
> 
> I have users who like to debug using an interactive pbs job.  We have
> 8 nodes designated for job submission and these all work fine when
> submitting batch jobs, but give an error when submitting an
> interactive job...  The only node that will not give an error when
> submitting an interactive job is the node that runs the pbs_server.
> 
> This is what my users (and I) get when submitting on a node that is
> not running the pbs_server:
> 
> [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> [2vt at b07l01 hello-parallel-worlds]$
> 
> Doing a qstat, I briefly see this:
> 
> [2vt at b08l02 ~]$ qstat
> 16225.b08l02        STDIN            2vt                     0 Q interq
> [2vt at b08l02 ~]$ qstat
> 16225.b08l02        STDIN            2vt                     0 R interq
> 
> But then it is gone and I never actually get a node.  The job seems to
> wait for another 30 seconds or so and then give the "apparently
> deleted" message.
> 
> 
> This is what is in the pbs_server log:
> 
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from 2vt at b07l01.oic.ornl.gov, sock=11
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request
> received from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> received from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> into interq, state 1 hop 1
> 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> <snip>
> 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from 2vt at b07l01.oic.ornl.gov, sock=11
> 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> received from 2vt at b07l01.oic.ornl.gov, sock=9
> 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> 2vt at b07l01.oic.ornl.gov
> 
> Does anyone have any ideas for this?  I'd really appreciate the help -
> the users are getting restless. :P
> 
> Thanks!!
> -Jen
> 
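When a pbs_server log is longer than the excerpt quoted above, it can help to filter it down to the rejected requests. The following is a minimal Python sketch, assuming every entry sits on one line and follows the semicolon-separated layout visible in that excerpt (timestamp;code;server;type;object;message). The names LogEntry, parse_entry and find_rejects are hypothetical helpers, not part of TORQUE.

```python
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    """One pbs_server log entry (field names are assumptions)."""
    timestamp: str
    code: str
    server: str
    obj_type: str
    obj_name: str
    message: str


def parse_entry(line: str) -> LogEntry:
    """Split a log line into its six fields.

    Only the first five semicolons are treated as separators, so a
    message that itself contains ';' stays in one piece.
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError("not a pbs_server log entry: %r" % line)
    return LogEntry(*parts)


def find_rejects(lines: Iterable[str]) -> List[LogEntry]:
    """Return the entries whose object field is 'req_reject'."""
    rejects = []
    for line in lines:
        try:
            entry = parse_entry(line)
        except ValueError:
            continue  # skip wrapped or unrelated lines
        if entry.obj_name == "req_reject":
            rejects.append(entry)
    return rejects


# Two entries taken from the quoted log, joined onto single lines.
SAMPLE_LOG = [
    "10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request "
    "received from 2vt at b07l01.oic.ornl.gov, sock=9",
    "10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id), aux=0, type=LocateJob, from "
    "2vt at b07l01.oic.ornl.gov",
]

if __name__ == "__main__":
    for entry in find_rejects(SAMPLE_LOG):
        print(entry.timestamp, entry.message)
```

Log text that has been wrapped by a mail client, as in the quoted message, would need to be rejoined into one entry per line before it is passed to find_rejects.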



