[torqueusers] Queue is always stuck

David LeBard david.lebard at asu.edu
Wed Mar 9 16:21:58 MST 2005


Hi Torque folks,

Since this message has gone unanswered for a few weeks, I thought I would
resend it to the users group in the hope that someone has seen this
problem before.


Thanks again,
David



Original message:

I have taken a look at the logs and pbsnodes -a and there are two things
that pop up.  First, in the server logs there are always lines like the
following:

  theochem1.la.asu.edu;unable to run job, MOM rejected
  req_reject;Reject reply code=15041( MSG=send failed, STARTING), aux=0,
  type=15, from Scheduler at theochem1.la.asu.edu,

but MOM doesn't say much (or really anything at all) about the
situation.  Second, pbsnodes -a returns all the normal facts about the
nodes, but for every node it reports ncpus=0, which is obviously wrong.
I am not sure whether this is related to the problem, but I thought I
would mention it anyway.
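
For anyone who wants to check the ncpus observation on their own cluster, here is a minimal Python sketch that flags nodes whose status string reports ncpus=0. It assumes the text comes from `pbsnodes -a` with each status field on a single line (not re-wrapped by a mail client, as it is in this message). The function names and the abbreviated SAMPLE text are hypothetical, made up for illustration; they are not part of Torque.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node blocks
        if not line[0].isspace():
            # An unindented line starts a new node block.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def parse_status(status):
    """Split the comma-separated status string into a dict of key=value fields."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def nodes_with_zero_ncpus(text):
    """Return the names of nodes whose status reports ncpus=0."""
    flagged = []
    for name, attrs in parse_pbsnodes(text).items():
        status = parse_status(attrs.get("status", ""))
        if status.get("ncpus") == "0":
            flagged.append(name)
    return flagged


# Abbreviated sample, shortened by hand from the output in this message.
SAMPLE = """\
master
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,nusers=7,physmem=1023136kb,ncpus=0,loadave=0.00

node2
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,nusers=1,physmem=2054120kb,ncpus=0,loadave=2.07
"""

if __name__ == "__main__":
    print(nodes_with_zero_ncpus(SAMPLE))
```

In practice the text could be captured from the real command (for example with subprocess) instead of SAMPLE. The sketch only reports what the status string says; it does not tell you whether ncpus=0 is the cause of the rejected jobs.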

Included for completeness are selections from the server logs, the mom
logs, and the output of pbsnodes -a, so if anything looks strange please
let me know.

Thanks,
David

Server Logs:
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0008;PBS_Server;Job;28.theochem1.la.asu.edu;Job Modified at request of Scheduler at theochem1.la.asu.edu
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at theochem1.la.asu.edu, sock=9
02/23/2005 23:58:10;0008;PBS_Server;Job;28.theochem1.la.asu.edu;Job Run at request of Scheduler at theochem1.la.asu.edu
02/23/2005 23:58:12;0008;PBS_Server;Job;28.theochem1.la.asu.edu;unable to run job, MOM rejected
02/23/2005 23:58:12;0080;PBS_Server;Req;req_reject;Reject reply code=15041( MSG=send failed, STARTING), aux=0, type=15, from Scheduler at theochem1.la.asu.edu
02/23/2005 23:58:10;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at theochem1.la.asu.edu, sock=9
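
To pull only the rejection records out of a long server log, a small Python filter along these lines may help. It assumes each record sits on one line with six semicolon-separated fields, as in the excerpt above once the mail wrapping is undone. The field names in LogRecord and the function names are my own labels for illustration, not official Torque terminology.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are hypothetical labels for the six ';'-separated columns.
    timestamp: str
    code: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into its six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def find_rejections(lines):
    """Return records that mention a MOM rejection or come from req_reject."""
    hits = []
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue  # skip wrapped or malformed lines
        if "MOM rejected" in rec.message or rec.obj_name == "req_reject":
            hits.append(rec)
    return hits


# Three records copied from the server log in this message, one per line.
SAMPLE_LOG = [
    "02/23/2005 23:58:10;0100;PBS_Server;Req;;Type RunJob request received "
    "from Scheduler at theochem1.la.asu.edu, sock=9",
    "02/23/2005 23:58:12;0008;PBS_Server;Job;28.theochem1.la.asu.edu;"
    "unable to run job, MOM rejected",
    "02/23/2005 23:58:12;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15041( MSG=send failed, STARTING), aux=0, type=15, from "
    "Scheduler at theochem1.la.asu.edu",
]

if __name__ == "__main__":
    for rec in find_rejections(SAMPLE_LOG):
        print(rec.timestamp, rec.obj_name, rec.message)
```

The timestamps of the matching records can then be compared against the mom log on the node that was asked to run the job.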

Mom Logs:
02/23/2005 08:53:27;0002;   pbs_mom;Svr;Log;Log opened
02/23/2005 08:53:27;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
02/23/2005 08:53:27;0002;   pbs_mom;Svr;pbs_mom;Is down
02/23/2005 08:53:27;0002;   pbs_mom;Svr;Log;Log closed
02/23/2005 08:53:31;0002;   pbs_mom;Svr;Log;Log opened
02/23/2005 08:53:31;0002;   pbs_mom;n/a;initialize;independent
02/23/2005 08:53:31;0002;   pbs_mom;Svr;pbs_mom;Is up
02/23/2005 08:53:31;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
02/23/2005 09:08:41;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
02/23/2005 09:08:41;0002;   pbs_mom;Svr;pbs_mom;Is down
02/23/2005 09:08:41;0002;   pbs_mom;Svr;Log;Log closed
02/23/2005 09:08:43;0002;   pbs_mom;Svr;Log;Log opened
02/23/2005 09:08:43;0002;   pbs_mom;Svr;restricted;10.0.0.1
02/23/2005 09:08:43;0002;   pbs_mom;n/a;initialize;independent
02/23/2005 09:08:43;0002;   pbs_mom;Svr;pbs_mom;Is up
02/23/2005 09:08:43;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
02/23/2005 09:08:56;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.0.0.1:15001
02/23/2005 09:09:19;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
02/23/2005 09:09:31;0002;   pbs_mom;Svr;pbs_mom;Is down
02/23/2005 09:09:31;0002;   pbs_mom;Svr;Log;Log closed
02/23/2005 09:10:04;0002;   pbs_mom;Svr;Log;Log opened
02/23/2005 09:10:04;0002;   pbs_mom;Svr;restricted;10.0.0.1
02/23/2005 09:10:04;0002;   pbs_mom;n/a;initialize;independent
02/23/2005 09:10:04;0002;   pbs_mom;Svr;pbs_mom;Is up
02/23/2005 09:10:05;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
02/23/2005 09:19:14;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
02/23/2005 09:19:14;0002;   pbs_mom;Svr;pbs_mom;Is down
02/23/2005 09:19:14;0002;   pbs_mom;Svr;Log;Log closed
02/23/2005 09:19:18;0002;   pbs_mom;Svr;Log;Log opened
02/23/2005 09:19:18;0002;   pbs_mom;Svr;restricted;10.0.0.1
02/23/2005 09:19:18;0002;   pbs_mom;Svr;restricted;theochem1.la.asu.edu
02/23/2005 09:19:18;0002;   pbs_mom;n/a;initialize;independent
02/23/2005 09:19:18;0002;   pbs_mom;Svr;pbs_mom;Is up
02/23/2005 09:19:18;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
02/23/2005 09:19:18;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.0.0.1:15001
02/23/2005 09:19:43;0002;   pbs_mom;n/a;is_update_stat;hello sent to server

pbsnodes -a:
master
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,uname=Linux theochem1.la.asu.edu 2.4.9-32.5smp #1 SMP Tue Jun 25 04:03:46 EDT 2002 alpha,sessions=256 31132 31245 14326 14512 32671 10879 15851 15904 32110 32163 32234 1913 2323 2479 2600 8156 8226 12009 12082 22773 22826 10639 21027 21095 21165 22032 22086 22154 22208 22279 24993 25048 25141 11013 27465 27532 27618 27822 28112,nsessions=40,nusers=7,idletime=6013,totmem=5151888kb,availmem=4275696kb,physmem=1023136kb,ncpus=0,loadave=0.00,netload=18446744073416281005,rectime=1109324761

node2
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,uname=Linux node2.cl.theochem.la.asu.edu 2.4.9-32.5smp #1 SMP Tue Jun 25 04:03:46 EDT 2002 alpha,sessions=17787,nsessions=1,nusers=1,idletime=8727995,totmem=3119528kb,availmem=1106248kb,physmem=2054120kb,ncpus=0,loadave=2.07,netload=18446744072445045776,rectime=1109324749

node3
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,uname=Linux node3.cl.theochem.la.asu.edu 2.4.19 #11 SMP Mon Nov 4 12:26:04 EST 2002 alpha,sessions=3038 15283,nsessions=2,nusers=1,idletime=1016922,totmem=4114080kb,availmem=2676176kb,physmem=2065816kb,ncpus=0,loadave=1.00,netload=2039148521,rectime=1109324741

node4
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,uname=Linux node4.cl.theochem.la.asu.edu 2.4.18smp #1 SMP Wed May 29 10:05:45 EDT 2002 alpha,sessions=22304,nsessions=1,nusers=1,idletime=0,totmem=3121144kb,availmem=2104248kb,physmem=1024008kb,ncpus=0,loadave=0.00,netload=411946140,rectime=1109324741

node5
     state = free
     np = 2
     ntype = cluster
     status = arch=linux,uname=Linux node5.cl.theochem.la.asu.edu 2.4.19 #10 SMP Thu Sep 5 16:23:16 EDT 2002 alpha,sessions=? 15201,nsessions=? 15201,nusers=0,idletime=1704925,totmem=4114152kb,availmem=2058856kb,physmem=2065888kb,ncpus=0,loadave=0.00,netload=18446744072160884968,rectime=1109324759



On Mon, 2005-02-21 at 21:32, Garrick Staples wrote:
> On Thu, Feb 17, 2005 at 11:15:22PM -0700, David LeBard alleged:
> > Hi Torque Folks,
> > 
> > I am trying out torque on our alpha cluster, and I have set everything
> > up according to the manual, however I keep running into a problem when I
> > try to submit a job.  Basically, when I submit any job to the queue it
> > goes immediately from status "R" to status "Q", and when I call
> > "qstat -f", I see the following error message:
> > 
> > 
> > comment = Not Running - PBS Error:  MSG=send failed, STARTING
> 
> I don't know specifically what that is referring to, but is there anything
> interesting in server's logs, mom's logs, or the scheduler's logs?
> 
> Make sure that 'pbsnodes -a' lists your nodes correctly as "free" and all of
> the status information.
> 


