[torqueusers] PBS Error: Execution server rejected request

notinh notien notinhnotien7 at hotmail.com
Mon Nov 7 15:44:51 MST 2005


Hi, all.  Thank you very much for helping me and for providing useful 
information about upgrading the server.

However, the cluster we have is quite busy at the moment and no node is 
idle, so I have to wait for now.

In the meantime, I still have that weird problem.
After executing 'set node node14 state=free', the server gave this output:

[root at master server_logs]#qmgr
Max open servers: 4
Qmgr: print node node14
#
# Create nodes and set their properties.
#
#
# Create and define node node14
#
# create node node14    # unsupported operation
set node node14 state = free
set node node14 properties = all
set node node14 properties += ia32
set node node14 properties += computer
set node node14 ntype = cluster
set node node14 status = arch=linux
set node node14 status += uname=Linux node14.stellar.com 2.4.20-31.9bigmem 
#1 SMP Tue Apr 13 17:11:51 EDT 2004 i686
set node node14 status += sessions=? 15201
set node node14 status += nsessions=? 15201
set node node14 status += nusers=0
set node node14 status += idletime=349492
set node node14 status += totmem=1964040kb
set node node14 status += availmem=1454576kb
set node node14 status += physmem=2061780kb
set node node14 status += ncpus=4
set node node14 status += loadave=0.00
set node node14 status += netload=2057650261
set node node14 status += rectime=1131403494

Restarted mom on node14 and here is the log:

pbs_mom;n/a;is_update_stat;is_update_stat: sending to server "uname=Linux 
node14.stellar.com 2.4.20-31.9bigmem #1 SMP Tue Apr 13 17:11:51 EDT 2004 
i686"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "sessions=? 15201"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "nsessions=? 15201"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "nusers=0"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "idletime=349688"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "totmem=1964040kb"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "availmem=1454312kb"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "physmem=2061780kb"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "ncpus=4"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "loadave=0.00"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
is_update_stat
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "netload=2058097529"
11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
Server's log
11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '1' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;HELLO received from 
node14
11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;sending cluster-addrs to 
node node14

Submitted a job requesting 2 CPUs on node14.
qstat -f
Job Id: 8305.master.stellar.com
    Job_Name = Zr
    Job_Owner = notien at master.stellar.com
    job_state = Q
    queue = default
    server = master.stellar.com
    Checkpoint = u
    ctime = Mon Nov  7 14:49:58 2005
    Error_Path = master.stellar.com:/home/notien/testjob5/Zr.e8305

    exec_host = node14/1+node14/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = abe
    Mail_Users = notien at 192.168.0.13
    mtime = Mon Nov  7 14:50:00 2005
    Output_Path = master.stellar.com:/home/notien/testjob5/Zr.o830
        5
    Priority = 0
    qtime = Mon Nov  7 14:49:58 2005
    Rerunable = True
    Resource_List.neednodes = node14:ppn=2
    Resource_List.nodect = 1
    Resource_List.nodes = node14:ppn=2
    Resource_List.walltime = 180:00:00
    substate = 10
    Variable_List = PBS_O_HOME=/home/notien,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=notien,
        
PBS_O_PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11R6/bin
        
:/opt/c3-4/:/usr/local/pbs/bin:/usr/local/pbs/sbin:/usr/local/maui/bin:
        /home/notien/bin,PBS_O_MAIL=/var/spool/mail/notien,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=master.stellar.com,
        PBS_O_WORKDIR=/home/notien/testjob5,PBS_O_QUEUE=default
    euser = notien
    egroup = notien
    hashname = 8305.master
    queue_rank = 191
    queue_type = E
    comment = Not Running - PBS Error: Execution server rejected request
    etime = Mon Nov  7 14:49:58 2005
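
To pull the interesting fields (job_state, comment, Resource_List.nodes) out of a 'qstat -f' dump like the one above, something along these lines might help. It is only a sketch: the function name parse_qstat_full is made up for illustration, and it assumes that any line not of the form "key = value" is a continuation of the previous value and is joined on without a separator.

```python
import re

# An attribute line looks like "    job_state = Q" in the output above.
ATTR_RE = re.compile(r"^\s+([\w.]+) = (.*)$")

def parse_qstat_full(text):
    """Parse one job record from 'qstat -f' text into a dict (hypothetical helper)."""
    attrs = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, including ones added by mail wrapping
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            last = None
            continue
        m = ATTR_RE.match(line)
        if m:
            last = m.group(1)
            attrs[last] = m.group(2).strip()
        elif last is not None:
            # Assumption: wrapped values are continued verbatim, so join
            # them back together with no separator.
            attrs[last] += line.strip()
    return attrs

if __name__ == "__main__":
    sample = (
        "Job Id: 8305.master.stellar.com\n"
        "    job_state = Q\n"
        "    Output_Path = master.stellar.com:/home/notien/testjob5/Zr.o830\n"
        "        5\n"
        "    comment = Not Running - PBS Error: Execution server rejected request\n"
    )
    job = parse_qstat_full(sample)
    print(job["job_state"], "|", job["comment"])
```

The same function can be fed the full record saved to a file, which makes it easier to compare the comment field across several queued jobs.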

Server's log after the job was submitted:

11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:47:40;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:47:42;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:48:44;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:48:46;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:49:37;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:49:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:50:09;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:50:11;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:51:13;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:51:15;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
11/07/2005 14:52:17;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:52:19;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
node14 !!!
11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;message '4' received from 
node14 (10.0.1.14:1023)
11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
node14
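
The server log above interleaves "IS_STATUS received from node14" with "unable to contact node node14" warnings. To count the two side by side over a longer log, a small script like the following could be used. It is a rough sketch, not anything shipped with TORQUE: the names join_wrapped and summarize are invented here, and the semicolon-separated field layout is inferred only from the lines pasted in this message.

```python
import re

# Layout inferred from the pasted lines:
# date time;event code;daemon;class;name;message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});(\w+);\s*([^;]+);([^;]*);([^;]*);(.*)$"
)
STARTS_RECORD = re.compile(r"^\d{2}/\d{2}/\d{4} ")

def join_wrapped(lines):
    """Re-join records that the mail client wrapped onto a second line."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if STARTS_RECORD.match(line) or not records:
            records.append(line)
        else:
            records[-1] = records[-1].rstrip() + " " + line.strip()
    return records

def summarize(lines, node):
    """Count status updates received from, and contact failures for, one node."""
    received = 0
    failed = 0
    for record in join_wrapped(lines):
        m = LINE_RE.match(record)
        if not m:
            continue  # not a complete log record
        words = m.group(6).split()
        if "IS_STATUS received from" in m.group(6) and words[-1] == node:
            received += 1
        elif "unable to contact node" in m.group(6) and node in words:
            failed += 1
    return {"status_received": received, "unable_to_contact": failed}

if __name__ == "__main__":
    excerpt = [
        "11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from ",
        "node14",
        "11/07/2005 14:47:40;0004;PBS_Server;Svr;WARNING;!!! unable to contact node ",
        "node14 !!!",
    ]
    print(summarize(excerpt, "node14"))
```

If both counters keep growing together, status updates from the node are still arriving while the server's own attempts to reach it fail, which is the pattern visible in the excerpt above.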

node14 mom's log (the same as above, but accompanied by these lines):
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
rm_request
11/07/2005 15:33:37;0002;   pbs_mom;n/a;is_update_stat;composing status 
update for server

Thank you once again.

>From: garrick <garrick at usc.edu>
>To: torqueusers at supercluster.org
>Subject: Re: [torqueusers] PBS Error: Execution server rejected request
>Date: Sat, 5 Nov 2005 00:16:11 -0800
>
>On Sat, Nov 05, 2005 at 03:16:39AM +0700, notinh notien alleged:
> > Thank Mr. Staples.  Here is the config for the mom. Originally, there 
>was
> > no restricted directives.  The weird thing is the other three cloned 
>nodes
> > with the exact config file, and they are working right now.
>
>I'm at a loss here.  But that old code had a lot of problems with node
>states.  You might just manually set the state in qmgr, 'set node node14
>state=free', and see if they start talking again.
>
>
> > I actually have newer version in place but the cluster are quite busy 
>and I
> > don't have much experience migrating current running jobs to new server. 
>  I
> > found some docs at the site regarding running 2 servers at the same 
>time,
> > but I have not located docs to show how to migrate running jobs to new
> > server and how to replace old with new server with little impact on the
> > jobs.  Please help me on these things.
>
>It's pretty much painless.  Just install the new daemons and restart
>them.  Don't restart MOMs on hosts that have running jobs.
>
>I generally do something like this:
>   kill the scheduler
>   wait a few minutes for all new jobs to complete startup
>   restart pbs_server
>   wait a minute, make sure node and job states are updating correctly
>   restart MOMs on all idle nodes
>   wait a minute, make sure node and job states are updating correctly
>   mark busy nodes offline
>   start the scheduler
>   restart MOMs on offline nodes after their jobs exit.
>
>If you are using maui (or any other software that links to PBS libs), be
>sure it is built against the _new_ TORQUE libs and not the ones from
>your old install.
>
>--
>Garrick Staples, Linux/HPCC Administrator
>University of Southern California
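
The restart order described in the quoted message can be written down as an ordered checklist. The sketch below only encodes the ordering; the function name upgrade_plan and the node name node01 are placeholders, and the actual commands for each step are deliberately left out because they depend on the installation.

```python
def upgrade_plan(idle_nodes, busy_nodes):
    """Return the quoted restart sequence as a list of (action, target) steps.

    idle_nodes: nodes with no running jobs (MOMs can be restarted right away)
    busy_nodes: nodes with running jobs (MOMs restarted only after jobs exit)
    """
    steps = [
        ("stop scheduler", None),
        ("wait for newly started jobs to finish starting", None),
        ("restart pbs_server", None),
        ("check node and job states", None),
    ]
    steps += [("restart MOM", node) for node in idle_nodes]
    steps.append(("check node and job states", None))
    steps += [("mark offline", node) for node in busy_nodes]
    steps.append(("start scheduler", None))
    # MOMs on busy nodes are never restarted while their jobs are running.
    steps += [("restart MOM after jobs exit", node) for node in busy_nodes]
    return steps

if __name__ == "__main__":
    # node14 and node01 are example names only.
    for number, (action, target) in enumerate(upgrade_plan(["node14"], ["node01"]), 1):
        print(number, action, target or "")
```

Keeping busy and idle nodes in separate lists makes the one hard rule from the quoted advice explicit: a MOM on a node with running jobs is only restarted after the node has been marked offline and its jobs have exited.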


