[torqueusers] PBS Error: Execution server rejected request

notinh notien notinhnotien7 at hotmail.com
Mon Nov 7 17:53:58 MST 2005


Dear Mr. Staples, thank you very much for your help.  Here is what I 
found:

[root at master server_logs]#getent hosts master
10.0.1.250      master.stellar.com master
[root at master server_logs]#getent hosts node14
10.0.1.14       node14.stellar.com node14
[root at master server_logs]#getent hosts node10
10.0.1.10       node10.stellar.com node10
[root at master server_logs]#getent hosts node11
10.0.1.11       node11.stellar.com node11
[root at master server_logs]#getent hosts node12
10.0.1.12       node12.stellar.com node12
[root at master server_logs]#getent hosts node13
10.0.1.13       node13.stellar.com node13
[root at master server_logs]#getent hosts node15
10.0.1.15       node15.stellar.com node15
[root at master server_logs]#getent hosts node16
10.0.1.16       node16.stellar.com node16
[root at master server_logs]#getent hosts node01
10.0.1.1        node01.stellar.com node01
[root at master server_logs]#getent hosts node08
10.0.1.8        node08.stellar.com node08

[root at node14 mom_logs]# getent hosts master
10.0.1.250      master.stellar.com master

[root at node15 root]# getent hosts master
10.0.1.250      master.stellar.com master
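
To compare these lookups by script rather than by eye, something like the following might do. It is only a sketch: the function names are made up for illustration, the input is the getent output pasted by hand (abbreviated here), and it checks forward lookups only, so it says nothing about reverse lookups or whether the daemons can actually reach each other.

```python
def parse_getent(output):
    """Parse `getent hosts` output into {name: ip} for every name and alias."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ip, names = fields[0], fields[1:]
        for name in names:
            table[name] = ip
    return table


def find_mismatches(views):
    """Return the names that resolve to more than one IP across hosts.

    `views` maps the host where getent was run to its parsed {name: ip} table.
    The result maps each conflicting name to {ip: [hosts that saw that ip]}.
    """
    seen = {}
    for host, table in views.items():
        for name, ip in table.items():
            seen.setdefault(name, {}).setdefault(ip, []).append(host)
    return {name: ips for name, ips in seen.items() if len(ips) > 1}


# Output pasted from the lookups above (abbreviated).
views = {
    "master": parse_getent(
        "10.0.1.250      master.stellar.com master\n"
        "10.0.1.14       node14.stellar.com node14\n"
    ),
    "node14": parse_getent("10.0.1.250      master.stellar.com master\n"),
    "node15": parse_getent("10.0.1.250      master.stellar.com master\n"),
}

print(find_mismatches(views) or "all hosts agree")
```

With the output above, every host sees the same address for each name, which matches what I see by eye.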

Everything looks fine.  What's wrong?

Thank you.

>From: "notinh notien" <notinhnotien7 at hotmail.com>
>To: garrick at usc.edu, torqueusers at supercluster.org
>Subject: Re: [torqueusers] PBS Error: Execution server rejected request
>Date: Tue, 08 Nov 2005 05:44:51 +0700
>
>Hi, all.  Thank you very much for helping me and for providing useful 
>information on upgrading the server.
>
>However, the cluster we have is quite busy at the moment and no node is 
>idle, so I have to wait for now.
>
>In the meantime, I still have that weird problem.
>After executing 'set node node14 state=free', the server gave this output:
>
>[root at master server_logs]#qmgr
>Max open servers: 4
>Qmgr: print node node14
>#
># Create nodes and set their properties.
>#
>#
># Create and define node node14
>#
># create node node14    # unsupported operation
>set node node14 state = free
>set node node14 properties = all
>set node node14 properties += ia32
>set node node14 properties += computer
>set node node14 ntype = cluster
>set node node14 status = arch=linux
>set node node14 status += uname=Linux node14.stellar.com 2.4.20-31.9bigmem 
>#1 SMP Tue Apr 13 17:11:51 EDT 2004 i686
>set node node14 status += sessions=? 15201
>set node node14 status += nsessions=? 15201
>set node node14 status += nusers=0
>set node node14 status += idletime=349492
>set node node14 status += totmem=1964040kb
>set node node14 status += availmem=1454576kb
>set node node14 status += physmem=2061780kb
>set node node14 status += ncpus=4
>set node node14 status += loadave=0.00
>set node node14 status += netload=2057650261
>set node node14 status += rectime=1131403494
>
>Restarted mom on node14 and here is the log:
>
>pbs_mom;n/a;is_update_stat;is_update_stat: sending to server "uname=Linux 
>node14.stellar.com 2.4.20-31.9bigmem #1 SMP Tue Apr 13 17:11:51 EDT 2004 
>i686"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "sessions=? 15201"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "nsessions=? 15201"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "nusers=0"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "idletime=349688"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "totmem=1964040kb"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "availmem=1454312kb"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "physmem=2061780kb"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "ncpus=4"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "loadave=0.00"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;setting alarm in 
>is_update_stat
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
>sending to server "netload=2058097529"
>11/07/2005 15:26:52;0002;   pbs_mom;n/a;is_update_stat;status update 
>successfully sent to server
>Server's log
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '1' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;HELLO received from 
>node14
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;sending cluster-addrs to 
>node node14
>
>Submitted a job requesting 2 CPUs on node14:
>qstat -f
>Job Id: 8305.master.stellar.com
>    Job_Name = Zr
>    Job_Owner = notien at master.stellar.com
>    job_state = Q
>    queue = default
>    server = master.stellar.com
>    Checkpoint = u
>    ctime = Mon Nov  7 14:49:58 2005
>    Error_Path = master.stellar.com:/home/notien/testjob5/Zr.e8305
>
>    exec_host = node14/1+node14/0
>    Hold_Types = n
>    Join_Path = n
>    Keep_Files = n
>    Mail_Points = abe
>    Mail_Users = notien at 192.168.0.13
>    mtime = Mon Nov  7 14:50:00 2005
>    Output_Path = master.stellar.com:/home/notien/testjob5/Zr.o830
>        5
>    Priority = 0
>    qtime = Mon Nov  7 14:49:58 2005
>    Rerunable = True
>    Resource_List.neednodes = node14:ppn=2
>    Resource_List.nodect = 1
>    Resource_List.nodes = node14:ppn=2
>    Resource_List.walltime = 180:00:00
>    substate = 10
>    Variable_List = PBS_O_HOME=/home/notien,PBS_O_LANG=en_US.UTF-8,
>        PBS_O_LOGNAME=notien,
>        PBS_O_PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11R6/bin
>        :/opt/c3-4/:/usr/local/pbs/bin:/usr/local/pbs/sbin:/usr/local/maui/bin:
>        /home/notien/bin,PBS_O_MAIL=/var/spool/mail/notien,
>        PBS_O_SHELL=/bin/bash,PBS_O_HOST=master.stellar.com,
>        PBS_O_WORKDIR=/home/notien/testjob5,PBS_O_QUEUE=default
>    euser = notien
>    egroup = notien
>    hashname = 8305.master
>    queue_rank = 191
>    queue_type = E
>    comment = Not Running - PBS Error: Execution server rejected request
>    etime = Mon Nov  7 14:49:58 2005
>
>Server's log after the job was submitted:
>
>11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:47:40;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:47:42;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:48:44;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:48:46;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:49:37;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:49:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:50:09;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:50:11;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:51:13;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:51:15;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
>11/07/2005 14:52:17;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:52:19;0004;PBS_Server;Svr;WARNING;!!! unable to contact node 
>node14 !!!
>11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;message '4' received 
>from node14 (10.0.1.14:1023)
>11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;IS_STATUS received from 
>node14
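
The log above mixes two kinds of entries: status updates arriving from node14, and warnings that the server cannot contact node14. A rough way to tally them per node is sketched below. It assumes the wrapped log lines have first been rejoined into one line each, and that the fields are separated by semicolons as they appear in the paste; the function name is made up for illustration.

```python
from collections import Counter


def summarize(log_text):
    """Count, per node, IS_STATUS messages received and 'unable to contact' warnings."""
    status = Counter()
    unreachable = Counter()
    for line in log_text.splitlines():
        # timestamp;code;daemon;class;id;message
        parts = line.split(";", 5)
        if len(parts) < 6:
            continue  # not a complete log record
        message = parts[5].strip()
        if message.startswith("IS_STATUS received from "):
            status[message.split()[-1]] += 1
        elif message.startswith("!!! unable to contact node "):
            # the node name is the word just before the trailing "!!!"
            unreachable[message.split()[-2]] += 1
    return status, unreachable


# A few records from the log above, with the mail wrapping undone.
sample = "\n".join([
    "11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;message '4' received from node14 (10.0.1.14:1023)",
    "11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from node14",
    "11/07/2005 14:47:40;0004;PBS_Server;Svr;WARNING;!!! unable to contact node node14 !!!",
    "11/07/2005 14:47:42;0004;PBS_Server;Svr;WARNING;!!! unable to contact node node14 !!!",
])

status, unreachable = summarize(sample)
print(dict(status), dict(unreachable))
```

If both counters are non-zero for the same node, as they are here, that would seem to suggest that messages from the mom reach the server while requests in the other direction fail, though this is only my reading of the log.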
>
>node14 mom's log (the same entries as above, plus these):
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:07;0002;   pbs_mom;n/a;rm_request;setting alarm in 
>rm_request
>11/07/2005 15:33:37;0002;   pbs_mom;n/a;is_update_stat;composing status 
>update for server
>
>Thank you once again.
>
>>From: garrick <garrick at usc.edu>
>>To: torqueusers at supercluster.org
>>Subject: Re: [torqueusers] PBS Error: Execution server rejected request
>>Date: Sat, 5 Nov 2005 00:16:11 -0800
>>
>>On Sat, Nov 05, 2005 at 03:16:39AM +0700, notinh notien alleged:
>> > Thank you, Mr. Staples.  Here is the config for the mom. Originally, there
>> > were no restricted directives.  The weird thing is that the other three
>> > cloned nodes have the exact same config file, and they are working right
>> > now.
>>
>>I'm at a loss here.  But that old code had a lot of problems with node
>>states.  You might just manually set the state in qmgr, 'set node node14
>>state=free', and see if they start talking again.
>>
>>
>> > I actually have a newer version in place, but the cluster is quite busy
>> > and I don't have much experience migrating currently running jobs to a
>> > new server.  I found some docs at the site regarding running 2 servers at
>> > the same time, but I have not located docs that show how to migrate
>> > running jobs to a new server and how to replace the old server with the
>> > new one with little impact on the jobs.  Please help me with these things.
>>
>>It's pretty much painless.  Just install the new daemons and restart
>>them.  Don't restart MOMs on hosts that have running jobs.
>>
>>I generally do something like this:
>>   kill the scheduler
>>   wait a few minutes for all new jobs to complete startup
>>   restart pbs_server
>>   wait a minute, make sure node and job states are updating correctly
>>   restart MOMs on all idle nodes
>>   wait a minute, make sure node and job states are updating correctly
>>   mark busy nodes offline
>>   start the scheduler
>>   restart MOMs on offline nodes after their jobs exit.
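
To make sure I follow the order, here is the procedure above written out as a small planning sketch. It runs no PBS commands; it only splits nodes into idle and busy and lists the steps in sequence. The function names, the node-to-jobs mapping and the job id in the example are made up for illustration.

```python
def plan_mom_restarts(jobs_by_node):
    """Split nodes into those whose MOM can restart now and those that must wait.

    `jobs_by_node` maps node name -> list of job ids currently running there.
    """
    restart_now = sorted(n for n, jobs in jobs_by_node.items() if not jobs)
    offline_first = sorted(n for n, jobs in jobs_by_node.items() if jobs)
    return restart_now, offline_first


def upgrade_steps(jobs_by_node):
    """Return the procedure above as an ordered list of human-readable steps."""
    restart_now, offline_first = plan_mom_restarts(jobs_by_node)
    steps = [
        "kill the scheduler",
        "wait a few minutes for new jobs to finish starting",
        "restart pbs_server",
        "check that node and job states are updating",
    ]
    steps += [f"restart MOM on idle node {n}" for n in restart_now]
    steps.append("check that node and job states are updating")
    steps += [f"mark busy node {n} offline" for n in offline_first]
    steps.append("start the scheduler")
    steps += [f"restart MOM on {n} after its jobs exit" for n in offline_first]
    return steps


# Hypothetical cluster state: node10 is running a job, the others are idle.
example = {"node14": [], "node10": ["8301.master"], "node15": []}
for number, step in enumerate(upgrade_steps(example), start=1):
    print(number, step)
```

The point of the split is that a MOM is never restarted on a node that still has running jobs; busy nodes are only marked offline so that nothing new lands on them before their MOM is restarted.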
>>
>>If you are using maui (or any other software that links to PBS libs), be
>>sure it is built against the _new_ TORQUE libs and not the ones from
>>your old install.
>>
>>--
>>Garrick Staples, Linux/HPCC Administrator
>>University of Southern California
>
>
>>_______________________________________________
>>torqueusers mailing list
>>torqueusers at supercluster.org
>>http://www.supercluster.org/mailman/listinfo/torqueusers
>
