[torqueusers] PBS Error: Execution server rejected request
notinh notien
notinhnotien7 at hotmail.com
Mon Nov 7 17:53:58 MST 2005
Dear Mr. Staples,
Thank you very much for your help. Here is what I found:
[root@master server_logs]# getent hosts master
10.0.1.250 master.stellar.com master
[root@master server_logs]# getent hosts node14
10.0.1.14 node14.stellar.com node14
[root@master server_logs]# getent hosts node10
10.0.1.10 node10.stellar.com node10
[root@master server_logs]# getent hosts node11
10.0.1.11 node11.stellar.com node11
[root@master server_logs]# getent hosts node12
10.0.1.12 node12.stellar.com node12
[root@master server_logs]# getent hosts node13
10.0.1.13 node13.stellar.com node13
[root@master server_logs]# getent hosts node15
10.0.1.15 node15.stellar.com node15
[root@master server_logs]# getent hosts node16
10.0.1.16 node16.stellar.com node16
[root@master server_logs]# getent hosts node01
10.0.1.1 node01.stellar.com node01
[root@master server_logs]# getent hosts node08
10.0.1.8 node08.stellar.com node08
[root@node14 mom_logs]# getent hosts master
10.0.1.250 master.stellar.com master
[root@node15 root]# getent hosts master
10.0.1.250 master.stellar.com master
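The lookups above can also be scripted so that a missing or mistyped entry stands out. The following is only a minimal sketch, assuming a POSIX shell with getent available; the function name check_hosts is made up for illustration, and the node list in the usage comment is simply the set of names checked above. It tests forward lookups on whichever host it is run on, so it would have to be run on the master and on each node to cover both directions.

```shell
#!/bin/sh
# Sketch: check that every host name given as an argument resolves,
# using the same `getent hosts` lookup shown in the transcript above.
# check_hosts is a hypothetical helper name, not part of TORQUE.
check_hosts() {
    failed=0
    for host in "$@"; do
        if entry=$(getent hosts "$host"); then
            # Lookup succeeded: show what the name resolved to.
            printf 'ok       %s -> %s\n' "$host" "$entry"
        else
            # getent returned non-zero: no entry for this name.
            printf 'MISSING  %s\n' "$host"
            failed=1
        fi
    done
    # Non-zero return if at least one name did not resolve.
    return "$failed"
}

# Example (node names taken from this thread):
# check_hosts master node01 node08 node10 node11 node12 node13 node14 node15 node16
```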
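Name resolution looks fine in both directions: the master resolves every node, and node14 and node15 resolve the master. What else could be wrong?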
Thank you.
>From: "notinh notien" <notinhnotien7 at hotmail.com>
>To: garrick at usc.edu, torqueusers at supercluster.org
>Subject: Re: [torqueusers] PBS Error: Execution server rejected request
>Date: Tue, 08 Nov 2005 05:44:51 +0700
>
>Hi, all. Thank you very much for helping me and for providing useful
>information on upgrading the server.
>
>However, the cluster we have is quite busy at the moment and no node is
>idle, so I have to wait for now.
>
>In the meantime, I still have that weird problem.
>After executing 'set node node14 state=free', the server gave this output:
>
>[root@master server_logs]# qmgr
>Max open servers: 4
>Qmgr: print node node14
>#
># Create nodes and set their properties.
>#
>#
># Create and define node node14
>#
># create node node14 # unsupported operation
>set node node14 state = free
>set node node14 properties = all
>set node node14 properties += ia32
>set node node14 properties += computer
>set node node14 ntype = cluster
>set node node14 status = arch=linux
>set node node14 status += uname=Linux node14.stellar.com 2.4.20-31.9bigmem #1 SMP Tue Apr 13 17:11:51 EDT 2004 i686
>set node node14 status += sessions=? 15201
>set node node14 status += nsessions=? 15201
>set node node14 status += nusers=0
>set node node14 status += idletime=349492
>set node node14 status += totmem=1964040kb
>set node node14 status += availmem=1454576kb
>set node node14 status += physmem=2061780kb
>set node node14 status += ncpus=4
>set node node14 status += loadave=0.00
>set node node14 status += netload=2057650261
>set node node14 status += rectime=1131403494
>
>I restarted the MOM on node14, and here is the log:
>
>pbs_mom;n/a;is_update_stat;is_update_stat: sending to server "uname=Linux
>node14.stellar.com 2.4.20-31.9bigmem #1 SMP Tue Apr 13 17:11:51 EDT 2004
>i686"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "sessions=? 15201"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "nsessions=? 15201"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "nusers=0"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "idletime=349688"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "totmem=1964040kb"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "availmem=1454312kb"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "physmem=2061780kb"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "ncpus=4"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "loadave=0.00"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;setting alarm in
>is_update_stat
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
>sending to server "netload=2058097529"
>11/07/2005 15:26:52;0002; pbs_mom;n/a;is_update_stat;status update
>successfully sent to server
>Server's log
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;message '1' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;HELLO received from
>node14
>11/07/2005 14:47:15;0004;PBS_Server;Svr;is_request;sending cluster-addrs to
>node node14
>
>I submitted a job requesting 2 CPUs on node14.
>qstat -f
>Job Id: 8305.master.stellar.com
> Job_Name = Zr
> Job_Owner = notien@master.stellar.com
> job_state = Q
> queue = default
> server = master.stellar.com
> Checkpoint = u
> ctime = Mon Nov 7 14:49:58 2005
> Error_Path = master.stellar.com:/home/notien/testjob5/Zr.e8305
>
> exec_host = node14/1+node14/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = abe
> Mail_Users = notien@192.168.0.13
> mtime = Mon Nov 7 14:50:00 2005
> Output_Path = master.stellar.com:/home/notien/testjob5/Zr.o8305
> Priority = 0
> qtime = Mon Nov 7 14:49:58 2005
> Rerunable = True
> Resource_List.neednodes = node14:ppn=2
> Resource_List.nodect = 1
> Resource_List.nodes = node14:ppn=2
> Resource_List.walltime = 180:00:00
> substate = 10
> Variable_List = PBS_O_HOME=/home/notien,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=notien,
> PBS_O_PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11R6/bin
> :/opt/c3-4/:/usr/local/pbs/bin:/usr/local/pbs/sbin:/usr/local/maui/bin:
> /home/notien/bin,PBS_O_MAIL=/var/spool/mail/notien,
> PBS_O_SHELL=/bin/bash,PBS_O_HOST=master.stellar.com,
> PBS_O_WORKDIR=/home/notien/testjob5,PBS_O_QUEUE=default
> euser = notien
> egroup = notien
> hashname = 8305.master
> queue_rank = 191
> queue_type = E
> comment = Not Running - PBS Error: Execution server rejected request
> etime = Mon Nov 7 14:49:58 2005
>
>Server's log after the job was submitted:
>
>11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:47:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:47:40;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:47:42;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:48:10;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:48:40;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:48:44;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:48:46;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:04;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:49:37;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:49:58;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:49:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:50:09;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:50:11;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:50:19;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:50:49;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:51:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:51:13;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:51:15;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:51:43;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:52:13;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
>11/07/2005 14:52:17;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:52:19;0004;PBS_Server;Svr;WARNING;!!! unable to contact node
>node14 !!!
>11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;message '4' received
>from node14 (10.0.1.14:1023)
>11/07/2005 14:52:37;0004;PBS_Server;Svr;is_request;IS_STATUS received from
>node14
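The server log above mixes two kinds of records for node14: IS_STATUS messages received from the node, and "unable to contact node" warnings raised by the server. Counting each kind makes that pattern easier to follow over a longer period. This is a rough sketch, assuming the log file on disk holds one semicolon-separated record per line (the lines above were wrapped by the mail client); the function name summarize_node and the log file path are placeholders, not anything from TORQUE itself.

```shell
#!/bin/sh
# Sketch: count status messages and contact warnings for one node in a
# pbs_server log. Assumes records of the form
#   date time;code;daemon;class;name;message
# with one record per line, as in the excerpt above.
# summarize_node is a hypothetical helper name.
summarize_node() {
    node=$1
    logfile=$2
    awk -F';' -v node="$node" '
        # Field 6 is the free-text message at the end of each record.
        $6 == ("IS_STATUS received from " node)            { status++ }
        $6 == ("!!! unable to contact node " node " !!!")  { warn++ }
        END {
            printf "%s: IS_STATUS=%d unable_to_contact=%d\n", node, status, warn
        }
    ' "$logfile"
}

# Example (the path is a placeholder for the actual server log file):
# summarize_node node14 /path/to/server_logs/logfile
```

If both counters keep rising together, the node is still reporting in while the server fails to reach it, which narrows where to look next.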
>
>node14 MOM's log (the same as the log above, but accompanied by these lines):
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:07;0002; pbs_mom;n/a;rm_request;setting alarm in
>rm_request
>11/07/2005 15:33:37;0002; pbs_mom;n/a;is_update_stat;composing status
>update for server
>
>Thank you once again.
>
>>From: garrick <garrick at usc.edu>
>>To: torqueusers at supercluster.org
>>Subject: Re: [torqueusers] PBS Error: Execution server rejected request
>>Date: Sat, 5 Nov 2005 00:16:11 -0800
>>
>>On Sat, Nov 05, 2005 at 03:16:39AM +0700, notinh notien alleged:
>> > Thank you, Mr. Staples. Here is the config for the MOM. Originally, there
>> > were no restricted directives. The weird thing is that the other three
>> > cloned nodes have the exact same config file, and they are working right
>> > now.
>>
>>I'm at a loss here. But that old code had a lot of problems with node
>>states. You might just manually set the state in qmgr, 'set node node14
>>state=free', and see if they start talking again.
>>
>>
>> > I actually have a newer version in place, but the cluster is quite busy
>> > and I don't have much experience migrating currently running jobs to a
>> > new server. I found some docs on the site about running 2 servers at
>> > the same time, but I have not located docs that show how to migrate
>> > running jobs to a new server, or how to replace the old server with the
>> > new one with little impact on the jobs. Please help me with these things.
>>
>>It's pretty much painless. Just install the new daemons and restart
>>them. Don't restart MOMs on hosts that have running jobs.
>>
>>I generally do something like this:
>> kill the scheduler
>> wait a few minutes for all new jobs to complete startup
>> restart pbs_server
>> wait a minute, make sure node and job states are updating correctly
>> restart MOMs on all idle nodes
>> wait a minute, make sure node and job states are updating correctly
>> mark busy nodes offline
>> start the scheduler
>> restart MOMs on offline nodes after their jobs exit.
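The "mark busy nodes offline" step above can be done by hand with pbsnodes -o, one node at a time. As a rough sketch only, the loop below picks out nodes that show a jobs attribute in `pbsnodes -a` output and marks them offline. It assumes the usual `pbsnodes -a` layout (node name on an unindented line, attributes indented as `name = value`); the function names busy_nodes and offline_busy_nodes are invented for this example, and the output layout may differ between TORQUE versions, so it is worth checking on your own install first.

```shell
#!/bin/sh
# Sketch: find nodes that currently have jobs and mark them offline.
# Assumption: `pbsnodes -a` prints each node name on an unindented line,
# followed by indented "attribute = value" lines, including a
# "jobs = ..." line for nodes that are running jobs.

# Print the names of nodes that have a jobs attribute.
busy_nodes() {
    pbsnodes -a | awk '
        /^[^ \t]/        { node = $1 }     # unindented line: node name
        /^[ \t]+jobs = / { print node }    # indented jobs line: node is busy
    '
}

# Mark every busy node offline so that no new jobs are placed on it.
# Jobs already running on the node are left alone.
offline_busy_nodes() {
    for node in $(busy_nodes); do
        pbsnodes -o "$node"
    done
}

# Example:
# offline_busy_nodes
```

Once a node's jobs have exited and its MOM has been restarted, `pbsnodes -c <node>` clears the offline flag again.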
>>
>>If you are using maui (or any other software that links to PBS libs), be
>>sure it is built against the _new_ TORQUE libs and not the ones from
>>your old install.
>>
>>--
>>Garrick Staples, Linux/HPCC Administrator
>>University of Southern California
>>
>>_______________________________________________
>>torqueusers mailing list
>>torqueusers at supercluster.org
>>http://www.supercluster.org/mailman/listinfo/torqueusers