[torqueusers] pbs_mom and server issues

scoggins JScoggins at lbl.gov
Fri Aug 29 10:25:16 MDT 2008


There are multiple interfaces, but only one is connected to the network.
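
One way to rule the interface question in or out is to compare the address the server uses (10.0.0.30 in the momctl output quoted below) against the MOM's trusted client list. The following is only a minimal sketch: the helper name `parse_trusted_clients` is hypothetical, the sample text is an abbreviated excerpt of the quoted output, and it assumes the list runs from the "Trusted Client List:" label up to the "Copy Command:" label, as it does in that output.

```python
import re


def parse_trusted_clients(momctl_output):
    """Return the set of addresses printed after 'Trusted Client List:'.

    Hypothetical helper. Assumes the list ends at the 'Copy Command:'
    label, which is how it appears in the output quoted in this thread.
    """
    marker = "Trusted Client List:"
    start = momctl_output.find(marker)
    if start == -1:
        return set()
    tail = momctl_output[start + len(marker):]
    end = tail.find("Copy Command:")
    if end != -1:
        tail = tail[:end]
    # Mail wrapping can split an address across lines, so drop quote
    # markers and all whitespace before splitting on commas.
    joined = re.sub(r"[>\s]+", "", tail)
    return {addr for addr in joined.split(",") if addr}


# Abbreviated excerpt of the momctl -d 9 output, including one wrapped address.
sample = """\
Trusted Client List:
10.0.2.9,10.0.2.7,10.0.2.
1,10.0.2.0,10.0.0.30,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
"""

trusted = parse_trusted_clients(sample)
server_address = "10.0.0.30"  # address shown on the Server[0] line
print(server_address in trusted)
```

If the server address were missing from the list, that would point at the server reaching the MOM over a different interface than expected. Here it is present, so this check alone does not explain the error.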

Thanks

Jackie


On Aug 29, 2008, at 1:29 AM, rishi pathak wrote:

> Do your systems have multiple interfaces? If so, check whether the
> same network is used for communication.
>
> On Fri, Aug 29, 2008 at 6:06 AM, scoggins <jscoggins at lbl.gov> wrote:
> Torque 2.1.3 problem:
>
> I am getting the following error message when I qsub a job:
>
> Message[0] job cannot be started on RM sched-00  - cannot set hostlist: cannot set job '98.sched-00 ' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
>
>
> I cannot figure out why.
>
> I ran pbs_iff -t n0000.ikea 15002 and I get the following error:
> ...
>
>
> poll([{fd=3, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
> fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
> read(3, "+2+15+15005+0+72+41Unknown reque"..., 262144) = 60
> write(2, "pbs_iff: Unknown request MSG=can"..., 51pbs_iff: Unknown request MSG=cannot decode message
> ) = 51
> exit_group(1)                           = ?
>
>
>
> PBS commands output:
>
> pbsnodes -a n0000.ikea
>
> n0000.ikea
>     state = free
>     np = 8
>     properties = ikea,quadcore
>     ntype = cluster
>     status = opsys=linux,uname=Linux n0000.ikea 2.6.18-92.1.10.el5 #1 SMP Tue Aug 5 07:42:41 EDT 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=784246,totmem=48453372kb,availmem=48342900kb,physmem=16443868kb,ncpus=8,loadave=0.00,netload=98910831,state=free,jobs=,varattr=,rectime=1219968506
>
>
> momctl -h n0000.ikea -d 9
>
> Host: n0000.ikea/n0000.ikea   Version: 2.3.1   PID: 27784
> Server[0]: sched-00 (10.0.0.30:15001)
>  Init Msgs Received:     2 hellos/2 cluster-addrs
>  Init Msgs Sent:         3 hellos
>  Last Msg From Server:   2620 seconds (CLUSTER_ADDRS)
>  Last Msg To Server:     10 seconds
> HomeDirectory:          /var/spool/torque/ikea/n0000/mom_priv
> stdout/stderr spool directory: '/var/spool/torque/ikea/n0000/spool/' (3472979 blocks available)
> NOTE:  syslog enabled
> HomeDirectory:          /var/spool/torque/ikea/n0000/mom_priv
> MOM active:             6142 seconds
> Server Update Interval: 45 seconds
> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> MemLocked:              TRUE  (mlock)
> TCP Timeout:            20 seconds
> Prolog:                 /var/spool/torque/ikea/n0000/mom_priv/prologue (disabled)
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    10.0.2.9,10.0.2.7,10.0.2.6,10.0.2.5,10.0.2.4,10.0.2.3,10.0.2.2,10.0.2.1,10.0.2.0,10.0.0.30,10.0.7.18,10.0.7.17,10.0.7.16,10.0.7.15,10.0.7.14,10.0.7.13,10.0.7.12,10.0.7.11,10.0.7.10,10.0.7.9,10.0.7.8,10.0.7.7,10.0.7.6,10.0.7.5,10.0.7.4,10.0.7.3,10.0.7.2,10.0.7.1,10.0.7.0,127.0.0.1
> Copy Command:           /usr/bin/scp -rpB
> NOTE:  no local jobs detected
>
> diagnostics complete
>
> Here is what the server_logs are saying:
>
> 08/28/2008 17:33:09;0001;PBS_Server;Req;;Server could not connect to MOM
> 08/28/2008 17:33:09;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root at sched-00
> 08/28/2008 17:33:09;0008;PBS_Server;Job;101.sched-00;Job Modified at request of root at sched-00
>
>
>
> Jobs stay queued and checkjob shows:
>
> BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)
> Message[0] job cannot be started on RM sched-00.scs.lbl.gov - cannot set hostlist: cannot set job '101.sched-00' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
>
> Message[1] cannot start job on reserved resources - job cannot be started on RM sched-00 - cannot set hostlist: cannot set job '101.sched-00' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
>
> Any help would be much appreciated.
>
> Thanks
>
> Jackie
>
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
> -- 
> Regards--
> Rishi Pathak
> Pune-Maharastra
