[torqueusers] Help! One Puzzle At a Time... * update#2 *

Ken Nielson knielson at adaptivecomputing.com
Wed Sep 14 21:18:18 MDT 2011


----- Original Message -----
> From: "sam oubari" <soubari at yahoo.com>
> To: torqueusers at supercluster.org
> Sent: Wednesday, September 14, 2011 9:04:41 PM
> Subject: Re: [torqueusers] Help! One Puzzle At a Time...  * update#2 *
>
> Hi,
> 
> I am using TORQUE 2.5.6 with pbs_sched, and everything is running
> 'local'. I am still having problems, so here is a recap:
> 
> 1) A repeating job (it re-qsubs a static script at the end of each run
> to relaunch in 10 or 30 minutes) gets stuck in state Q a couple of
> times a week. In server_logs, there is an odd coinciding entry:
> 
> 09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu
> 
> qstat shows Hold_Types changing from n to o.
> 
> 2) MOM dies about once a week. Clues from /var/log/messages:
> 
> Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4
> 
> I got this after a restart:
> Sep 14 11:41:20 naboo pbs_mom: LOG_ERROR::Invalid argument (22) in
> rm_request, write string failed Supporting protocol failure message
> refused from port 1021 addr 127.0.0.1
> 
> Sometimes, I get:
> Sep 13 08:29:29 naboo pbs_mom:
> LOG_ALERT::mom_server_valid_message_source, bad connect from
> 127.0.0.1:1022 - unauthorized server
> 
> I am running Red Hat 5.6 64-bit with 4 queues (max_running = 1), and
> we average about 1000 qsubs per day (mostly small jobs, 1 minute or
> less). When we were on 2.4.11, MOM ran much better. I am running out
> of ideas, so if you have a similar environment that works, I would
> love to see your settings. For example, what options did you pass to
> 'configure'?
> 
> Thank you, Sam.

Sam,

Have you tried configuring TORQUE using --with-debug and then starting the MOM with gdb to see where the segfault occurs?
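
For reference, a rough sketch of that workflow is below. It assumes you are in the unpacked TORQUE source tree, that gdb, pgrep and addr2line are installed, and that the MOM binary is at /usr/local/sbin/pbs_mom (the default prefix; adjust if yours differs). The function names are made up for illustration. Keep in mind that a rip value only maps back to a source line for the exact binary that crashed, so use one taken from a segfault of the debug build.

```shell
#!/bin/sh
# Sketch only: helper names and paths are assumptions, not part of TORQUE.

# Rebuild TORQUE with debugging symbols. Run from the top of the
# unpacked source tree; "make install" normally needs root.
rebuild_with_debug() {
    ./configure --with-debug && make && make install
}

# Attach gdb to the already-running MOM. At the (gdb) prompt type
# "continue"; when the segfault hits, type "bt" for a backtrace.
attach_gdb_to_mom() {
    gdb -p "$(pgrep -o -x pbs_mom)"
}

# Read kernel segfault lines on stdin and print the rip value of each.
segfault_rip() {
    sed -n 's/.* rip \([0-9a-fA-F][0-9a-fA-F]*\) .*/\1/p'
}

# Map a rip value to a function and source line in a debug binary.
# $1 = path to pbs_mom (assumed /usr/local/sbin/pbs_mom), $2 = rip
resolve_rip() {
    addr2line -f -e "$1" "0x$2"
}

# Example using the line from /var/log/messages quoted above.
line='Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4'
rip=$(printf '%s\n' "$line" | segfault_rip)
echo "rip=0x$rip"
```

If attaching gdb to a production MOM for a week is not practical, capturing the rip from the next segfault of the debug build and passing it to resolve_rip should at least point at the function involved.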

Ken

