[torqueusers] Help! One Puzzle At a Time... * update#2 *

sam oubari soubari at yahoo.com
Wed Sep 14 21:04:41 MDT 2011

I am using TORQUE 2.5.6 with pbs_sched, and everything is running 'local'. I am still having problems; here is a recap:
1) A repeating job (at the end of each run it re-qsubs a static script so that it relaunches in 10 or 30 minutes) gets stuck in state Q a couple of times a week.  In server_logs, there is an odd coinciding entry:
09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu
qstat shows Hold_Types changing from n to o.
2) MOM dies about once a week. Clues from /var/log/messages:
Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4
After a restart, I got this:
Sep 14 11:41:20 naboo pbs_mom: LOG_ERROR::Invalid argument (22) in rm_request, write string failed Supporting protocol failure  message refused from port 1021 addr
Sometimes, I get:
Sep 13 08:29:29 naboo pbs_mom: LOG_ALERT::mom_server_valid_message_source, bad connect from - unauthorized server
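
For anyone who wants to look for the same pattern as in item 1 in their own server_logs, here is a minimal Python sketch that splits a server log line on semicolons and picks out the "Job Modified" entries for one job. It assumes every line has the same six-field layout as the entry quoted in item 1; the field names (timestamp, event_code, daemon, object_type, object_id, message) and both function names are my own labels for illustration, not anything defined by TORQUE.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class ServerLogEntry(NamedTuple):
    # Field names are illustrative labels, not official TORQUE terminology.
    timestamp: datetime
    event_code: str
    daemon: str
    object_type: str
    object_id: str
    message: str


def parse_server_log_line(line: str) -> Optional[ServerLogEntry]:
    """Split one server_logs line into six fields; return None if it does not fit."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        # Timestamp layout assumed from the quoted entry: MM/DD/YYYY HH:MM:SS
        timestamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return ServerLogEntry(timestamp, *parts[1:])


def job_modifications(lines: Iterable[str], job_id: str) -> Iterator[ServerLogEntry]:
    """Yield the 'Job Modified' entries recorded for the given job id."""
    for line in lines:
        entry = parse_server_log_line(line)
        if (
            entry is not None
            and entry.object_type == "Job"
            and entry.object_id == job_id
            and entry.message.startswith("Job Modified")
        ):
            yield entry


if __name__ == "__main__":
    # The entry quoted in item 1, written on a single line.
    sample = [
        "09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;"
        "Job Modified at request of rpt_prod at naboo.linnbenton.edu",
    ]
    for hit in job_modifications(sample, "6035.naboo.linnbenton.edu"):
        print(hit.timestamp.isoformat(), hit.message)
```

Comparing the timestamps it reports with the times the job got stuck in Q should show whether the modification really lines up with the Hold_Types change.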

I am running Red Hat 5.6 64-bit. We have 4 queues (max_running = 1), and we average about 1000 qsubs per day (mostly small jobs, 1 minute or less).  When we were on 2.4.11, MOM ran much better.  I am running out of ideas, so if you have a similar environment that works, I would love to see your settings.  For example, what options did you pass to 'configure'?
Thank you, Sam.