[torqueusers] Help! One Puzzle At a Time... * update#2 *
soubari at yahoo.com
Wed Sep 14 21:04:41 MDT 2011
I am using 2.5.6 with pbs_sched, and everything is running 'local'. I am still having problems; here is a recap:
1) A repeating job (it re-qsubs a static script at the end of each run to re-launch in 10 or 30 minutes) gets stuck in state Q a couple of times a week. In server_logs, there is an odd coinciding entry:
09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of
rpt_prod at naboo.linnbenton.edu
qstat shows Hold_Types changing from n to o.
2) MOM dies about once a week. Clues from /var/log/messages:
Sep 14 11:33:22 naboo kernel: pbs_mom: segfault at 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4
I got this after restarting MOM:
Sep 14 11:41:20 naboo pbs_mom: LOG_ERROR::Invalid argument (22) in rm_request, write string failed Supporting protocol failure message refused from port 1021 addr 127.0.0.1
Sometimes, I get:
Sep 13 08:29:29 naboo pbs_mom: LOG_ALERT::mom_server_valid_message_source, bad connect from 127.0.0.1:1022 - unauthorized server
I am running Red Hat 5.6 64-bit. We have 4 queues (max_running = 1) and average about 1000 qsubs per day (mostly small jobs, 1 minute or less). When we were on 2.4.11, MOM ran much better. I am running out of ideas, so if you have a similar environment that works, I would love to see your settings. For example, what options did you 'configure' with?
Thank you, Sam.