[torqueusers] Help! Old Puzzle...
sam oubari
soubari at yahoo.com
Thu Aug 2 08:40:48 MDT 2012
Hello All,
I am still puzzled by a job that does not start when its scheduled time arrives. It affects only a repeating job on one queue, which re-qsubs itself at the end of each run at 10- or 30-minute intervals. A couple of times a week it gets stuck in state Q. It always happens during work hours, mostly before 3pm, and often around the supposedly slow lunch hour. In the server_logs there is an odd entry a minute or two before the scheduled start:
07/09/2012 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu
qstat shows Hold_Types changing from n to o. When it happens, we simply run qrun on the stuck job. We average about 1000 qsubs per day, mostly on two queues (most are small jobs, 1 minute or less). Restarting TORQUE weekly did not help. We have a busy but very simple TORQUE 2.5.6 environment (no external nodes or users, all local in a VM host under Oracle VM 2.2.2):
# uname -a
Linux naboo.linnbenton.edu 2.6.18-274.7.1.0.1.el5 #1 SMP Thu Oct 20 22:20:30 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
# qstat -q
server: naboo.linnbenton.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
sys_ban            --      --       --      --     1  17  1   E R
sys_srv            --      --       --      --     8   8 10   E R
sys_tst            --      --       --      --     0   4  1   E R
sys_ban_quick      --      --       --      --     0   0  1   E R
                                              ----- -----
                                                  9    29
# qmgr -c "list que sys_ban"
Queue sys_ban
    queue_type = Execution
    max_queuable = 300
    total_jobs = 19
    state_count = Transit:0 Queued:0 Held:0 Waiting:18 Running:0 Exiting:0
    max_running = 1
    resources_default.nodes = 1
    resources_default.walltime = 168:00:00
    mtime = Sat Jul 28 01:36:45 2012
    resources_assigned.nodect = 0
    enabled = True
    started = True
# ps -ef|grep pbs
root 8860 1 0 Jul27 ? 00:03:32 /usr/local/sbin/pbs_mom
root 8865 1 0 Jul27 ? 00:00:44 /usr/local/sbin/pbs_server
root 8867 1 0 Jul27 ? 00:00:15 /usr/local/sbin/pbs_sched
During installs, I issue:
./configure --enable-docs --disable-dependency-tracking --disable-libtool-lock --with-scp # USED SINCE 2.4.5
We've upgraded several times and I am running out of ideas, so if you have a similar environment that works, I would love to see your settings. For example, what options did you pass to 'configure'?
It was suggested that I use gdb on MOM, but I have not installed gdb yet.
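For anyone who wants to look for the same pattern, here is a minimal sketch of two filters that could gather the evidence described above. It assumes the semicolon-separated server log layout seen in the entry quoted earlier, and the usual "Job Id:" / "Hold_Types = x" layout of qstat -f output. The function names and the log path in the comments are illustrative assumptions, not part of TORQUE.

```shell
#!/bin/sh
# Sketch only: both functions read text on stdin and never call TORQUE.

# List "Job Modified" events from pbs_server log lines.
# Assumed field layout: timestamp;code;daemon;class;job id;message
job_modifications() {
    awk -F';' '$4 == "Job" && $6 ~ /^Job Modified at request of / {
        # Keep only the requester part of the message.
        sub(/^Job Modified at request of /, "", $6)
        print $1 "  " $5 "  " $6
    }'
}

# List jobs whose Hold_Types is anything other than n (none),
# reading the output of "qstat -f".
held_jobs() {
    awk '
        /^Job Id:/                      { id = $3 }
        $1 == "Hold_Types" && $3 != "n" { print id, $3 }
    '
}

# Hypothetical usage (the log directory depends on the install):
#   job_modifications < /var/spool/torque/server_logs/20120709
#   qstat -f | held_jobs
```

Run from cron every minute or so, the second filter should show when the hold appears, and the first shows which account requested the modification. If the hold really is of type o, qrls -h o on the job id (with operator privilege) may be a gentler workaround than qrun, though that is untested here.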
Thank you, Sam.