[torqueusers] Help! Old Puzzle...

sam oubari soubari at yahoo.com
Thu Aug 2 08:40:48 MDT 2012


Hello All,
 
I still have an unsolved puzzle: a job does not start when its scheduled time arrives.  It only affects a repeating job on one queue that re-qsubs itself at the end of each run, at 10- or 30-minute intervals.  A couple of times a week, it gets stuck in state Q.  This always happens during work hours, mostly before 3pm, and often around the supposedly slow lunch hour.  In the server_logs, there is an odd entry a minute or two before the scheduled start:
 
07/09/2012 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu
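
To see how often such entries occur, the server log can be filtered by job ID. This is only a rough sketch: the function name job_modified_events is made up for illustration, and it assumes every log record has the same six semicolon-separated fields as the line above.

```shell
#!/bin/sh
# Print "Job Modified" records for one job from server_logs-style text on stdin.
# Assumed record layout (taken from the single entry quoted above):
#   timestamp;code;daemon;class;object;message
# job_modified_events is a hypothetical helper, not part of TORQUE.
job_modified_events() {
    awk -F';' -v job="$1" '
        $5 == job && $6 ~ /^Job Modified/ { print $1 ": " $6 }
    '
}

# Example call (the log path is an assumption; adjust to the local TORQUE home):
#   job_modified_events 6035.naboo.linnbenton.edu < /var/spool/torque/server_logs/20120709
```

Matching on the fifth field rather than grepping the whole line avoids picking up records that merely mention the job ID inside another job's message.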
 
qstat shows Hold_Types changing from n to o.  When it happens, we simply issue qrun on the stuck job.  We average about 1000 qsubs per day, mostly on two queues (most are small jobs of 1 minute or less).  Restarting TORQUE weekly did not help.  We have a busy but very simple TORQUE 2.5.6 environment (no external nodes or users, all local in a VM host under Oracle VM 2.2.2):
 
# uname -a
Linux naboo.linnbenton.edu 2.6.18-274.7.1.0.1.el5 #1 SMP Thu Oct 20 22:20:30 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# qstat -q
server: naboo.linnbenton.edu
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
sys_ban            --      --       --      --    1  17  1   E R
sys_srv            --      --       --      --    8   8 10   E R
sys_tst            --      --       --      --    0   4  1   E R
sys_ban_quick      --      --       --      --    0   0  1   E R
                                               ----- -----
                                                   9    29
# qmgr -c "list que sys_ban"
Queue sys_ban
        queue_type = Execution
        max_queuable = 300
        total_jobs = 19
        state_count = Transit:0 Queued:0 Held:0 Waiting:18 Running:0 Exiting:0
        max_running = 1
        resources_default.nodes = 1
        resources_default.walltime = 168:00:00
        mtime = Sat Jul 28 01:36:45 2012
        resources_assigned.nodect = 0
        enabled = True
        started = True
 
# ps -ef|grep pbs
root      8860     1  0 Jul27 ?        00:03:32 /usr/local/sbin/pbs_mom
root      8865     1  0 Jul27 ?        00:00:44 /usr/local/sbin/pbs_server
root      8867     1  0 Jul27 ?        00:00:15 /usr/local/sbin/pbs_sched

During installs, I issue:
./configure --enable-docs --disable-dependency-tracking --disable-libtool-lock --with-scp  # USED SINCE 2.4.5

We've upgraded several times and I am running out of ideas, so if you have a similar environment that works, I would love to see your settings.  For example, what options did you pass to 'configure'?
 
It was suggested that I run gdb on the MOM, but I have not installed gdb yet.
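
Until the cause is found, the manual qrun could be replaced by a small watchdog that spots jobs whose Hold_Types contains o. The following is a minimal sketch, not something we run today: the function name held_by_operator is made up, and it assumes `qstat -f` prints a "Job Id:" line followed by indented "Hold_Types = ..." lines for each job.

```shell
#!/bin/sh
# List job IDs whose Hold_Types contains "o" (operator hold),
# reading `qstat -f`-style text from stdin.
# held_by_operator is a hypothetical helper, not part of TORQUE.
held_by_operator() {
    awk '
        /^Job Id:/            { id = $3 }               # remember the current job
        /^[ \t]*Hold_Types = / { if ($3 ~ /o/) print id } # e.g. "o" or "uo"
    '
}

# Example use from cron (needs operator or manager rights for qrun):
#   qstat -f | held_by_operator | while read -r id; do qrun "$id"; done
```

Note that this flags every operator-held job, including any held on purpose, so in practice it would need to be limited to the affected queue or job name. Releasing the hold with `qrls -h o` instead of forcing the job with qrun may also be worth trying.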

Thank you, Sam.