[torqueusers] Help! One Puzzle At a Time...

Gus Correa gus at ldeo.columbia.edu
Wed Sep 7 07:16:03 MDT 2011


Hi Sam

I added your original message below,
so that other people can read it.

Do you have a ${TORQUE}/mom_priv/config file,
pointing to your pbs_server, probably:
$pbsserver  naboo

[Assuming naboo is the server name in ${TORQUE}/server_name.]
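
If that config file is missing, a minimal one is usually enough on a
single-host setup.  Just a sketch (the $logevent value here is only an
assumption, adjust to taste):

    $pbsserver  naboo
    $logevent   255

Restart pbs_mom afterwards so it picks the file up.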

Did you restart pbs_server after you modified 
${TORQUE}/server_priv/nodes, etc?
(service pbs_server restart)
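
A quick way to check that the server actually sees the node afterwards
(assuming the Torque client commands are installed on the same host):

    pbsnodes -a
    qmgr -c 'list node naboo'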

Anything in your /var/log/messages file telling you
why pbs_mom dies?
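
For instance, something along these lines, assuming the default
/var/spool/torque layout:

    grep -i pbs_mom /var/log/messages
    tail -50 /var/spool/torque/mom_logs/$(date +%Y%m%d)

(The mom log file name is just the date, e.g. 20110731.)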

I hope this helps,
Gus

sam oubari wrote:
> Hi Gus,
>  
> I am using pbs_sched and all is on one server.  To clarify, on 
> occasions, jobs stay in Q until I bounce MOM.  I am pretty sure 
> something is wrong with my only node.  Sam.
>
> Gus Correa gus at ldeo.columbia.edu wrote on Tue Sep 6 08:25:43 MDT 2011:
>  
> Regarding the long time in Q state after H state.
> If you are using the maui scheduler, this may be due to the default
> defertime of 1 hour.
> In this case, try setting it to less.
> For instance, if you want it to be one minute, add this line:
> DEFERTIME 00:01:00
> to your ${MAUI}/maui.cfg file and restart maui.
> 
> See also:
> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
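> 
> (To confirm the new value took effect after the restart, something
> like this should show it, assuming maui's bin directory is on your
> PATH:
>     showconfig | grep -i defertime
> )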
> 
> Not sure if I understood it right, but
> for the 'resource temporarily unavailable' problem,
> qnodes is reporting the 'naboo' node as 'down', hence unavailable.
> It may need a reboot.
> 
> I hope this helps,
> Gus Correa
> 
sam oubari wrote:
 > Hello,
 > I hope someone can help:
 > 1) When we were running 2.4.11, every few weeks, pbs_sched would die. We
 > upgraded using a fresh install to 2.5.6 about two months ago, and we
 > configured it like we did with 2.4.11 using:
 > ./configure --enable-docs --disable-dependency-tracking
 > --disable-libtool-lock --with-scp
 > Now, almost every Sunday at 11pm (we do fire up a few jobs, but we do
 > that every AM and PM), pbs_mom dies or goes defunct, e.g.:
 > $ ps -ef|grep pbs
 > root 6704 1 0 Jul13 ? 00:02:18 /usr/local/sbin/pbs_server
 > root 6910 1 0 Jul13 ? 00:00:56 /usr/local/sbin/pbs_sched
 > rpt_devl 8871 10997 0 Jul31 ? 00:00:00 [pbs_mom] <defunct>
 > root 10997 1 4 Jul25 ? 07:48:14 /usr/local/sbin/pbs_mom
 > Usually, at that time, there are 4 jobs waiting to execute that perform
 > cleanup on 4 DBs, and that seems to get MOM stuck.
 > See Dump-1 at the bottom. Our current config is shown below Dump-1 as
 > Dump-2.
 > 2) Both 2.4.x and 2.5.x occasionally don't schedule a waiting job; if I
 > recall, it goes from W to Q but not to R. When that happens, I force it
 > with qrun.
 > 3) I had manually created server_priv/nodes (I just made np=40; it used
 > to be 20):
 > # echo "naboo np=40">/var/spool/torque/server_priv/nodes
 > But I still cannot verify within qmgr:
 > # qmgr
 > list nodes
 > No Active Nodes, nothing done.
 > 4) I configured manually, starting with "pbs_server -t create", but
 > now I am missing $TORQUE_HOME/mom_priv/config. For my simple install, is
 > it required?
 > 5) Speaking of qmgr, most of the time when I enter it, it quits without
 > output after I issue my first command. I re-enter immediately, and then
 > it accepts all my commands with no problem. This has been true for 2.4.x
 > and 2.5.x.
 > Any ideas? If things don't improve, I am planning to revert to
 > 2.4.x. Thx! Sam.
 > ------
 > Sam Oubari, Manager of Systems & Application Programming
 > Linn-Benton Community College -- Information Services
 > 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
 > Tel: 541-917-4355/Fax: 541-917-4379
 > *********** Dump-1:
 > 07/31/2011 22:55:37;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
 > 2.5.6, loglevel = 0
 > 07/31/2011 23:00:00;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 6780.naboo.linnbenton.edu started, pid = 8294
 > 07/31/2011 23:00:08;0080;
 > pbs_mom;Job;6780.naboo.linnbenton.edu;scan_for_terminated: job
 > 6780.naboo.linnbenton.edu task 1 terminated, sid=8294
 > 07/31/2011 23:00:08;0008; pbs_mom;Job;6780.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:08;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:08;0080; pbs_mom;Job;6780.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:08;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 6781.naboo.linnbenton.edu started, pid = 8439
 > 07/31/2011 23:00:10;0080;
 > pbs_mom;Job;6781.naboo.linnbenton.edu;scan_for_terminated: job
 > 6781.naboo.linnbenton.edu task 1 terminated, sid=8439
 > 07/31/2011 23:00:10;0008; pbs_mom;Job;6781.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:10;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:10;0080; pbs_mom;Job;6781.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 7141.naboo.linnbenton.edu started, pid = 8579
 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 7143.naboo.linnbenton.edu started, pid = 8582
 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 7146.naboo.linnbenton.edu started, pid = 8589
 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
 > 7147.naboo.linnbenton.edu started, pid = 8603
 > 07/31/2011 23:00:10;0080;
 > pbs_mom;Job;7141.naboo.linnbenton.edu;scan_for_terminated: job
 > 7141.naboo.linnbenton.edu task 1 terminated, sid=8579
 > 07/31/2011 23:00:10;0008; pbs_mom;Job;7141.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:10;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:10;0080; pbs_mom;Job;7141.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7143.naboo.linnbenton.edu;scan_for_terminated: job
 > 7143.naboo.linnbenton.edu task 1 terminated, sid=8582
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7146.naboo.linnbenton.edu;scan_for_terminated: job
 > 7146.naboo.linnbenton.edu task 1 terminated, sid=8589
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7147.naboo.linnbenton.edu;scan_for_terminated: job
 > 7147.naboo.linnbenton.edu task 1 terminated, sid=8603
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7143.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7146.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7143.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7146.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7147.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7115.naboo.linnbenton.edu;scan_for_terminated: job
 > 7115.naboo.linnbenton.edu task 1 terminated, sid=28947
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7147.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7115.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7116.naboo.linnbenton.edu;scan_for_terminated: job
 > 7116.naboo.linnbenton.edu task 1 terminated, sid=28966
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7118.naboo.linnbenton.edu;scan_for_terminated: job
 > 7118.naboo.linnbenton.edu task 1 terminated, sid=29020
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Job;7117.naboo.linnbenton.edu;scan_for_terminated: job
 > 7117.naboo.linnbenton.edu task 1 terminated, sid=29083
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7116.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7118.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7116.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7115.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7118.naboo.linnbenton.edu;obit
 > sent to server
 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7117.naboo.linnbenton.edu;job was
 > terminated
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
 > 07/31/2011 23:00:11;0080;
 > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
 > of while loop
 > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
 > error from job stat
 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7117.naboo.linnbenton.edu;obit
 > sent to server
 > <eof>
 >  
 > *********** Dump-2:
 > # printserverdb
 > ---------------------------------------------------
 > numjobs: 50
 > numque: 5
 > jobidnumber: 615
 > savetm: 1314100391
 > --attributes--
 > scheduling = True
 > max_running = 23
 > total_jobs = 22
 > state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
 > default_queue = sys_tst
 > log_events = 511
 > mail_from = adm
 > query_other_jobs = False
 > resources_assigned.nodect = 0
 > scheduler_iteration = 600
 > node_check_rate = 150
 > tcp_timeout = 6
 > mom_job_sync = False
 > pbs_version = 2.5.6
 > keep_completed = 600
 > allow_node_submit = True
 > next_job_number = 1
 > net_counter = 7 1 0
 > # qnodes
 > naboo
 > state = free
 > np = 40
 > ntype = cluster
 > jobs = 0/2.naboo.linnbenton.edu, 2/461.naboo.linnbenton.edu,
 > 3/462.naboo.linnbenton.edu, 4/463.naboo.linnbenton.edu,
 > 5/466.naboo.linnbenton.edu, 6/467.naboo.linnbenton.edu,
 > 7/468.naboo.linnbenton.edu
 > status = rectime=1314203130,varattr=,jobs=2.naboo.linnbenton.edu
 > 461.naboo.linnbenton.edu 462.naboo.linnbenton.edu
 > 463.naboo.linnbenton.edu 466.naboo.linnbenton.edu
 > 467.naboo.linnbenton.edu
 > 
468.naboo.linnbenton.edu,state=free,netload=1125010513873,gres=,loadave=1.37,ncpus=4,physmem=17040092kb,availmem=23052344kb,totmem=29739432kb,idletime=71635,nusers=11,nsessions=163,sessions=612 

 > 614 616 618 620 622 624 626 628 630 632 634 636 638 640 642 644 646 648
 > 650 652 659 661 665 678 680 682 684 686 688 690 692 694 696 698 700 702
 > 704 706 708 710 712 716 720 725 727 729 731 733 735 737 739 741 743 745
 > 747 749 751 753 755 757 759 769 771 776 778 780 782 784 786 788 790 792
 > 794 796 798 800 802 804 806 808 810 820 822 829 899 901 903 905 909 5763
 > 911 1547 1648 2569 2635 2691 2753 2814 2878 2932 5839 5875 7864 7985
 > 7987 7989 7993 7995 7997 7999 8001 8003 8005 8007 8009 8011 8013 8015
 > 8017 8019 8021 8023 8025 8027 8029 8129 8131 8163 8165 8167 10447 10505
 > 10562 10618 10706 11904 11966 11981 12022 12080 12433 13937 14899 15031
 > 15282 16152 17383 22451 23720 31671 31673 31675 31677 31683 31708 31712
 > 31979 32003 32088 32091 32102 32116,uname=Linux naboo.linnbenton.edu
 > 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011
 > x86_64,opsys=linux
 > gpus = 0
 > # qmgr -c "l s"
 > Server naboo.linnbenton.edu
 > server_state = Active
 > scheduling = True
 > max_running = 23
 > total_jobs = 41
 > state_count = Transit:0 Queued:0 Held:3 Waiting:21 Running:7 Exiting:0
 > acl_hosts = naboo.linnbenton.edu,naboo
 > managers =  _usr1_
 > operators =
 > _usr1,_ usr2, ....
 > default_queue = sys_tst
 > log_events = 511
 > mail_from = adm
 > query_other_jobs = False
 > resources_assigned.nodect = 7
 > scheduler_iteration = 600
 > node_check_rate = 150
 > tcp_timeout = 6
 > mom_job_sync = False
 > pbs_version = 2.5.6
 > keep_completed = 600
 > allow_node_submit = True
 > next_job_number = 625
 > net_counter = 6 2 1
 > # qmgr -c "l q sys_ban" # This is our main queue
 > Queue sys_ban
 > queue_type = Execution
 > max_queuable = 300
 > total_jobs = 21
 > state_count = Transit:0 Queued:0 Held:0 Waiting:15 Running:0 Exiting:0
 > max_running = 1
 > resources_default.nodes = 1
 > resources_default.walltime = 168:00:00
 > mtime = Sat Aug 20 05:25:20 2011
 > resources_assigned.nodect = 0
 > enabled = True
 > started = True
 > # qstat -q
 > server: naboo.linnbenton.edu
 > Queue Memory CPU Time Walltime Node Run Que Lm State
 > ---------------- ------ -------- -------- ---- --- --- -- -----
 > sys_ban -- -- -- -- 0 15 1 E R
 > sys_srv -- -- -- -- 7 8 10 E R
 > sys_tst -- -- -- -- 0 1 1 E R
 > sys_ban_quick -- -- -- -- 0 0 1 E R
 > sys_queue -- -- -- -- 0 0 2 E R
 > ----- -----
 > 7 24
 >

