[torqueusers] Help! One Puzzle At a Time...
Gus Correa
gus at ldeo.columbia.edu
Wed Sep 7 09:39:55 MDT 2011
Hi Sam
One of your error messages complains that ${TORQUE}/mom_priv/config
doesn't exist. Why not create it, then restart pbs_mom?
Can you resolve the short name 'naboo'?
Is it via DNS, /etc/hosts, something else?
Does the short name 'naboo' have an IP address? (ping -c 1 naboo)
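
To see whether /etc/hosts is what answers for the short name, a small
helper like the one below could be used. It is only a sketch: the
function name hosts_lookup is made up for illustration, and it reads
only a hosts-format file, not DNS.

```shell
# Print the IP address(es) that a hosts-format file maps to a given name.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
# hosts_lookup is a hypothetical helper, not part of TORQUE.
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # drop trailing comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "${2:-/etc/hosts}"
}
```

If "hosts_lookup naboo" prints nothing, the name is not in /etc/hosts,
and "getent hosts naboo" or "ping -c 1 naboo" will show whether DNS or
another source resolves it.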
Anyway, since this is a single machine,
have you tried using 'localhost' in the
${TORQUE}/server_name and ${TORQUE}/server_priv/nodes files
(and in ${TORQUE}/mom_priv/config if you want to create it)
instead of naboo?
(Then restart everything.)
I wonder if this would be a simpler approach for a single machine.
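
A minimal sketch of what that could look like follows. The function
name write_single_host_config is hypothetical. The file locations and
np=40 are taken from the messages quoted below, and the $pbsserver
line is the one suggested earlier in this thread. Treat it as an
illustration, not a tested recipe for your install.

```shell
# Sketch: write the three single-host config files under a TORQUE home.
# Usage: write_single_host_config TORQUE_HOME [HOST] [NP]
# HOST defaults to localhost and NP to 40 (the np value quoted in this thread).
# The body runs in a subshell so its variables do not leak into the caller.
write_single_host_config() (
    home="$1"
    host="${2:-localhost}"
    np="${3:-40}"
    mkdir -p "$home/server_priv" "$home/mom_priv"
    echo "$host" > "$home/server_name"
    echo "$host np=$np" > "$home/server_priv/nodes"
    # Single quotes keep the shell from expanding the literal $pbsserver.
    printf '$pbsserver %s\n' "$host" > "$home/mom_priv/config"
)
```

For example, "write_single_host_config /var/spool/torque" (run as
root, with pbs_server, pbs_sched and pbs_mom stopped) would overwrite
those three files, so back up the originals first.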
I hope this helps,
Gus Correa
sam oubari wrote:
> Hi Gus,
>
> Again, thank you for helping. I can't use my work email to post, not
> sure why, and Yahoo does not handle listservs well. So please post your
> response instead.
>
> 1) I don't have ${TORQUE}/mom_priv/config. Do I simply create one like
> shown below?
> 2) Yes, I restarted. I am still puzzled why node settings don't show in
> qmgr when I issue "l n" and why they don't seem to stick when I activate?
> 3) I don't have PBS defined as a service, I start/stop from command line.
> 4) New clues:
> In /var/log/messages, when dying (usually once a week):
> Sep 5 23:01:18 naboo pbs_server: LOG_WARNING::Expired credential
> (15022) in send_job, child timed-out attempting to start job
> 3451.naboo.linnbenton.edu
> Sep 5 23:01:18 naboo pbs_server: LOG_ERROR::stream_eof, connection to
> naboo is bad, remote service may be down, message may be corrupt, or
> connection may have been dropped remotely (No error). setting node
> state to down
> When restarting MOM:
> Sep 6 06:26:04 naboo pbs_mom: LOG_ERROR::No such file or directory (2)
> in read_config, fstat: config
> In sched_logs/20110905:
> 09/05/2011 23:10:00;0040; pbs_sched;Job;3451.naboo.linnbenton.edu;Not
> enough of the right type of nodes available
> 09/05/2011 23:13:00;0040;
> pbs_sched;Job;3571.naboo.linnbenton.edu;Draining system to allow
> 3451.naboo.linnbenton.edu to run
> 09/05/2011 23:30:00;0040;
> pbs_sched;Job;3453.naboo.linnbenton.edu;Draining system to allow
> 3451.naboo.linnbenton.edu to run
> Sam.
>
> Gus Correa gus at ldeo.columbia.edu wrote on Wed Sep 7 07:16:03 MDT 2011:
> Hi Sam
> I added your original message below, so that other people can read it.
> Do you have a ${TORQUE}/mom_priv/config file, pointing to your
> pbs_server, probably:
> $pbsserver naboo
> [Assuming naboo is the server name in ${TORQUE}/server_name.]
> Did you restart pbs_server after you modified
> ${TORQUE}/server_priv/nodes, etc? (service pbs_server restart)
> Anything in your /var/log/messages file telling why pbs_mom dies?
> I hope this helps,
> Gus
sam oubari wrote:
> > Hi Gus,
> >
> > I am using pbs_sched and all is on one server. To clarify, on
> > occasion, jobs stay in Q until I bounce MOM. I am pretty sure
> > something is wrong with my only node. Sam.
> > Gus Correa gus at ldeo.columbia.edu wrote on Tue Sep 6 08:25:43 MDT 2011:
> >
> > Regarding the long time in Q state after H state.
> > If you are using the maui scheduler, this may be due to the default
> > defertime of 1 hour.
> > In this case, try setting it to less.
> > For instance, if you want it to be one minute, add this line:
> > DEFERTIME 00:01:00
> > to your ${MAUI}/maui.cfg file and restart maui.
> >
> > See also:
> > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
> >
> > Not sure if I understood it right, but
> > for the 'resource temporarily unavailable' problem,
> > qnodes is reporting the 'naboo' node as 'down', hence unavailable.
> > It may need a reboot.
> >
> > I hope this helps,
> > Gus Correa
> >
sam oubari wrote:
> Hello,
> I hope someone can help:
> 1) When we were running 2.4.11, every few weeks, pbs_sched would die. We
> upgraded using a fresh install to 2.5.6 about two months ago, and we
> configured it like we did with 2.4.11 using:
> ./configure --enable-docs --disable-dependency-tracking
> --disable-libtool-lock --with-scp
> Now, almost every Sunday at 11pm (we do fire up a few jobs but we do
> that every AM and PM), mom dies or goes defunct, e.g.:
> $ ps -ef|grep pbs
> root 6704 1 0 Jul13 ? 00:02:18 /usr/local/sbin/pbs_server
> root 6910 1 0 Jul13 ? 00:00:56 /usr/local/sbin/pbs_sched
> rpt_devl 8871 10997 0 Jul31 ? 00:00:00 [pbs_mom] <defunct>
> root 10997 1 4 Jul25 ? 07:48:14 /usr/local/sbin/pbs_mom
> Usually, at that time, there are 4 jobs waiting to execute to perform
> clean up on 4 DBs, and that seems to get MOM stuck.
> See Dump-1 at the bottom. Our current config is shown below Dump-1 as
> Dump-2.
> 2) Both 2.4.x and 2.5.x occasionally don't schedule a waiting job; if I
> recall, it goes from W to Q but not to R. When that happens, I force it
> with qrun.
> 3) I manually had created server_priv/nodes (I just made np=40, it used
> to be 20):
> # echo "naboo np=40">/var/spool/torque/server_priv/nodes
> But I still cannot verify within qmgr:
> # qmgr
> list nodes
> No Active Nodes, nothing done.
> 4) I manually configured by starting with "pbs_server -t create", but
> now I am missing $TORQUE_HOME/mom_priv/config. For my simple install, is
> it required?
> 5) Speaking of qmgr, most of the time when I enter it, it quits without
> any output after I issue my 1st command. I re-enter immediately, then it
> accepts all my commands with no problem. This has been true for 2.4.x
> and 2.5.x.
> Any ideas? If things don't improve, I am planning to revert back to
> 2.4.x. Thx! Sam.
> ------
> Sam Oubari, Manager of Systems & Application Programming
> Linn-Benton Community College -- Information Services
> 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
> Tel: 541-917-4355/Fax: 541-917-4379
> *********** Dump-1:
> 07/31/2011 22:55:37;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.6, loglevel = 0
> 07/31/2011 23:00:00;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 6780.naboo.linnbenton.edu started, pid = 8294
> 07/31/2011 23:00:08;0080;
> pbs_mom;Job;6780.naboo.linnbenton.edu;scan_for_terminated: job
> 6780.naboo.linnbenton.edu task 1 terminated, sid=8294
> 07/31/2011 23:00:08;0008; pbs_mom;Job;6780.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:08;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:08;0080; pbs_mom;Job;6780.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:08;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 6781.naboo.linnbenton.edu started, pid = 8439
> 07/31/2011 23:00:10;0080;
> pbs_mom;Job;6781.naboo.linnbenton.edu;scan_for_terminated: job
> 6781.naboo.linnbenton.edu task 1 terminated, sid=8439
> 07/31/2011 23:00:10;0008; pbs_mom;Job;6781.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:10;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:10;0080; pbs_mom;Job;6781.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7141.naboo.linnbenton.edu started, pid = 8579
> 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7143.naboo.linnbenton.edu started, pid = 8582
> 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7146.naboo.linnbenton.edu started, pid = 8589
> 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7147.naboo.linnbenton.edu started, pid = 8603
> 07/31/2011 23:00:10;0080;
> pbs_mom;Job;7141.naboo.linnbenton.edu;scan_for_terminated: job
> 7141.naboo.linnbenton.edu task 1 terminated, sid=8579
> 07/31/2011 23:00:10;0008; pbs_mom;Job;7141.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:10;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:10;0080; pbs_mom;Job;7141.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7143.naboo.linnbenton.edu;scan_for_terminated: job
> 7143.naboo.linnbenton.edu task 1 terminated, sid=8582
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7146.naboo.linnbenton.edu;scan_for_terminated: job
> 7146.naboo.linnbenton.edu task 1 terminated, sid=8589
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7147.naboo.linnbenton.edu;scan_for_terminated: job
> 7147.naboo.linnbenton.edu task 1 terminated, sid=8603
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7143.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7146.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7143.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7146.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7147.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7115.naboo.linnbenton.edu;scan_for_terminated: job
> 7115.naboo.linnbenton.edu task 1 terminated, sid=28947
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7147.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7115.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7116.naboo.linnbenton.edu;scan_for_terminated: job
> 7116.naboo.linnbenton.edu task 1 terminated, sid=28966
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7118.naboo.linnbenton.edu;scan_for_terminated: job
> 7118.naboo.linnbenton.edu task 1 terminated, sid=29020
> 07/31/2011 23:00:11;0080;
> pbs_mom;Job;7117.naboo.linnbenton.edu;scan_for_terminated: job
> 7117.naboo.linnbenton.edu task 1 terminated, sid=29083
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7116.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7118.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7116.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7115.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7118.naboo.linnbenton.edu;obit
> sent to server
> 07/31/2011 23:00:11;0008; pbs_mom;Job;7117.naboo.linnbenton.edu;job was
> terminated
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/31/2011 23:00:11;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
> of while loop
> 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 07/31/2011 23:00:11;0080; pbs_mom;Job;7117.naboo.linnbenton.edu;obit
> sent to server
> <eof>
>
> *********** Dump-2:
> # printserverdb
> ---------------------------------------------------
> numjobs: 50
> numque: 5
> jobidnumber: 615
> savetm: 1314100391
> --attributes--
> scheduling = True
> max_running = 23
> total_jobs = 22
> state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
> default_queue = sys_tst
> log_events = 511
> mail_from = adm
> query_other_jobs = False
> resources_assigned.nodect = 0
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = False
> pbs_version = 2.5.6
> keep_completed = 600
> allow_node_submit = True
> next_job_number = 1
> net_counter = 7 1 0
> # qnodes
> naboo
> state = free
> np = 40
> ntype = cluster
> jobs = 0/2.naboo.linnbenton.edu, 2/461.naboo.linnbenton.edu,
> 3/462.naboo.linnbenton.edu, 4/463.naboo.linnbenton.edu,
> 5/466.naboo.linnbenton.edu, 6/467.naboo.linnbenton.edu,
> 7/468.naboo.linnbenton.edu
> status = rectime=1314203130,varattr=,jobs=2.naboo.linnbenton.edu
> 461.naboo.linnbenton.edu 462.naboo.linnbenton.edu
> 463.naboo.linnbenton.edu 466.naboo.linnbenton.edu
> 467.naboo.linnbenton.edu
> 468.naboo.linnbenton.edu,state=free,netload=1125010513873,gres=,loadave=1.37,ncpus=4,physmem=17040092kb,availmem=23052344kb,totmem=29739432kb,idletime=71635,nusers=11,nsessions=163,sessions=612
> 614 616 618 620 622 624 626 628 630 632 634 636 638 640 642 644 646 648
> 650 652 659 661 665 678 680 682 684 686 688 690 692 694 696 698 700 702
> 704 706 708 710 712 716 720 725 727 729 731 733 735 737 739 741 743 745
> 747 749 751 753 755 757 759 769 771 776 778 780 782 784 786 788 790 792
> 794 796 798 800 802 804 806 808 810 820 822 829 899 901 903 905 909 5763
> 911 1547 1648 2569 2635 2691 2753 2814 2878 2932 5839 5875 7864 7985
> 7987 7989 7993 7995 7997 7999 8001 8003 8005 8007 8009 8011 8013 8015
> 8017 8019 8021 8023 8025 8027 8029 8129 8131 8163 8165 8167 10447 10505
> 10562 10618 10706 11904 11966 11981 12022 12080 12433 13937 14899 15031
> 15282 16152 17383 22451 23720 31671 31673 31675 31677 31683 31708 31712
> 31979 32003 32088 32091 32102 32116,uname=Linux naboo.linnbenton.edu
> 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011
> x86_64,opsys=linux
> gpus = 0
> # qmgr -c "l s"
> Server naboo.linnbenton.edu
> server_state = Active
> scheduling = True
> max_running = 23
> total_jobs = 41
> state_count = Transit:0 Queued:0 Held:3 Waiting:21 Running:7 Exiting:0
> acl_hosts = naboo.linnbenton.edu,naboo
> managers = _usr1_
> operators =
> _usr1,_ usr2, ....
> default_queue = sys_tst
> log_events = 511
> mail_from = adm
> query_other_jobs = False
> resources_assigned.nodect = 7
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = False
> pbs_version = 2.5.6
> keep_completed = 600
> allow_node_submit = True
> next_job_number = 625
> net_counter = 6 2 1
> # qmgr -c "l q sys_ban" # This is our main queue
> Queue sys_ban
> queue_type = Execution
> max_queuable = 300
> total_jobs = 21
> state_count = Transit:0 Queued:0 Held:0 Waiting:15 Running:0 Exiting:0
> max_running = 1
> resources_default.nodes = 1
> resources_default.walltime = 168:00:00
> mtime = Sat Aug 20 05:25:20 2011
> resources_assigned.nodect = 0
> enabled = True
> started = True
> # qstat -q
> server: naboo.linnbenton.edu
> Queue Memory CPU Time Walltime Node Run Que Lm State
> ---------------- ------ -------- -------- ---- --- --- -- -----
> sys_ban -- -- -- -- 0 15 1 E R
> sys_srv -- -- -- -- 7 8 10 E R
> sys_tst -- -- -- -- 0 1 1 E R
> sys_ban_quick -- -- -- -- 0 0 1 E R
> sys_queue -- -- -- -- 0 0 2 E R
> ----- -----
> 7 24
>