[torqueusers] Help! One Puzzle At a Time...

Gus Correa gus at ldeo.columbia.edu
Wed Sep 7 09:39:55 MDT 2011


Hi Sam

One of your error messages complains that ${TORQUE}/mom_priv/config
doesn't exist.  Why not create it, then restart pbs_mom?
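
A minimal config just points the MOM back at your server.  As a rough
sketch (assuming the /var/spool/torque path from your 'echo ... > nodes'
command below, and that naboo is the name in ${TORQUE}/server_name):

# echo '$pbsserver naboo' > /var/spool/torque/mom_priv/config

Then restart pbs_mom from the command line, as you normally do.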

Can you resolve the short name 'naboo'?
Is it via DNS, /etc/hosts, or something else?
Does the short name 'naboo' have an IP address? (ping -c 1 naboo)
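
For instance (plain standard commands, nothing TORQUE-specific; adjust
as needed for your setup):

$ ping -c 1 naboo
$ getent hosts naboo
$ grep naboo /etc/hosts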

Anyway, since this is a single machine,
have you tried using 'localhost' instead of naboo in
${TORQUE}/server_name and ${TORQUE}/server_priv/nodes
(and in ${TORQUE}/mom_priv/config, if you create it)?
(Then restart everything.)
I wonder if this would be a simpler approach for a single machine.
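
Just as an untested sketch, using the same /var/spool/torque path and
np=40 from your nodes file (adjust if your TORQUE home differs):

# echo "localhost" > /var/spool/torque/server_name
# echo "localhost np=40" > /var/spool/torque/server_priv/nodes
# echo '$pbsserver localhost' > /var/spool/torque/mom_priv/config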

I hope this helps,
Gus Correa


sam oubari wrote:
> Hi Gus,
>  
> Again, thank you for helping.  I can't use my work email to post, not
> sure why, and Yahoo does not handle listservs well.  So please post your
> response instead.
> 
> 1) I don't have ${TORQUE}/mom_priv/config.  Do I simply create one like
> the one shown below?
> 2) Yes, I restarted.  I am still puzzled why node settings don't show in 
> qmgr when I issue "l n" and why they don't seem to stick when I activate?
> 3) I don't have PBS defined as a service; I start/stop it from the command line.
> 4) New clues:
> In /var/log/messages, when dying (usually once a week):
> Sep  5 23:01:18 naboo pbs_server: LOG_WARNING::Expired credential 
> (15022) in send_job, child timed-out attempting to start job 
> 3451.naboo.linnbenton.edu
> Sep  5 23:01:18 naboo pbs_server: LOG_ERROR::stream_eof, connection to 
> naboo is bad, remote service may be down, message may be corrupt, or 
> connection may have been dropped remotely (No error).  setting node 
> state to down
> When restarting MOM:
> Sep  6 06:26:04 naboo pbs_mom: LOG_ERROR::No such file or directory (2) 
> in read_config, fstat: config
> In sched_logs/20110905:
> 09/05/2011 23:10:00;0040; pbs_sched;Job;3451.naboo.linnbenton.edu;Not 
> enough of the right type of nodes available
> 09/05/2011 23:13:00;0040; 
> pbs_sched;Job;3571.naboo.linnbenton.edu;Draining system to allow 
> 3451.naboo.linnbenton.edu to run
> 09/05/2011 23:30:00;0040; 
> pbs_sched;Job;3453.naboo.linnbenton.edu;Draining system to allow 
> 3451.naboo.linnbenton.edu to run
> Sam.
> 
>     Gus Correa gus at ldeo.columbia.edu
>     Wed Sep 7 07:16:03 MDT 2011
>     Hi Sam
>     I added your original message below, so that other people can read it.
>     Do you have a ${TORQUE}/mom_priv/config file, pointing to your
>     pbs_server, probably:
>         $pbsserver naboo
>     [Assuming naboo is the server name in ${TORQUE}/server_name.]
>     Did you restart pbs_server after you modified
>     ${TORQUE}/server_priv/nodes, etc?  (service pbs_server restart)
>     Anything in your /var/log/messages file telling why pbs_mom dies?
>     I hope this helps,
>     Gus

sam oubari wrote:
 > > Hi Gus,
 > >
 > > I am using pbs_sched and all is on one server.  To clarify, on
 > > occasion, jobs stay in Q until I bounce MOM.  I am pretty sure
 > > something is wrong with my only node.  Sam.
 > > Gus Correa gus at ldeo.columbia.edu wrote
 > > on Tue Sep 6 08:25:43 MDT 2011:
 > >
 > > Regarding the long time in Q state after H state.
 > > If you are using the maui scheduler, this may be due to the default
 > > defertime of 1 hour.
 > > In this case, try setting it to less.
 > > For instance, if you want it to be one minute, add this line:
 > > DEFERTIME 00:01:00
 > > to your ${MAUI}/maui.cfg file and restart maui.
 > >
 > > See also:
 > > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
 > >
 > > Not sure if I understood it right, but
 > > for the 'resource temporarily unavailable' problem,
 > > qnodes is reporting the 'naboo' node as 'down', hence unavailable.
 > > It may need a reboot.
 > >
 > > I hope this helps,
 > > Gus Correa
 > >
sam oubari wrote:
  > Hello,
  > I hope someone can help:
  > 1) When we were running 2.4.11, every few weeks, pbs_sched would die. We
  > upgraded using a fresh install to 2.5.6 about two months ago, and we
  > configured it like we did with 2.4.11 using:
  > ./configure --enable-docs --disable-dependency-tracking
  > --disable-libtool-lock --with-scp
  > Now, almost every Sunday at 11pm (we do fire up a few jobs, but we do
  > that every AM and PM), pbs_mom dies or goes defunct, e.g.:
  > $ ps -ef|grep pbs
  > root 6704 1 0 Jul13 ? 00:02:18 /usr/local/sbin/pbs_server
  > root 6910 1 0 Jul13 ? 00:00:56 /usr/local/sbin/pbs_sched
  > rpt_devl 8871 10997 0 Jul31 ? 00:00:00 [pbs_mom] <defunct>
  > root 10997 1 4 Jul25 ? 07:48:14 /usr/local/sbin/pbs_mom
  > Usually, at that time, there are 4 jobs waiting to execute to perform
  > clean up on 4 DBs, and that seems to get MOM stuck.
  > See Dump-1 at the bottom. Our current config is shown below Dump-1 as
  > Dump-2.
  > 2) Both 2.4.x and 2.5.x occasionally don't schedule a waiting job; if I
  > recall, it goes from W to Q but not R.  When that happens, I force it
  > with qrun.
  > 3) I had manually created server_priv/nodes (I just made np=40; it used
  > to be 20):
  > # echo "naboo np=40">/var/spool/torque/server_priv/nodes
  > But I still cannot verify within qmgr:
  > # qmgr
  > list nodes
  > No Active Nodes, nothing done.
  > 4) I configured things manually, starting with "pbs_server -t create", but
  > now I am missing $TORQUE_HOME/mom_priv/config.  For my simple install, is
  > it required?
  > 5) Speaking of qmgr, most of the time when I enter it, it quits without
  > any output after I issue my first command.  If I re-enter it immediately,
  > it then accepts all my commands with no problem.  This has been true for
  > both 2.4.x and 2.5.x.
  > Any ideas?  If things don't improve, I am planning to revert to 2.4.x.
  > Thx!  Sam.
  > ------
  > Sam Oubari, Manager of Systems & Application Programming
  > Linn-Benton Community College -- Information Services
  > 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
  > Tel: 541-917-4355/Fax: 541-917-4379
  > *********** Dump-1:
  > 07/31/2011 22:55:37;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
  > 2.5.6, loglevel = 0
  > 07/31/2011 23:00:00;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 6780.naboo.linnbenton.edu started, pid = 8294
  > 07/31/2011 23:00:08;0080;
  > pbs_mom;Job;6780.naboo.linnbenton.edu;scan_for_terminated: job
  > 6780.naboo.linnbenton.edu task 1 terminated, sid=8294
  > 07/31/2011 23:00:08;0008; pbs_mom;Job;6780.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:08;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:08;0080; pbs_mom;Job;6780.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:08;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 6781.naboo.linnbenton.edu started, pid = 8439
  > 07/31/2011 23:00:10;0080;
  > pbs_mom;Job;6781.naboo.linnbenton.edu;scan_for_terminated: job
  > 6781.naboo.linnbenton.edu task 1 terminated, sid=8439
  > 07/31/2011 23:00:10;0008; pbs_mom;Job;6781.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:10;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:10;0080; pbs_mom;Job;6781.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 7141.naboo.linnbenton.edu started, pid = 8579
  > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 7143.naboo.linnbenton.edu started, pid = 8582
  > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 7146.naboo.linnbenton.edu started, pid = 8589
  > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job
  > 7147.naboo.linnbenton.edu started, pid = 8603
  > 07/31/2011 23:00:10;0080;
  > pbs_mom;Job;7141.naboo.linnbenton.edu;scan_for_terminated: job
  > 7141.naboo.linnbenton.edu task 1 terminated, sid=8579
  > 07/31/2011 23:00:10;0008; pbs_mom;Job;7141.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:10;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:10;0080; pbs_mom;Job;7141.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7143.naboo.linnbenton.edu;scan_for_terminated: job
  > 7143.naboo.linnbenton.edu task 1 terminated, sid=8582
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7146.naboo.linnbenton.edu;scan_for_terminated: job
  > 7146.naboo.linnbenton.edu task 1 terminated, sid=8589
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7147.naboo.linnbenton.edu;scan_for_terminated: job
  > 7147.naboo.linnbenton.edu task 1 terminated, sid=8603
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7143.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7146.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7143.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7146.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7147.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7115.naboo.linnbenton.edu;scan_for_terminated: job
  > 7115.naboo.linnbenton.edu task 1 terminated, sid=28947
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7147.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7115.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7116.naboo.linnbenton.edu;scan_for_terminated: job
  > 7116.naboo.linnbenton.edu task 1 terminated, sid=28966
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7118.naboo.linnbenton.edu;scan_for_terminated: job
  > 7118.naboo.linnbenton.edu task 1 terminated, sid=29020
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Job;7117.naboo.linnbenton.edu;scan_for_terminated: job
  > 7117.naboo.linnbenton.edu task 1 terminated, sid=29083
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7116.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7118.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7116.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7115.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7118.naboo.linnbenton.edu;obit
  > sent to server
  > 07/31/2011 23:00:11;0008; pbs_mom;Job;7117.naboo.linnbenton.edu;job was
  > terminated
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
  > 07/31/2011 23:00:11;0080;
  > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
  > of while loop
  > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no
  > error from job stat
  > 07/31/2011 23:00:11;0080; pbs_mom;Job;7117.naboo.linnbenton.edu;obit
  > sent to server
  > <eof>
  >  
  > *********** Dump-2:
  > # printserverdb
  > ---------------------------------------------------
  > numjobs: 50
  > numque: 5
  > jobidnumber: 615
  > savetm: 1314100391
  > --attributes--
  > scheduling = True
  > max_running = 23
  > total_jobs = 22
  > state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
  > default_queue = sys_tst
  > log_events = 511
  > mail_from = adm
  > query_other_jobs = False
  > resources_assigned.nodect = 0
  > scheduler_iteration = 600
  > node_check_rate = 150
  > tcp_timeout = 6
  > mom_job_sync = False
  > pbs_version = 2.5.6
  > keep_completed = 600
  > allow_node_submit = True
  > next_job_number = 1
  > net_counter = 7 1 0
  > # qnodes
  > naboo
  > state = free
  > np = 40
  > ntype = cluster
  > jobs = 0/2.naboo.linnbenton.edu, 2/461.naboo.linnbenton.edu,
  > 3/462.naboo.linnbenton.edu, 4/463.naboo.linnbenton.edu,
  > 5/466.naboo.linnbenton.edu, 6/467.naboo.linnbenton.edu,
  > 7/468.naboo.linnbenton.edu
  > status = rectime=1314203130,varattr=,jobs=2.naboo.linnbenton.edu
  > 461.naboo.linnbenton.edu 462.naboo.linnbenton.edu
  > 463.naboo.linnbenton.edu 466.naboo.linnbenton.edu
  > 467.naboo.linnbenton.edu
  > 468.naboo.linnbenton.edu,state=free,netload=1125010513873,gres=,loadave=1.37,ncpus=4,physmem=17040092kb,availmem=23052344kb,totmem=29739432kb,idletime=71635,nusers=11,nsessions=163,sessions=612
  > 614 616 618 620 622 624 626 628 630 632 634 636 638 640 642 644 646 648
  > 650 652 659 661 665 678 680 682 684 686 688 690 692 694 696 698 700 702
  > 704 706 708 710 712 716 720 725 727 729 731 733 735 737 739 741 743 745
  > 747 749 751 753 755 757 759 769 771 776 778 780 782 784 786 788 790 792
  > 794 796 798 800 802 804 806 808 810 820 822 829 899 901 903 905 909 5763
  > 911 1547 1648 2569 2635 2691 2753 2814 2878 2932 5839 5875 7864 7985
  > 7987 7989 7993 7995 7997 7999 8001 8003 8005 8007 8009 8011 8013 8015
  > 8017 8019 8021 8023 8025 8027 8029 8129 8131 8163 8165 8167 10447 10505
  > 10562 10618 10706 11904 11966 11981 12022 12080 12433 13937 14899 15031
  > 15282 16152 17383 22451 23720 31671 31673 31675 31677 31683 31708 31712
  > 31979 32003 32088 32091 32102 32116,uname=Linux naboo.linnbenton.edu
  > 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011
  > x86_64,opsys=linux
  > gpus = 0
  > # qmgr -c "l s"
  > Server naboo.linnbenton.edu
  > server_state = Active
  > scheduling = True
  > max_running = 23
  > total_jobs = 41
  > state_count = Transit:0 Queued:0 Held:3 Waiting:21 Running:7 Exiting:0
  > acl_hosts = naboo.linnbenton.edu,naboo
  > managers =  _usr1_
  > operators =
  > _usr1,_ usr2, ....
  > default_queue = sys_tst
  > log_events = 511
  > mail_from = adm
  > query_other_jobs = False
  > resources_assigned.nodect = 7
  > scheduler_iteration = 600
  > node_check_rate = 150
  > tcp_timeout = 6
  > mom_job_sync = False
  > pbs_version = 2.5.6
  > keep_completed = 600
  > allow_node_submit = True
  > next_job_number = 625
  > net_counter = 6 2 1
  > # qmgr -c "l q sys_ban" # This is our main queue
  > Queue sys_ban
  > queue_type = Execution
  > max_queuable = 300
  > total_jobs = 21
  > state_count = Transit:0 Queued:0 Held:0 Waiting:15 Running:0 Exiting:0
  > max_running = 1
  > resources_default.nodes = 1
  > resources_default.walltime = 168:00:00
  > mtime = Sat Aug 20 05:25:20 2011
  > resources_assigned.nodect = 0
  > enabled = True
  > started = True
  > # qstat -q
  > server: naboo.linnbenton.edu
  > Queue Memory CPU Time Walltime Node Run Que Lm State
  > ---------------- ------ -------- -------- ---- --- --- -- -----
  > sys_ban -- -- -- -- 0 15 1 E R
  > sys_srv -- -- -- -- 7 8 10 E R
  > sys_tst -- -- -- -- 0 1 1 E R
  > sys_ban_quick -- -- -- -- 0 0 1 E R
  > sys_queue -- -- -- -- 0 0 2 E R
  > ----- -----
  > 7 24
  >