[torqueusers] how to (re)start mom without killing jobs?

Martin Siegert siegert at sfu.ca
Sun Mar 4 21:34:08 MST 2012


Hi,

Once in a while a mom daemon dies on one of our nodes (I haven't
figured out the reason for the crash, but that is not what this
question is about). I end up with a bunch of jobs still running
on the node, but the node won't be used for new jobs until I restart
the mom. How do I do that without killing the running processes?

We used to be able to do this with the -p argument for the mom,
but apparently this no longer works: every time I start the
mom using "pbs_mom -p" all running jobs get killed. My feeling
is that -p stopped working when we started to use cpusets (I am
not absolutely sure about this, since we have also upgraded torque
versions since then). I find the following in the mom_log:

03/04/2012 20:18:07;0002;   pbs_mom;Svr;Log;Log opened
03/04/2012 20:18:07;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.10, loglevel = 0
03/04/2012 20:18:07;0002;   pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
03/04/2012 20:18:08;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/15' deleted.
03/04/2012 20:18:09;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/14' deleted.
03/04/2012 20:18:10;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/13' deleted.
03/04/2012 20:18:11;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/12' deleted.
03/04/2012 20:18:12;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/11' deleted.
03/04/2012 20:18:13;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/10' deleted.
03/04/2012 20:18:14;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/9' deleted.
03/04/2012 20:18:15;0002;   pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/8' deleted.
03/04/2012 20:18:16;0002;   pbs_mom;Svr;remove_defunct_cpusets;Unused cpuset '/dev/cpuset/torque/6556.dev' deleted.
03/04/2012 20:18:16;0002;   pbs_mom;Svr;setpbsserver;172.18.0.40
03/04/2012 20:18:16;0002;   pbs_mom;Svr;mom_server_add;server 172.18.0.40 added
03/04/2012 20:18:16;0002;   pbs_mom;Svr;setpbsserver;172.18.0.40
03/04/2012 20:18:16;0002;   pbs_mom;Svr;mom_server_add;server host 172.18.0.40 already added
03/04/2012 20:18:16;0002;   pbs_mom;Svr;setpbsserver;localhost
03/04/2012 20:18:16;0002;   pbs_mom;Svr;mom_server_add;server localhost added
03/04/2012 20:18:16;0002;   pbs_mom;Svr;restricted;172.18.0.40
03/04/2012 20:18:16;0002;   pbs_mom;Svr;usecp;*:/home/ /home/
03/04/2012 20:18:16;0002;   pbs_mom;Svr;usecp;*:/global/scratch/ /global/scratch/
03/04/2012 20:18:16;0002;   pbs_mom;Svr;setignvmem;0
03/04/2012 20:18:16;0002;   pbs_mom;Svr;ignmem;1
03/04/2012 20:18:16;0002;   pbs_mom;Svr;settmpdir;/scratch
03/04/2012 20:18:16;0080;   pbs_mom;n/a;add_static;config[11] add name size value [fs=/scratch]
03/04/2012 20:18:16;0002;   pbs_mom;n/a;initialize;independent
03/04/2012 20:18:16;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
03/04/2012 20:18:16;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
03/04/2012 20:18:16;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
03/04/2012 20:18:16;0002;   pbs_mom;Svr;pbs_mom;Is up
03/04/2012 20:18:16;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/local/torque-2.5.10.dbg/sbin/pbs_mom 1330377127
03/04/2012 20:18:16;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.10, loglevel = 0
03/04/2012 20:18:16;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server 172.18.0.40
03/04/2012 20:18:16;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost
03/04/2012 20:18:17;0008;   pbs_mom;Job;scan_non_child_tasks;found exited session 19901 for task 3 in job 6536.dev
03/04/2012 20:18:17;0008;   pbs_mom;Job;scan_non_child_tasks;found exited session 24272 for task 2 in job 6556.dev
03/04/2012 20:18:18;0002;   pbs_mom;Svr;im_eof;End of File from addr 172.18.0.40:15001
03/04/2012 20:18:18;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server 172.18.0.40
03/04/2012 20:21:24;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 127.0.0.1:15001
03/04/2012 20:21:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost

Thus, it appears that the mom first removes all cpusets in /dev/cpuset/torque
before asking the server whether a corresponding job is still supposed
to be running.
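
To check this ordering on other nodes, something like the following
could be run over a mom_log. It is only a rough sketch: it assumes the
semicolon-separated line layout seen in the excerpt above, and the
function names (parse_line, cpusets_deleted_before_hello) are made up
for illustration, not part of torque. It lists the cpusets that were
deleted before the first "sending hello to server" entry.

```python
import re

# Assumed layout, inferred from the excerpt above:
#   "MM/DD/YYYY HH:MM:SS;code;   daemon;class;name;message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});(\d{4});\s*([^;]*);([^;]*);([^;]*);(.*)$"
)

# Log entry names that report a cpuset removal in the excerpt above.
DELETE_NAMES = ("cpuset_delete", "remove_defunct_cpusets")


def parse_line(line):
    """Split one mom_log line into its fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts, code, daemon, cls, name, msg = m.groups()
    return {"time": ts, "code": code, "daemon": daemon,
            "class": cls, "name": name, "message": msg}


def cpusets_deleted_before_hello(lines):
    """Return cpuset paths deleted before the first hello sent to a server."""
    deleted = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip anything that is not a regular log line
        if rec["message"].startswith("sending hello to server"):
            break  # from here on the mom has started talking to the server
        if rec["name"] in DELETE_NAMES:
            path = re.search(r"'([^']+)'", rec["message"])
            if path:
                deleted.append(path.group(1))
    return deleted
```

With the lines of a log file (for example from open(path).readlines())
passed in, a non-empty result means cpusets were removed before the mom
contacted the server at all.
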
Anyway, can somebody tell me how to start the mom without killing jobs?
Thanks!!

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University
Burnaby, British Columbia

