[torqueusers] pbs_mom: how to regularly retry connection to PBS server?

Jan-Philip Gehrcke jgehrcke at googlemail.com
Thu Apr 4 02:57:24 MDT 2013


Hello,

Can we make pbs_mom regularly retry the connection to the server?

Scenario:
On one of our nodes the network interface comes up quite late after 
reboot, and sometimes pbs_mom starts before it is available. In the init 
script, pbs_mom requires $network, but the issue still appears from time 
to time.

In these cases, pbs_mom tries to connect to the PBS server twice right 
after startup, without significant delay in between (log below). Both 
attempts fail. After that, pbs_mom *never* attempts to connect to the 
server again (according to the log). A manual restart of pbs_mom after 
the network interface becomes available makes it connect to the server 
immediately.

If pbs_mom regularly retried the connection, no manual action would be 
required and the system would be up and running about 2 min after 
reboot.

04/03/2013 13:07:16;0002;   pbs_mom.1655;Svr;Log;Log opened
04/03/2013 13:07:16;0002;   pbs_mom.1655;Svr;pbs_mom;Torque Mom Version = 4.1.5, loglevel = 0
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;setpbsserver;gpu1
04/03/2013 13:07:16;0001;   pbs_mom.1743;Svr;pbs_mom;LOG_ERROR::Access from host not allowed, or unknown host (15010) in mom_server_add, Cannot resolve host gpu1 for pbs_server
04/03/2013 13:07:16;0001;   pbs_mom.1743;Svr;pbs_mom;LOG_ERROR::read_config, config[1] special command pbsserver failed with gpu1
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;setcheckpolltime;20
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;setstateuspdatetime;20
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;setpbsserver;gpu1
04/03/2013 13:07:16;0001;   pbs_mom.1743;Svr;pbs_mom;LOG_ERROR::Access from host not allowed, or unknown host (15010) in mom_server_add, Cannot resolve host gpu1 for pbs_server
04/03/2013 13:07:16;0002;   pbs_mom.1743;n/a;initialize;independent
04/03/2013 13:07:16;0080;   pbs_mom.1743;Svr;pbs_mom;before init_abort_jobs
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;pbs_mom;Is up
04/03/2013 13:07:16;0002;   pbs_mom.1743;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/local/sbin/pbs_mom 1363630059
04/03/2013 13:07:19;0002;   pbs_mom.1743;Svr;pbs_mom;Torque Mom Version = 4.1.5, loglevel = 0
04/03/2013 13:12:19;0002;   pbs_mom.1743;Svr;pbs_mom;Torque Mom Version = 4.1.5, loglevel = 0
04/03/2013 13:17:19;0002;   pbs_mom.1743;Svr;pbs_mom;Torque Mom Version = 4.1.5, loglevel = 0
...


Can we somehow achieve regular retry attempts?
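
As a stopgap until pbs_mom can do this itself, a wrapper could delay the 
start of pbs_mom until the server name resolves. The following is only a 
rough sketch, assuming the failure really is the name lookup that the log 
reports ("Cannot resolve host gpu1"). The function name wait_for_host and 
its parameters are made up for illustration and are not part of Torque.

```python
import socket
import time


def wait_for_host(host, timeout=120.0, interval=5.0,
                  resolve=socket.gethostbyname,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until `host` resolves.

    Returns True as soon as the lookup succeeds, or False once
    `timeout` seconds have passed without a successful lookup.
    `resolve`, `sleep` and `clock` are injectable only to make the
    helper easy to test; the defaults use the standard library.
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

An init script could call wait_for_host("gpu1") and start 
/usr/local/sbin/pbs_mom only once it returns True. This works around the 
late network interface but does not make pbs_mom itself retry.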


Thanks for your help,

Jan-Philip Gehrcke

