[torqueusers] torque 4.2.6.1 & 4.2.7: job dies after restarting pbs_mom

Thomas Dargel td at chemie.hu-berlin.de
Mon Mar 17 07:02:23 MDT 2014


Hi All,

  After restarting pbs_mom on a sister node of a running parallel job, I
observed that the job gets killed. I found these messages:

job.e496:
[node32:06220] plm:tm: failed to spawn daemon, error code = 17000

../../mom_logs/20140317:
03/17/2014 13:39:00;0008;   pbs_mom.6359;Svr;task_save;saving task in /var/spool/torque/4261/mom_priv/jobs/496.borneo.TK
03/17/2014 13:39:00;0002;   pbs_mom.6370;n/a;mom_close_poll;entered
03/17/2014 13:39:20;0001;   pbs_mom.6359;Job;496.borneo;task not started, 'orted', stdio setup failed (see syslog)
03/17/2014 13:39:20;0008;   pbs_mom.6359;Job;496.borneo;ERROR:    received request 'SPAWN_TASK' from 192.168.100.132:822 for job '496.borneo' (cannot start task)

/var/log/messages:
Mar 17 13:39:00 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:02 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:04 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:06 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:08 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:10 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:12 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:14 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:18 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 192.168.100.132:0
Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
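
The syslog excerpt can be condensed with a few lines of Python. This is only
a sketch: the regular expression is inferred from the lines quoted here, not
from any documented torque log format, and the names SYSLOG, LINE_RE and
summarize are made up for illustration.

```python
import re
from datetime import datetime

# Excerpt from /var/log/messages as quoted above, with the mail wrapping undone.
_REFUSED = (
    "Mar 17 13:39:{:02d} node31 pbs_mom: LOG_ERROR::Connection refused (111) "
    "in open_demux, open_demux: cannot connect to 192.168.100.132:0"
)
SYSLOG = "\n".join(
    # The ten "Connection refused" lines, logged at 13:39:00 .. 13:39:18.
    [_REFUSED.format(sec) for sec in range(0, 20, 2)]
    + [
        "Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) "
        "in open_demux, open_demux: connect 192.168.100.132:0",
        "Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) "
        "in start_process, cannot open mux stdout port",
    ]
)

# Assumed line layout (inferred from this excerpt only):
#   <date> <HH:MM:SS> <host> pbs_mom: LOG_ERROR::<text> (<errno>) in <function>, <detail>
LINE_RE = re.compile(
    r"^\w{3} +\d+ (\d\d:\d\d:\d\d) \S+ pbs_mom: "
    r"LOG_ERROR::(.+?) \((\d+)\) in (\w+), (.*)$"
)


def summarize(text):
    """Count 'Connection refused' (errno 111) retries logged by open_demux.

    Returns (number of attempts, distinct gaps in seconds, distinct targets).
    Only the time of day is parsed, so a run crossing midnight is not handled.
    """
    times, targets = [], set()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        when, _text, errno_, func, detail = m.groups()
        if func == "open_demux" and errno_ == "111":
            times.append(datetime.strptime(when, "%H:%M:%S"))
            targets.add(detail.rsplit(" ", 1)[-1])  # trailing "ip:port"
    gaps = {int((b - a).total_seconds()) for a, b in zip(times, times[1:])}
    return len(times), sorted(gaps), sorted(targets)


if __name__ == "__main__":
    attempts, gaps, targets = summarize(SYSLOG)
    print(f"{attempts} refused attempts, gaps (s): {gaps}, targets: {targets}")
```

On this excerpt it reports 10 refused attempts, 2 seconds apart, all to
192.168.100.132:0. That matches the 20 seconds between task_save (13:39:00)
and "task not started" (13:39:20) in the mom log.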

It occurs when the job script contains several parallel job invocations. The
invocation that is running while pbs_mom is restarted finishes fine; the error
occurs during the start-up of the next parallel invocation in the job script.

Torque: 4.2.6.1 and 4.2.7
Maui: 3.3.1
OS: SLES11SP3, 3.0.101-0.15

The job is not affected if only the pbs_mom on the master node of this job
is restarted.

Since I'm very new to the 4.X.X versions of torque, I'm not sure whether this
is normal behavior. If not, could someone give me a hint on how to overcome it?

Thank you in advance,
regards

  Thomas.


