[torquedev] Connection problems with torque 2.4.11

Lukasz Flis l.flis at cyf-kr.edu.pl
Tue Nov 23 09:50:18 MST 2010


Dear Torque Users and Developers,

We recently migrated our Torque installation from version 2.3.9 to
2.4.11.
Unfortunately, since the upgrade many jobs are being run twice because
of the problem described below:

It seems that after a job finishes, pbs_mom is unable to contact the
PBS server.
A quick look at the mom_server code shows that the client_to_svr
function is used to establish the mom->pbs_server connection over TCP
on port 1023.

The problem is that the PBS server is not listening on TCP port 1023
(according to netstat).
As a result, client_to_svr always fails with a connection refused error.
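
The netstat observation can be cross-checked from a compute node with a
plain TCP connect. The following is only a minimal sketch: the helper
name tcp_port_accepts is made up for illustration and is not part of
Torque, and it only tests whether something accepts connections on the
given port, not whether client_to_svr itself would succeed.

```python
import socket


def tcp_port_accepts(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only; not a Torque function.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

For example, tcp_port_accepts("pbs-server.example", 1023) should return
False if nothing is listening there; the host name is a placeholder for
the real PBS server.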

This is causing a lot of problems (especially with grid services),
because some jobs are run twice even when the first run was successful.

Below are logs gathered from one of hundreds of pbs_moms, indicating a
problem with OBIT delivery.


11/21/2010 19:06:56;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:07:41;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:08:26;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:09:11;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:09:56;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:10:41;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:11:26;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:11:54;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.11, loglevel = 0
11/21/2010 19:12:11;0080;   pbs_mom;Job;7500115.batch;obit sent to server
11/21/2010 19:12:11;0080;   pbs_mom;Job;7500115.batch;removing transient job directory /scratch/7500115.
11/21/2010 19:13:31;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
11/21/2010 19:13:31;0001;   pbs_mom;Job;TMomFinalizeJob3;job 7500603.batch.grid.cyf-kr.edu.pl started, pid = 22519
11/21/2010 19:15:01;0080;   pbs_mom;Job;7500603.batch.;scan_for_terminated: job 7500603.batch.grid.cyf-kr.edu.pl task 1 terminated, sid=22519
11/21/2010 19:15:01;0008;   pbs_mom;Job;7500603.batch;job was terminated
11/21/2010 19:15:01;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:15:02;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:15:03;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:15:04;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:15:05;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
11/21/2010 19:15:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 1023 in client_to_svr - connection refused
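
To see how widespread this is across many moms, the error lines can be
counted per calling function. The following is a rough sketch that
assumes the log format shown in the excerpt above; the function names
join_wrapped and count_connect_errors are made up for illustration. It
first rejoins entries that were wrapped by the mail client, then counts
the client_to_svr failures.

```python
import re
from collections import Counter

# A log entry starts with a timestamp such as "11/21/2010 19:06:56;".
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
# Matches the connection failure message seen in the excerpt above.
CONNECT_ERROR = re.compile(
    r"in (\w+), cannot connect to port (\d+) in client_to_svr"
)


def join_wrapped(lines):
    """Merge wrapped continuation lines back into single log entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            # No timestamp: treat as the continuation of the previous entry.
            entries[-1] += " " + line
    return entries


def count_connect_errors(lines):
    """Count client_to_svr failures per (calling function, port)."""
    counts = Counter()
    for entry in join_wrapped(lines):
        match = CONNECT_ERROR.search(entry)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Feeding it the lines of a mom log file (for example via
open(path).readlines(), with path pointing at the log) gives a Counter
keyed by function name and port.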


Best Regards
--
Lukasz Flis

