[torqueusers] Torque job failures during stagein
Ramon Bastiaans
ramon.bastiaans at sara.nl
Fri Nov 26 02:24:53 MST 2010
Hi,
On 11/23/2010 06:00 PM, Lukasz Flis wrote:
> Hi,
>
> I've seen your post on torquedev mailing list.
>
> We observe the same behaviour on our site since we migrated from 2.3.9
> to 2.4.11.
>
The failures we experience only occur when the stagein source files do
not exist.
I have filed a bug report for this:
* http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96
> Could you please tell me which torque version are you currently using
> and have you managed to solve the problem?
>
For that particular bug, there is no simple solution. The jobs should
time out, but until then they can clutter the scheduling. We now run a
script from cron to check for jobs in the Hold state, which is very ugly.
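A check like this could be sketched roughly as below. This is not our actual script, only an illustration. It assumes the default qstat column layout (Job id, Name, User, Time Use, S, Queue) and that held jobs show state "H"; the function names held_jobs and report_held_jobs are made up for the example.

```python
import subprocess


def held_jobs(qstat_text):
    """Return the job ids whose state column is 'H'.

    Assumption: default qstat layout with six columns per job line
    (Job id, Name, User, Time Use, S, Queue).
    """
    held = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header line, and the dashed separator.
        if len(fields) < 6 or fields[0] == "Job" or fields[0].startswith("-"):
            continue
        # The state is the second-to-last column in the assumed layout.
        if fields[-2] == "H":
            held.append(fields[0])
    return held


def report_held_jobs():
    """Run qstat and print the ids of held jobs (hypothetical cron entry point)."""
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    for job_id in held_jobs(result.stdout):
        print(job_id)
```

A cron-driven wrapper would call report_held_jobs() and decide what to do with the listed jobs. What action to take on them is site policy and is left out here.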
> We also see many of the following messages in mom logs:
>
> 11/21/2010 19:16:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:32;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:34;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:35;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:36;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:37;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:37;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:38;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:39;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:40;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:41;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:42;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:43;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:44;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
>
I think these messages are unrelated to the staging. Perhaps you are
experiencing a network issue.
We don't see these messages at all.
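To see whether such errors come in bursts, which might point at a network problem, the mom log can be summarized per minute. The following is only a sketch. It assumes each log record sits on a single line (not wrapped as in the mail above) and has the semicolon-separated layout shown in the excerpts; the function name count_refused_per_minute is made up for the example.

```python
from collections import Counter


def count_refused_per_minute(log_lines):
    """Count 'connection refused' records per minute.

    Assumed record layout, taken from the excerpts in this thread:
    timestamp;code;daemon;class;name;message
    """
    counts = Counter()
    for line in log_lines:
        parts = line.strip().split(";", 5)
        if len(parts) < 6:
            continue  # not a complete record, e.g. a wrapped continuation line
        timestamp, message = parts[0], parts[5]
        if "connection refused" in message:
            # 'MM/DD/YYYY HH:MM' is the first 16 characters of the timestamp.
            counts[timestamp[:16]] += 1
    return counts
```

Passing an open log file to the function works, since it only iterates over lines.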
> We suspect that other problem with jobs being run multiple times is
> related to a problem with sending OBIT to pbs_server. Could you confirm
> the presence of such messages for your torque version?
>
We only see "normal" obit messages:
11/26/2010 00:05:45;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/26/2010 00:05:45;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/26/2010 00:05:45;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/26/2010 00:05:45;0080; pbs_mom;Job;8563915.xxx;obit sent to server
This seems to be "normal" behavior and is also unrelated to the staging.
> Thank you in advance for any help
>
> Best Regards
> --
> Lukasz Flis
> ACC Cyfronet AGH
> Poland
I'm afraid I don't have a simple solution.
Kind regards,
- Ramon.
--
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V
SARA - Computing & Networking Services
Science Park 121 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167