[torqueusers] Torque job failures during stagein

Ramon Bastiaans ramon.bastiaans at sara.nl
Fri Nov 26 02:24:53 MST 2010


Hi,

On 11/23/2010 06:00 PM, Lukasz Flis wrote:
> Hi,
>
> I've seen your post on torquedev mailing list.
>
> We observe the same behaviour on our site since we migrated from 2.3.9
> to 2.4.11.
>
The failures we experience only occur when the stagein source files do 
not exist.

I have filed a bug report for this:

  * http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96

> Could you please tell me which torque version are you currently using
> and have you managed to solve the problem?
>
For that particular bug, there is no simple solution. The jobs should 
time out, but in the meantime they can clutter the scheduling. As a 
workaround, we now run a script from cron that checks for jobs in the 
Hold state, which is very ugly.
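
This is not our actual script, just a minimal sketch of what such a 
check could look like. It assumes the default column layout of plain 
`qstat` output (Job id, Name, User, Time Use, S, Queue), where the state 
is the second-to-last column and "H" means held. The function names and 
the decision to only list the job IDs are assumptions for illustration; 
what to do with the held jobs is left open.

```python
import subprocess


def held_jobs(qstat_text):
    """Return the IDs of jobs whose state column is 'H'.

    Assumes default `qstat` output, where the state is the
    second-to-last whitespace-separated field on each job line.
    """
    held = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Header and separator lines never have 'H' in this position.
        if len(fields) >= 6 and fields[-2] == "H":
            held.append(fields[0])
    return held


def list_held_jobs_from_qstat():
    """Run `qstat` (assumed to be on PATH) and return the held job IDs."""
    result = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    )
    return held_jobs(result.stdout)
```

A cron wrapper could call `list_held_jobs_from_qstat()` and mail or log 
the result. Note that this lists every held job, including jobs held on 
purpose by their owners, so the output still needs a human look.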

> We also see many of the following messages in mom logs:
>
> 11/21/2010 19:16:31;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:32;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:33;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:34;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:35;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:36;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:37;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:37;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:38;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:39;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:40;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:41;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:42;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:43;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> 11/21/2010 19:16:44;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now
> in progress (115) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
>
I think these messages are unrelated to the staging. Perhaps you are 
experiencing a network issue.

We don't see these messages at all.
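
To check how often a node logs this error, something like the following 
could be run over a mom log. It is a rough sketch that only assumes the 
log format visible in the lines quoted above (a date at the start of 
each line and the text "client_to_svr - connection refused"); the 
function name is made up for illustration.

```python
from collections import Counter

# Text taken from the quoted pbs_mom log lines.
NEEDLE = "client_to_svr - connection refused"


def count_refused_by_day(log_lines):
    """Count 'connection refused' errors per date in pbs_mom log lines.

    Each log line is expected to start with a date such as 11/21/2010,
    followed by a space and the time.
    """
    counts = Counter()
    for line in log_lines:
        if NEEDLE in line:
            date = line.split(" ", 1)[0]
            counts[date] += 1
    return counts
```

The function takes any iterable of lines, so an open log file can be 
passed in directly. Lines that were wrapped by a mail client have to be 
joined again first.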

> We suspect that other problem with jobs being run multiple times is
> related to a problem with sending OBIT to pbs_server. Could you confirm
> the presence of such messages for your torque version?
>
We only see "normal" obit messages:

11/26/2010 00:05:45;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
11/26/2010 00:05:45;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/26/2010 00:05:45;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/26/2010 00:05:45;0080;   pbs_mom;Job;8563915.xxx;obit sent to server

This seems to be "normal" behavior and is also unrelated to the staging.

> Thank you in advance for any help
>
> Best Regards
> --
> Lukasz Flis
> ACC Cyfronet AGH
> Poland
I'm afraid I don't have a simple solution.


Kind regards,
- Ramon.

-- 
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V

SARA - Computing & Networking Services
Science Park 121     PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167

