[Mauiusers] pbs_mom: req_cpyfile, Unable to copy file

Arnau Bria arnaubria at pic.es
Thu Nov 8 09:17:06 MST 2007


Hi Valery,

thanks for your quick reply.

We have 3 CEs, and on two of them we have already set MaxStartups to
100 (and we get the errors there too).

Do you think we still need to increase this value?
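
For comparison, a minimal sketch of how the active setting could be checked on each CE. The helper name show_maxstartups and the default path /etc/ssh/sshd_config are assumptions for illustration, not taken from our setup:

```shell
# Hypothetical helper: print the first active MaxStartups line of an sshd config file.
# Assumption: the config is /etc/ssh/sshd_config unless another path is given.
show_maxstartups() {
    conf="${1:-/etc/ssh/sshd_config}"
    # Match only uncommented lines; the keyword is matched case-insensitively.
    grep -i -E '^[[:space:]]*MaxStartups[[:space:]]' "$conf" | head -n 1
}
```

Calling show_maxstartups with no argument reads the assumed default path. If nothing is printed, the line is absent or commented out and sshd is using its built-in default. Note that sshd only picks up a changed value after a restart or a SIGHUP.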

Cheers,
Arnau

On Thu, 8 Nov 2007 19:07:54 +0300 (MSK)
Valery Mitsyn wrote:

> Hola Arnau,
> 
> this can be the result of a bunch of simultaneous connections
> from the WNs to the CE. Check "MaxStartups" in /etc/ssh/sshd_config on the CE
> and try to increase it to 100; the default is 10, which can be
> too low in some situations.
> 
> On Thu, 8 Nov 2007, Arnau Bria wrote:
> 
> > Hi,
> >
> >
> > a couple of days ago I sent this e-mail to the torque list. I got no reply,
> > so I decided to post it here too; maybe someone has seen this error
> > before.
> >
> > Sorry in advance for the cross-posting.
> >
> >
> > we're getting sporadic errors when a job finishes running on a WN and
> > has to copy its output to the submitter host.
> >
> > We've configured ssh on our submitter/executor hosts so that no password
> > is requested, so for example:
> >
> > [root@td237 ~]# su - ops006
> > [ops006@td237 ~]$ ssh ce07 date
> > Scientific Linux CERN Release 3.0.8 (SL)
> > Tue Nov  6 12:09:05 CET 2007
> > [ops006@td237 ~]$
> >
> > But looking at the job's log on the WN we find:
> > Oct 24 02:20:48 td237 pbs_mom: req_cpyfile, Unable to copy file ops006@ce07.pic.es:/home/ops006/.lcgjm/globus-cache-export.Q18475/globus-cache-export.Q18475.gpg to globus-cache-export.Q18475.gpg
> >
> > and on the pbs server:
> > [root@pbs01 root]# grep 3145425 /var/spool/pbs/server_priv/accounting/200710*
> > /var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:18:46;Q;3145425.pbs01.pic.es;queue=gshort
> > /var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:21:44;D;3145425.pbs01.pic.es;requestor=ops006@ce07.pic.es
> >
> >
> > finally, maui's log:
> >
> > [root@pbs01 root]# grep 3145425 /var/log/maui.log*
> > /var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' loaded: 1   ops006 ops  86400       Idle   0 1193185126   [NONE] [NONE] [NONE] >=      0 >= 0 [slc4] 1193185244
> > /var/log/maui.log.1:10/24 02:20:44 MRMJobStart(3145425,Msg,SC)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobStart(3145425,base,Msg,SC)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobModify(3145425,Resource_List,Resource,td237.pic.es)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobModify(3145425,Resource_List,Resource,1)
> > /var/log/maui.log.1:10/24 02:20:44 WARNING:  cannot set job '3145425.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
> > /var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' successfully started
> > /var/log/maui.log.1:10/24 02:22:45 INFO:     active PBS job 3145425 has been removed from the queue.  assuming successful completion
> >
> >
> > As I mentioned at the beginning of the mail, the errors are sporadic,
> > but on certain days we find lots of them, on certain WNs. All WNs share
> > the same configuration, so there is no difference between them apart
> > from the jobs they are running.
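
To see whether the failures really cluster on particular days and WNs, they could be tallied from the WN syslog. This is only a sketch: the helper name count_cpyfile_errors and the log path /var/log/messages are assumptions, and it relies on the classic syslog layout seen in the pbs_mom line quoted above.

```shell
# Hypothetical helper: count req_cpyfile failures per day and host in a syslog file.
# Assumption: lines look like "Mon DD HH:MM:SS host pbs_mom: req_cpyfile, ...".
count_cpyfile_errors() {
    log="${1:-/var/log/messages}"
    # Keep only the copy failures, reduce each line to "month day host",
    # then count identical triples.
    grep 'pbs_mom: req_cpyfile, Unable to copy file' "$log" |
        awk '{ print $1, $2, $4 }' |
        sort | uniq -c
}
```

Each output line is a count followed by month, day and hostname. Running it over the collected logs of all WNs would show whether a few nodes or a few dates account for most of the errors.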
> >
> > Versions:
> > in WN:
> > [root@td237 ~]# rpm -qa|grep torque
> > torque-devel-2.1.8-1cri_sl4_1st.i386
> > torque-mom-2.1.8-1cri_sl4_1st.i386
> > torque-2.1.8-1cri_sl4_1st.i386
> > torque-client-2.1.8-1cri_sl4_1st.i386
> > torque-docs-2.1.8-1cri_sl4_1st.i386
> >
> > in server:
> > [root@pbs01 root]# rpm -qa|grep torque
> > torque-gui-2.1.8-1cri_sl3_1st
> > torque-client-2.1.8-1cri_sl3_1st
> > torque-server-2.1.8-1cri_sl3_1st
> > torque-2.1.8-1cri_sl3_1st
> >
> > TIA,
> > Arnau
> > _______________________________________________
> > mauiusers mailing list
> > mauiusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/mauiusers
> >
> 

