[torqueusers] Problem with rcp

Michael Homa mhoma at uic.edu
Wed Sep 14 10:32:17 MDT 2005


I am installing version 1.2.0p3 and have run into the same problem as Ryan 
Butler, who posted the following on July 5th:

   For my torque cluster, I can successfully submit jobs and execute
   them, but when the job tries to return the .OU and .ER files to the
   submitting node/server, it hangs with an 'E' in qstat, and i find
   {PBS_HOME}/spool/rcperr.**** (some number).  This file contains an rcp
   error.  Is this a problem with rcp?

When I run tracejob on a hello-world test job, I see the following error:

   09/14/2005 10:25:32  S    Post job file processing error

After multiple runs of the test job on one node with different settings for 
$logevent and $loglevel, I didn't see anything in the mom_logs to explain 
the problem. Then I started monkeying with different settings in the 
mom_config file and noticed that I get the following error (I added the 
brackets to highlight) in my mom_log when I restart the mom daemon:

   09/14/2005 11:00:33;0002;   pbs_mom;Svr;Log;Log opened
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;setloglevel;1
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;restricted;*.cc.uic.edu
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;max_load;10.0
   09/14/2005 11:00:33;0080;   pbs_mom;n/a;add_static;config[0] add name max_load value 10.0
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;usecp;argo-fs:/home/homes50 
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;usecp;argo-fs:/home/homes51 
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;usecp;argo-fs:/home/homes52 
   09/14/2005 11:00:33;0002;   pbs_mom;Svr;usecp;argo-fs:/home/homes53 
   09/14/2005 11:00:33;0002;   pbs_mom;n/a;initialize;independent
   09/14/2005 11:00:33;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in recov_tmsock, read
   09/14/2005 11:00:33;0001;   pbs_mom;Svr;pbs_mom;Inappropriate ioctl for device (25) in job_recov,   err from recov_tmsock

   09/14/2005 11:00:33;0002;   pbs_mom;Svr;pbs_mom;Is up
   09/14/2005 11:00:33;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
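
For anyone who wants to pull only the error records out of a longer mom_log, 
here is a minimal Python sketch. It assumes the layout seen in the excerpt 
above (semicolon-separated fields, with the second field being the event 
code and 0001 marking the error lines) and that lines not starting with a 
timestamp are wrapped continuations. The function names are made up for 
illustration and are not part of torque.

```python
import re

# A record starts with "MM/DD/YYYY HH:MM:SS;". Anything else is treated as a
# continuation of the previous record (for example, a line wrapped by a mailer).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_wrapped(lines):
    """Merge wrapped continuation lines back into single log records."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines inside the excerpt
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def error_records(lines, event_code="0001"):
    """Return (timestamp, message) pairs for records with the given event code.

    Assumption: fields are timestamp;code;daemon;class;name;message, as in
    the excerpt above.
    """
    hits = []
    for record in join_wrapped(lines):
        fields = [f.strip() for f in record.split(";", 5)]
        if len(fields) == 6 and fields[1] == event_code:
            hits.append((fields[0], fields[5]))
    return hits
```

Passing the lines of a log file (for example, from `open(path)` with 
whatever path your mom_logs live under) to `error_records` should return 
just the two recov_tmsock records from the excerpt above.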

Now maybe the error in the log is unrelated to the rcp problem, but this is 
the only evidence I can find other than the message in the tracejob. The 
machine argo-fs is not the head node; rather, it is a file server 
NFS-mounted to all machines, including the head. I tried pointing usecp at 
the head node instead, but got the same error. Looking through the torque 
archives, I've tried or checked the following:
    o permissions of 777 on /usr/spool/pbs/spool
    o declared both the alias and the fully-qualified domain name of the 
      head node in /etc/hosts.equiv on the head node
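
As a quick way to double-check the first item, here is a small Python sketch 
that reports the permission bits on a directory. The helper name is 
hypothetical; it only reads the mode via os.stat and says nothing about 
what torque itself requires. The sticky bit is reported separately because 
777 and 1777 differ only in that bit.

```python
import os
import stat


def describe_mode(path):
    """Report the permission bits of path (illustrative helper only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {
        "octal": format(mode, "04o"),                  # e.g. "0777" or "1777"
        "world_writable": bool(mode & stat.S_IWOTH),   # others may write
        "sticky": bool(mode & stat.S_ISVTX),           # sticky bit set
    }
```

Calling `describe_mode("/usr/spool/pbs/spool")` on each node would show 
whether the mode really is the same everywhere.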

I need fresh eyes and suggestions. Thanks.

Michael Homa
Systems Group
ACCC - University of Illinois at Chicago
