[torqueusers] Problem with rcp
Michael Homa
mhoma at uic.edu
Wed Sep 14 10:32:17 MDT 2005
Hi:
I am installing version 1.2.0p3 and have run into the same problem as Ryan
Butler, who posted the following on July 5th:
For my torque cluster, I can successfully submit jobs and execute
them, but when the job tries to return the .OU and .ER files to the
submitting node/server, it hangs with an 'E' in qstat, and i find
{PBS_HOME}/spool/rcperr.**** (some number). This file contains an rcp
error. Is this a problem with rcp?
When I run tracejob on a hello-world test job, I see the following error:
09/14/2005 10:25:32 S Post job file processing error
After multiple runs of the test job on one node with different settings for
$logevent and $loglevel, I didn't see anything in the mom_logs to explain
the problem. Then I started monkeying with different settings in the
mom_config file and noticed that I get the following error (I added the
rows of equals signs to highlight it) in my mom_log when I restart the
mom daemon:
09/14/2005 11:00:33;0002; pbs_mom;Svr;Log;Log opened
09/14/2005 11:00:33;0002; pbs_mom;Svr;setloglevel;1
09/14/2005 11:00:33;0002; pbs_mom;Svr;restricted;*.cc.uic.edu
09/14/2005 11:00:33;0002; pbs_mom;Svr;max_load;10.0
09/14/2005 11:00:33;0080; pbs_mom;n/a;add_static;config[0] add name max_load value 10.0
09/14/2005 11:00:33;0002; pbs_mom;Svr;usecp;argo-fs:/home/homes50 /home/homes50
09/14/2005 11:00:33;0002; pbs_mom;Svr;usecp;argo-fs:/home/homes51 /home/homes51
09/14/2005 11:00:33;0002; pbs_mom;Svr;usecp;argo-fs:/home/homes52 /home/homes52
09/14/2005 11:00:33;0002; pbs_mom;Svr;usecp;argo-fs:/home/homes53 /home/homes53
09/14/2005 11:00:33;0002; pbs_mom;n/a;initialize;independent
===================================================================================================
09/14/2005 11:00:33;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in recov_tmsock, read
09/14/2005 11:00:33;0001; pbs_mom;Svr;pbs_mom;Inappropriate ioctl for device (25) in job_recov, err from recov_tmsock
===================================================================================================
09/14/2005 11:00:33;0002; pbs_mom;Svr;pbs_mom;Is up
09/14/2005 11:00:33;0002; pbs_mom;n/a;is_update_stat;hello sent to server
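To pick entries like the two highlighted ones out of a longer mom_log, a small filter can help. The following is only a sketch: it assumes each record starts with a timestamp and has six semicolon-separated fields as in the excerpt above, it treats any other line as a wrapped continuation, and the field names (time, code, daemon, and so on) are labels chosen here for illustration, not official TORQUE terms.

```python
import re

# A record starts with "MM/DD/YYYY HH:MM:SS;". Anything else is assumed to be
# a wrapped continuation of the previous record (as happens in mail archives).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Hypothetical field names for the six semicolon-separated columns.
FIELDS = ("time", "code", "daemon", "obj_class", "obj_name", "message")


def parse_mom_log(text):
    """Return a list of dicts, one per log record, joining wrapped lines."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and hand-added separator rows
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            records[-1] += " " + line  # continuation of the previous record
    parsed = []
    for rec in records:
        parts = [p.strip() for p in rec.split(";", 5)]
        if len(parts) == len(FIELDS):
            parsed.append(dict(zip(FIELDS, parts)))
    return parsed


SAMPLE = """\
09/14/2005 11:00:33;0002; pbs_mom;n/a;initialize;independent
====================
09/14/2005 11:00:33;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9)
in recov_tmsock, read
09/14/2005 11:00:33;0001; pbs_mom;Svr;pbs_mom;Inappropriate ioctl for
device (25) in job_recov, err from recov_tmsock
====================
09/14/2005 11:00:33;0002; pbs_mom;Svr;pbs_mom;Is up
"""

# Keep only the records whose second field is 0001, like the highlighted ones.
errors = [r for r in parse_mom_log(SAMPLE) if r["code"] == "0001"]
for r in errors:
    print(r["time"], r["message"])
```

Run against a full log file, the same filter would list every 0001 entry, which makes it easier to see whether the recov_tmsock messages appear only at startup or also around the time a job finishes.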
The error in the log may be unrelated to the rcp problem, but it is the
only evidence I can find other than the message in the tracejob output. The
machine argo-fs is not the head node; rather, it is a file server
NFS-mounted on all machines, including the head. I tried changing the usecp
entries to point to the head node, but that resulted in the same error.
Looking through the torque archives, I've tried or checked the following:
o permissions of 777 on /usr/spool/pbs/spool
o declared both the alias and the fully-qualified domain name of the head
  node in /etc/hosts.equiv on the head node
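
For reference, the usecp entries in the log have the form "host:/remote/prefix /local/prefix". As far as the directive is generally understood, pbs_mom compares the host and path of the job's output destination against each entry and uses a local cp instead of rcp when one matches. The Python below is a rough sketch of that matching, not TORQUE's actual code: the wildcard handling via fnmatch, the function names, and the sample user directory, file name, and head.example host are all assumptions made for illustration.

```python
from fnmatch import fnmatch


def parse_usecp(line):
    """Parse 'host:/remote/prefix /local/prefix' into (host, prefix, local)."""
    remote, local = line.split()
    host, prefix = remote.split(":", 1)
    return host, prefix, local


def local_copy_target(destination, rules):
    """Return the local path to cp to if a rule covers 'host:/path'.

    Returns None when no rule matches, i.e. the case where a remote copy
    (rcp) would be attempted instead.
    """
    host, path = destination.split(":", 1)
    for rule_host, prefix, local in rules:
        base = prefix.rstrip("/")
        # Host must match (wildcards allowed here by assumption) and the
        # path must be the prefix itself or lie underneath it.
        if fnmatch(host, rule_host) and (path == prefix or path.startswith(base + "/")):
            return local.rstrip("/") + path[len(base):]
    return None


# One of the entries from the log above.
RULES = [parse_usecp("argo-fs:/home/homes50 /home/homes50")]

# Hypothetical destinations: same path, different host part.
print(local_copy_target("argo-fs:/home/homes50/user/hello.o1", RULES))
print(local_copy_target("head.example:/home/homes50/user/hello.o1", RULES))
```

Under these assumptions the first destination maps to a local path and the second does not, so the host part of the destination recorded for the job matters as much as the directory. Checking which host name actually appears in the job's output path may show whether any usecp entry can match at all.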
I need fresh eyes and suggestions. Thanks.
Michael Homa
Systems Group
ACCC - University of Illinois at Chicago