HPUX Re: [torqueusers] mom rejecting job?
Lippert, Kenneth B.
Kenneth.Lippert at alcoa.com
Wed Dec 14 07:52:12 MST 2005
Thanks for the quick reply.
Looking in the mom_logs of the client machines, I see no sign that
the server even tried to send a job.
I cleaned out all the mom files and directories, restarted the moms, and
now it all seems to be working, more or less. I also made sure to name
all my node and server references by their fully qualified (with domain)
names. I think that helped.
I still have one small problem. When a job finishes, the stdout and
stderr files are not properly copied back to the originating host. As
the admin, I get this:
>>>>>>>>>>>>>>>>>
Post job file processing error; job 34.server.ddd.alcoa.com on host
client.ddd.alcoa.com/0
Unable to copy file 34.server..OU to
server.ddd.alcoa.com:/nfshome/lippert/rpdd_cluster/job.o34
Unable to copy file 34.server..ER to
server.ddd.alcoa.com:/nfshome/lippert/rpdd_cluster/job.e34
<<<<<<<<<<<<<<<<<
This happens whether I submit from an NFS area or not, and regardless of
which machine I submitted from (it was the server in this case).
I still have more reading to do in the manual around NFS, rcp, and
stage-in, stage-out; so I may be doing something obviously stupid. I
know HPUX does not have ssh built in.
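For copy-back failures like the one above, the TORQUE MOM config has a `$usecp` directive that maps a host:path destination onto a locally mounted path, so pbs_mom does a plain local copy instead of going through rcp or scp. A minimal sketch for mom_priv/config on each compute node, assuming (this is not confirmed in this thread) that /nfshome is NFS-mounted at the same path on every node:

```
# mom_priv/config on each compute node (sketch, not a tested config)
# Assumption: /nfshome is the same NFS mount on server and clients.
# Destinations on any host in ddd.alcoa.com under /nfshome are then
# written locally to /nfshome rather than copied with rcp/scp.
$usecp *.ddd.alcoa.com:/nfshome /nfshome
```

pbs_mom has to reread its config (for example on restart) before the mapping applies. If the output directory is not shared, this does not help, and the rcp setup between the nodes and the server would have to be checked instead.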
Thanks for all your help. Torque is sure working out a lot better than
DQS.
-k
-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick
Staples
Sent: Tuesday, December 13, 2005 4:17 PM
To: torqueusers at supercluster.org
Subject: HPUX Re: [torqueusers] mom rejecting job?
On Tue, Dec 13, 2005 at 04:04:23PM -0500, Lippert, Kenneth B. alleged:
> Hello again.
>
> Back on the HPUX. I gave up trying to get the HPUX client to work with
> the Linux server, so I just made one of the HPUX machines a server, and
> set the HPUX machines as a separate cluster from the Linux one.
>
> Things are progressing. Now I can submit a job from any of the
> machines, but if I request it run anywhere except the server the job
> queues forever with the following from maui's "checkjob".
>
> job is deferred. Reason: RMFailure (cannot start job - RMFailure, rc:
> 15041, msg 'execution server rejected request MSG=sendfailed, STARTING')
So MOM running on the server host works, but others are rejecting? Must
be a config problem. Can we see a high loglevel snippet from the MOM
log?
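One way to get such a snippet, sketched here under the assumption that the installed TORQUE version supports the `$loglevel` MOM parameter, is to raise the logging verbosity in mom_priv/config on a rejecting node, restart pbs_mom, resubmit the job, and then read that day's file under mom_logs:

```
# mom_priv/config on a node that rejects jobs (sketch)
# Assumption: this TORQUE build understands $loglevel; higher values
# log more detail, with 7 being the most verbose.
$loglevel 7
```

Setting the value back down afterwards keeps the MOM logs from growing quickly.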
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California