HPUX Re: [torqueusers] mom rejecting job?

Lippert, Kenneth B. Kenneth.Lippert at alcoa.com
Wed Dec 14 07:52:12 MST 2005


Thanks for the quick reply.

Looking in the mom_logs on the client machines, I see no sign that the
server even tried to send a job.

I cleaned out all the mom files and directories, restarted the moms, and
now it all seems to be working, more or less.  I also made sure to refer
to all my nodes and the server by their fully qualified (with domain)
names.  I think that helped.
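
As an illustration of the fully qualified naming, a server_priv/nodes
file along these lines would list each MOM host by its full name.  This
is only a sketch using the two hostnames that appear in this message;
the real file may carry more nodes and per-node attributes.

```
server.ddd.alcoa.com
client.ddd.alcoa.com
```

The same fully qualified server name would then be used wherever the
MOMs and clients refer to the server.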

I still have one small problem.  When a job finishes, the stdout and
stderr files are not properly copied back to the originating host.  As
the admin, I get this:

>>>>>>>>>>>>>>>>>
Post job file processing error; job 34.server.ddd.alcoa.com on host
client.ddd.alcoa.com/0

Unable to copy file 34.server..OU to
server.ddd.alcoa.com:/nfshome/lippert/rpdd_cluster/job.o34

Unable to copy file 34.server..ER to
server.ddd.alcoa.com:/nfshome/lippert/rpdd_cluster/job.e34
<<<<<<<<<<<<<<<<<

This happens whether I submit from an NFS area or not, and regardless of
which machine I submitted from (it was the server in this case).
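
For the NFS case, pbs_mom can be told to deliver output with a local cp
instead of rcp through the $usecp directive.  A minimal sketch, assuming
/nfshome is mounted at the same path on every node (that mount layout is
an assumption, not something confirmed above):

```
# mom_priv/config on each compute node (sketch; the paths are assumed)
# Copy to <host>:/nfshome/... with a plain cp into the local /nfshome
$usecp *.ddd.alcoa.com:/nfshome /nfshome
```

For destinations that are not on a shared filesystem the copy still goes
through rcp, so the job owner would need to be able to rcp from the
client to the server without a password prompt.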

I still have more reading to do in the manual around NFS, rcp, and
stage-in, stage-out; so I may be doing something obviously stupid.  I
know HPUX does not have ssh built in, 

Thanks for all your help.  Torque is sure working out a lot better than
DQS.  

-k





-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick
Staples
Sent: Tuesday, December 13, 2005 4:17 PM
To: torqueusers at supercluster.org
Subject: HPUX Re: [torqueusers] mom rejecting job?

On Tue, Dec 13, 2005 at 04:04:23PM -0500, Lippert, Kenneth B. alleged:
> Hello again.  
> 
> Back on the HPUX.  I gave up trying to get the HPUX client to work with
> the Linux server, so I just made one of the HPUX machines a server, and
> set the HPUX machines as a separate cluster from the Linux one.
> 
> Things are progressing.  Now I can submit a job from any of the
> machines, but if I request it run anywhere except the server the job
> queues forever with the following from maui's "checkjob".
> 
> job is deferred.  Reason: RMFailure (cannot start job - RMFailure, rc:
> 15041, msg 'execution server rejected request MSG=sendfailed,
> STARTING')

So MOM running on the server host works, but others are rejecting?  Must
be a config problem.  Can we see a high loglevel snippet from the MOM
log?
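
A minimal sketch of how the MOM log verbosity could be raised, assuming
the installed TORQUE version supports the $loglevel directive (values
run from 0 to 7, higher being more verbose):

```
# mom_priv/config on one of the rejecting nodes (sketch)
$loglevel 7
```

After pbs_mom is restarted, the more detailed entries should show up in
that node's mom_logs directory the next time a job is sent to it.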

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

