[torqueusers] Removing the "exec_host" attribute from a queued job ?
Chris Samuel
csamuel at vpac.org
Tue Sep 20 19:47:24 MDT 2005
On Wed, 21 Sep 2005 10:22 am, Garrick Staples wrote:
> Sounds like the initially started job got to the point where it had
> copied the input files for the job before failing. It's worth
> discovering if the original MS node got a job start commit.
This is what happened on the node when it went bad:
pbs_mom;Svr;pbs_mom;Success (0) in TMomFinalizeJob3, read of pipe for sid failed for job 46533.edda-m.vpac.org (0 of 8 bytes)
pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
pbs_mom;Job;46533.edda-m.vpac.org;ALERT: job failed phase 3 start, server will retry
pbs_mom;Req;send_sisters;sending ABORT to sisters
momctl -d 1 doesn't show the mom as still holding the job, but I've done
a momctl -c 46533 to clear it anyway, just in case.
When it tried to restart last time the mom didn't log anything. :-(
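
For anyone wanting to check their own mom logs for the same failure, here is a rough sketch that pulls out the job IDs that hit the "failed phase 3 start" alert. It assumes the semicolon-separated message layout shown in the excerpt above and matches anywhere in the line, so any leading timestamp fields in a real log are ignored. The function name failed_phase3_jobs is made up for illustration and is not part of TORQUE.

```python
import re

# Hypothetical helper, not part of TORQUE. The pattern is based only on the
# message layout visible in the log excerpt above (an assumption).
ALERT = re.compile(
    r"pbs_mom;Job;(?P<jobid>[^;]+);ALERT: job failed phase 3 start"
)


def failed_phase3_jobs(lines):
    """Return job IDs, in first-seen order, that logged a phase 3 start failure."""
    seen = []
    for line in lines:
        match = ALERT.search(line)
        if match and match.group("jobid") not in seen:
            seen.append(match.group("jobid"))
    return seen


if __name__ == "__main__":
    sample = [
        "pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid",
        "pbs_mom;Job;46533.edda-m.vpac.org;ALERT: job failed phase 3 start, server will retry",
        "pbs_mom;Req;send_sisters;sending ABORT to sisters",
    ]
    for jobid in failed_phase3_jobs(sample):
        print(jobid)
```

To run it over a real log, pass an open file object instead of the sample list, since the function only iterates over lines.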
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia