[torqueusers] qsub -I problem with Torque 2.1.2.
Garrick Staples
garrick at usc.edu
Tue Dec 11 10:55:56 MST 2007
On Mon, Dec 10, 2007 at 09:44:34PM -0500, Brad Viviano alleged:
> Hello,
> I have a 64 node cluster running ROCKS version 4.2 x86_64. I have
> Torque 2.1.2 configured with Maui 3.2.6p19. I have created a dedicated
> submit node/compile node on this cluster where users login and submit
> jobs, separate from the ROCKS front end node (where Torque and Maui are
> running). On the submit node everything works fine
> for batch submitting, but not for interactive submitting. On the
> frontend (where the torque/maui servers are running) I can do both batch
> and interactive.
>
> frontend.local = ROCKS frontend where Torque/Maui servers are running
> submit0 = Submit node where I am running the qsub from
> compute-0-0 = compute node I am trying to submit to
>
> If I qsub -I from the submit node I get:
>
> [viviano at submit0 ~]$ qsub -I
> qsub: waiting for job 416878.frontend.local to start
> qsub: job 416878.frontend.local apparently deleted
>
> On compute-0-0 where the job is running I see the following in syslog:
>
> Dec 10 20:59:00 compute-0-0 pbs_mom: Connection refused (111) in
> TMomFinalizeChild, cannot open qsub sock
>
> the mom_logs for that node show:
>
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type QueueJob request received
> from PBS_Server at frontend.local, sock=10
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type ReadyToCommit request
> received from PBS_Server at frontend.local, sock=10
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type Commit request received
> from PBS_Server at frontend.local, sock=10
> 12/10/2007 21:09:15;0001; pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, before files staged, no retry
> 12/10/2007 21:09:15;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type StatusJob request received
> from PBS_Server at frontend.local, sock=12
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type ModifyJob request received
> from PBS_Server at frontend.local, sock=10
> 12/10/2007 21:09:15;0008; pbs_mom;Job;416881.frontend.local;Job
> Modified at request of PBS_Server at frontend.local
> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type DeleteJob request received
> from PBS_Server at frontend.local, sock=11
>
>
> It seems like there is some problem redirecting the terminal if I don't
> submit the job on the actual machine Torque/Maui are running on. I
> searched for the error "pbs_mom: Connection refused (111) in
> TMomFinalizeChild" but couldn't find anything related to this problem.
> Is this just not possible, or am I missing something? Any
> suggestions as to how to debug this would be appreciated.
For interactive jobs, qsub opens a port and waits for a connection from the pbs_mom.
In this case, I'd guess port filtering on the submit host is blocking that connection.
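
A rough way to test that guess, independent of Torque: listen on a port on
the submit host and try to connect to it from the compute node. The sketch
below assumes Python 3 is available on both nodes; the script name and the
function names open_listener and can_connect are made up for illustration
and are not part of Torque. It only checks plain TCP reachability, which is
what the "Connection refused (111)" message points at.

```python
import socket
import sys


def open_listener(host="0.0.0.0", port=0):
    """Listen on a TCP port; port=0 lets the OS pick a free one.

    This only imitates the idea of qsub -I waiting for a connection;
    it is not how qsub itself is implemented.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # A refused connection shows up here as ECONNREFUSED (111 on Linux).
        return False, str(exc)


if __name__ == "__main__":
    if sys.argv[1:2] == ["listen"]:
        # Run this on the submit host (submit0 in the example above).
        srv, port = open_listener()
        print("listening on port", port)
        conn, peer = srv.accept()
        print("connection from", peer)
        conn.close()
        srv.close()
    elif sys.argv[1:2] == ["connect"] and len(sys.argv) == 4:
        # Run this on the compute node: connect <submit host> <port>
        ok, reason = can_connect(sys.argv[2], int(sys.argv[3]))
        print("ok" if ok else "failed: " + reason)
```

Run it with "listen" on submit0, note the port it prints, then run it with
"connect submit0 <port>" on compute-0-0. If the same test works against the
frontend but fails against submit0, filtering on the submit host is the
likely culprit.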