[torqueusers] qsub -I problem with Torque 2.1.2.

Garrick Staples garrick at usc.edu
Tue Dec 11 10:55:56 MST 2007


On Mon, Dec 10, 2007 at 09:44:34PM -0500, Brad Viviano alleged:
> Hello,
>    I have a 64 node cluster running ROCKS version 4.2 x86_64.  I have 
> Torque 2.1.2 configured with Maui 3.2.6p19.  I have created a dedicated 
> submit/compile node on this cluster where users log in and submit 
> jobs, separate from the ROCKS frontend node (where Torque and Maui are 
> running).  On the submit node, batch submission works fine, but 
> interactive submission does not.  On the frontend (where the 
> Torque/Maui servers are running) I can do both batch and 
> interactive.
> 
> frontend.local = ROCKS frontend where Torque/Maui servers are running
> submit0 = Submit node where I am running the qsub from
> compute-0-0 = compute node I am trying to submit to
> 
> If I qsub -I from the submit node I get:
> 
> [viviano@submit0 ~]$ qsub -I
> qsub: waiting for job 416878.frontend.local to start
> qsub: job 416878.frontend.local apparently deleted
> 
> On compute-0-0 where the job is running I see the following in syslog:
> 
> Dec 10 20:59:00 compute-0-0 pbs_mom: Connection refused (111) in 
> TMomFinalizeChild, cannot open qsub sock
> 
> the mom_logs for that node show:
> 
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type QueueJob request received 
> from PBS_Server@frontend.local, sock=10
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ReadyToCommit request 
> received from PBS_Server@frontend.local, sock=10
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type Commit request received 
> from PBS_Server@frontend.local, sock=10
> 12/10/2007 21:09:15;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
> started, Failure job exec failure, before files staged, no retry
> 12/10/2007 21:09:15;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type StatusJob request received 
> from PBS_Server@frontend.local, sock=12
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ModifyJob request received 
> from PBS_Server@frontend.local, sock=10
> 12/10/2007 21:09:15;0008;   pbs_mom;Job;416881.frontend.local;Job 
> Modified at request of PBS_Server@frontend.local
> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type DeleteJob request received 
> from PBS_Server@frontend.local, sock=11
> 
> 
> It seems like there is some problem redirecting the terminal if I don't 
> submit the job on the actual machine Torque/Maui are running on.  I 
> searched for the error "pbs_mom: Connection refused (111) in 
> TMomFinalizeChild" but couldn't find anything related to this problem.
>    Is this just not possible, or am I missing something?  Any 
> suggestions as to how to debug this would be appreciated.

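For interactive jobs, qsub opens a port and waits for a connection from the
pbs_mom. The "Connection refused (111) ... cannot open qsub sock" message in
your syslog is the pbs_mom on compute-0-0 failing to make that connection
back to the host where qsub is running.

In this case, I'd guess port filtering on the submit host.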
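
One way to check that guess without involving Torque at all is to imitate
the pattern by hand: listen on a port on the submit host and try to connect
to it from the compute node. The following is only a rough sketch of that
test, not anything taken from Torque itself. The helper names open_listener
and can_connect are made up for illustration, and the listener just stands
in for the socket qsub would open.

```python
import socket


def open_listener(bind_addr="0.0.0.0"):
    """Listen on an OS-assigned TCP port, as a stand-in for qsub -I.

    Returns (listening_socket, port). Hypothetical helper, not a Torque API.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0 asks the OS to pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A refused, filtered, or timed-out connection returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On submit0 you would call open_listener(), note the port it returns, and
keep that process alive. On compute-0-0 you would then call
can_connect("submit0", port). If that fails from the compute node but
succeeds from the frontend, filtering on the submit host is the likely
culprit. The port is chosen by the OS each time, so a firewall rule that
only opens a few fixed ports would not be enough.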
