[torqueusers] qsub -I problem with Torque 2.1.2.

Brad Viviano viviano at renci.org
Mon Dec 10 19:44:34 MST 2007


Hello,
    I have a 64 node cluster running ROCKS version 4.2 x86_64.  I have 
Torque 2.1.2 configured with Maui 3.2.6p19.  I have created a dedicated 
submit node/compile node on this cluster where users log in and submit 
jobs, separate from the ROCKS front end node (where Torque and Maui are 
running).  On the submit node everything works fine for batch 
submitting, but not for interactive submitting.  On the frontend (where 
the torque/maui servers are running) I can do both batch and 
interactive.

frontend.local = ROCKS frontend where Torque/Maui servers are running
submit0 = Submit node where I am running the qsub from
compute-0-0 = compute node I am trying to submit to

If I qsub -I from the submit node I get:

[viviano@submit0 ~]$ qsub -I
qsub: waiting for job 416878.frontend.local to start
qsub: job 416878.frontend.local apparently deleted

On compute-0-0 where the job is running I see the following in syslog:

Dec 10 20:59:00 compute-0-0 pbs_mom: Connection refused (111) in TMomFinalizeChild, cannot open qsub sock

The mom_logs for that node show:

12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server@frontend.local, sock=10
12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server@frontend.local, sock=10
12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server@frontend.local, sock=10
12/10/2007 21:09:15;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry
12/10/2007 21:09:15;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@frontend.local, sock=12
12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server@frontend.local, sock=10
12/10/2007 21:09:15;0008;   pbs_mom;Job;416881.frontend.local;Job Modified at request of PBS_Server@frontend.local
12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server@frontend.local, sock=11


It seems like there is some problem redirecting the terminal if I don't 
submit the job on the actual machine Torque/Maui are running on.  I 
searched for the error "pbs_mom: Connection refused (111) in 
TMomFinalizeChild" but couldn't find anything related to this problem.
    Is this just not possible, or am I missing something?  Any 
suggestions as to how to debug this would be appreciated.
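
One way to narrow this down: the "Connection refused (111) ... cannot 
open qsub sock" message suggests that pbs_mom on compute-0-0 tried to 
open a TCP connection back to the waiting qsub process and was refused.  
A minimal sketch of a connectivity check follows, assuming (this is an 
assumption, not something confirmed above) that qsub -I listens on a 
TCP port on submit0 while it waits, and that the port number is looked 
up by hand on submit0.  The helper name can_connect is hypothetical.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection to host:port.

    Returns (True, "") on success, or (False, reason) if the
    connection is refused, times out, or the host cannot be resolved.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers ConnectionRefusedError, timeouts and name lookup errors.
        return False, str(exc)


# Hypothetical usage, run on compute-0-0:
#     python check_port.py submit0 <port>
# where <port> is whatever port the waiting qsub is observed to
# listen on (assumed to be found manually on submit0).
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    ok, reason = can_connect(host, port)
    if ok:
        print(f"{host}:{port} is reachable")
    else:
        print(f"{host}:{port} failed: {reason}")
```

If the same check succeeds against frontend.local but fails against 
submit0, that would point at a firewall or name resolution difference 
between the two hosts rather than at Torque itself.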

    Thanks,
       -Brad Viviano

