[torqueusers] qsub -I problem with Torque 2.1.2.

Brad Viviano viviano at renci.org
Tue Dec 11 12:20:35 MST 2007


Garrick,
	Thanks for writing back.  After several hours of playing around with 
different settings, it turns out the problem was the torque.cfg file on 
the submit node.  I had the following on both machines:

QSLEEP 3
SERVERHOST frontend.local

On the submit node that was causing the interactive jobs to get 
confused.  I deleted the SERVERHOST line but left the "server_name" file 
containing "frontend.local", and everything started working fine.
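
For anyone checking their own submit node, here is a minimal sketch of a 
check for the offending line.  It assumes torque.cfg is plain "KEY value" 
lines as shown above; the helper name find_serverhost is made up for 
illustration and is not part of Torque.

```python
def find_serverhost(cfg_text):
    """Return the values of any SERVERHOST lines in torque.cfg-style text."""
    hosts = []
    for line in cfg_text.splitlines():
        parts = line.split()
        # A matching line looks like: SERVERHOST <hostname>
        if len(parts) >= 2 and parts[0] == "SERVERHOST":
            hosts.append(parts[1])
    return hosts


# The two states of the submit node's torque.cfg described in this message.
before = "QSLEEP 3\nSERVERHOST frontend.local\n"
after = "QSLEEP 3\n"

if __name__ == "__main__":
    print("before:", find_serverhost(before))
    print("after: ", find_serverhost(after))
```

In practice you would read the real torque.cfg from your Torque home 
directory (the location varies by install) and pass its contents in.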

	Thanks,
		-Brad

Garrick Staples wrote:
> On Mon, Dec 10, 2007 at 09:44:34PM -0500, Brad Viviano alleged:
>> Hello,
>>    I have a 64 node cluster running ROCKS version 4.2 x86_64.  I have 
>> Torque 2.1.2 configured with Maui 3.2.6p19.  I have created a dedicated 
>> submit node/compile node on this cluster where users login and submit 
>> jobs, separate from the ROCKS front end node (where Torque and Maui are 
>> running).  On the submit node everything works fine 
>> for batch submitting, but not for interactive submitting.  On the 
>> frontend (where the torque/maui servers are running) I can do both batch 
>> and interactive.
>>
>> frontend.local = ROCKS frontend where Torque/Maui servers are running
>> submit0 = Submit node where I am running the qsub from
>> compute-0-0 = compute node I am trying to submit to
>>
>> If I qsub -I from the submit node I get:
>>
>> [viviano at submit0 ~]$ qsub -I
>> qsub: waiting for job 416878.frontend.local to start
>> qsub: job 416878.frontend.local apparently deleted
>>
>> On compute-0-0 where the job is running I see the following in syslog:
>>
>> Dec 10 20:59:00 compute-0-0 pbs_mom: Connection refused (111) in 
>> TMomFinalizeChild, cannot open qsub sock
>>
>> The mom_logs for that node show:
>>
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type QueueJob request received 
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ReadyToCommit request 
>> received from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type Commit request received 
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0001;   pbs_mom;Job;TMomFinalizeJob3;job not 
>> started, Failure job exec failure, before files staged, no retry
>> 12/10/2007 21:09:15;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type StatusJob request received 
>> from PBS_Server at frontend.local, sock=12
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type ModifyJob request received 
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0008;   pbs_mom;Job;416881.frontend.local;Job 
>> Modified at request of PBS_Server at frontend.local
>> 12/10/2007 21:09:15;0100;   pbs_mom;Req;;Type DeleteJob request received 
>> from PBS_Server at frontend.local, sock=11
>>
>>
>> It seems like there is some problem redirecting the terminal if I don't 
>> submit the job on the actual machine Torque/Maui are running on.  I 
>> searched for the error "pbs_mom: Connection refused (111) in 
>> TMomFinalizeChild" but couldn't find anything related to this problem.
>>    Is this just not possible, or am I missing something?  Any 
>> suggestions as to how to debug this would be appreciated.
> 
> For interactive jobs, qsub opens a port and waits for a connection from pbs_mom.
> 
> In this case, I'd guess port filtering on the submit host.
> 
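The connect-back described in the quoted reply can be sketched on loopback 
as follows.  This is an illustration of the general pattern only, not 
Torque's code: the function names are hypothetical, and 127.0.0.1 stands in 
for the submit host.  A refused connection here is the same kind of failure 
as the "Connection refused (111)" in the syslog line above, although in 
this thread the cause turned out to be torque.cfg rather than port 
filtering.

```python
import socket


def open_listener(host="127.0.0.1"):
    """Bind an ephemeral port and listen, playing the role of the waiting qsub."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=2.0):
    """Attempt the connect-back; return False if it is refused or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    srv, port = open_listener()
    print("listener up, reachable:", can_connect("127.0.0.1", port))
    srv.close()
    print("listener closed, reachable:", can_connect("127.0.0.1", port))
```

To test a real cluster, the listener would run on the submit host and 
can_connect would be called from a compute node with the submit host's 
name and the chosen port.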