[torqueusers] qsub -I problem with Torque 2.1.2.
Brad Viviano
viviano at renci.org
Tue Dec 11 12:20:35 MST 2007
Garrick,
Thanks for writing back. After several hours of playing around with
different settings, turns out the problem was the torque.cfg file on the
submit node. I had the following on both machines:
QSLEEP 3
SERVERHOST frontend.local
On the submit node, the SERVERHOST line was causing the interactive jobs to get
confused. I deleted the SERVERHOST line but left the "server_name" file
containing "frontend.local", and everything started working fine.
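
For anyone who finds this thread by searching for the "Connection refused (111)"
message: as Garrick notes in the quoted reply, an interactive qsub opens a port
and waits for the pbs_mom to connect back to it. The Python sketch below only
illustrates that connect-back pattern with plain TCP sockets. It is not Torque
code; the function names (open_listener, connect_back) and the use of
127.0.0.1 are made up for the example. On Linux, errno 111 is ECONNREFUSED,
which is what a connect attempt reports when nothing is listening on the
target port.

```python
import errno
import socket


def open_listener(host="127.0.0.1"):
    """Stand-in for interactive qsub: bind an OS-chosen port and listen."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0 asks the OS for any free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def connect_back(host, port, timeout=2.0):
    """Stand-in for pbs_mom: try to connect back to the waiting listener.

    Returns "ok" on success, "timeout" if the attempt timed out, or the
    symbolic errno name (for example "ECONNREFUSED") on other failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    srv, port = open_listener()
    # Listener is up, so the connect-back is accepted by the kernel.
    print("with listener:   ", connect_back("127.0.0.1", port))
    srv.close()
    # Nothing is listening on that port any more, so the connect is refused.
    print("without listener:", connect_back("127.0.0.1", port))
```

Running something like connect_back from a compute node against a port opened
on the submit node is one rough way to tell a refused connection apart from a
timeout (which would point more toward packet filtering).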
Thanks,
-Brad
Garrick Staples wrote:
> On Mon, Dec 10, 2007 at 09:44:34PM -0500, Brad Viviano alleged:
>> Hello,
>> I have a 64 node cluster running ROCKS version 4.2 x86_64. I have
>> Torque 2.1.2 configured with Maui 3.2.6p19. I have created a dedicated
>> submit node/compile node on this cluster where users login and submit
>> jobs, separate from the ROCKS front end node (where Torque and Maui are
>> running). On the submit node everything works fine
>> for batch submission, but not for interactive submission. On the
>> frontend (where the torque/maui servers are running) I can do both batch
>> and interactive.
>>
>> frontend.local = ROCKS frontend where Torque/Maui servers are running
>> submit0 = Submit node where I am running the qsub from
>> compute-0-0 = compute node I am trying to submit to
>>
>> If I qsub -I from the submit node I get:
>>
>> [viviano at submit0 ~]$ qsub -I
>> qsub: waiting for job 416878.frontend.local to start
>> qsub: job 416878.frontend.local apparently deleted
>>
>> On compute-0-0 where the job is running I see the following in syslog:
>>
>> Dec 10 20:59:00 compute-0-0 pbs_mom: Connection refused (111) in
>> TMomFinalizeChild, cannot open qsub sock
>>
>> the mom_logs for that node show:
>>
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type QueueJob request received
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type ReadyToCommit request
>> received from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type Commit request received
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0001; pbs_mom;Job;TMomFinalizeJob3;job not
>> started, Failure job exec failure, before files staged, no retry
>> 12/10/2007 21:09:15;0008; pbs_mom;Req;send_sisters;sending ABORT to
>> sisters
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type StatusJob request received
>> from PBS_Server at frontend.local, sock=12
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type ModifyJob request received
>> from PBS_Server at frontend.local, sock=10
>> 12/10/2007 21:09:15;0008; pbs_mom;Job;416881.frontend.local;Job
>> Modified at request of PBS_Server at frontend.local
>> 12/10/2007 21:09:15;0100; pbs_mom;Req;;Type DeleteJob request received
>> from PBS_Server at frontend.local, sock=11
>>
>>
>> It seems like there is some problem redirecting the terminal if I don't
>> submit the job on the actual machine Torque/Maui are running on. I
>> searched for the error "pbs_mom: Connection refused (111) in
>> TMomFinalizeChild" but couldn't find anything related to this problem.
>> Is this just not possible, or am I missing something? Any
>> suggestions as to how to debug this would be appreciated.
>
> For interactive jobs, qsub opens a port and waits for connection from the pbs_mom.
>
> In this case, I'd guess port filtering on the submit host.