[torqueusers] error in torque 1.2.0p6
jacksond at clusterresources.com
Thu Jan 12 23:25:08 MST 2006
I think your first step would be to upgrade to the latest TORQUE (i.e.,
2.0.0p5). Garrick contributed several patches to improve the stability
of pbs_sched. Your second step may be to move away from pbs_sched.
Please let us know if this fixes the instability.
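
Until the upgrade is in place, the cron-driven health check described in
the quoted message below could look roughly like the following. This is
only a sketch under stated assumptions: the host name, the restart
command path, and the function names are hypothetical and not taken from
TORQUE itself; only port 15004 comes from the log message. It also
assumes that a refused TCP connection on that port means pbs_sched is
down, which mirrors what PBS_Server reports but has not been verified
against any particular TORQUE release.

```python
import socket
import subprocess

# Assumptions for illustration only; adjust to the local installation.
SCHED_HOST = "localhost"                     # hypothetical: scheduler runs on this host
SCHED_PORT = 15004                           # port named in the PBS_Server log message
RESTART_CMD = ["/usr/local/sbin/pbs_sched"]  # hypothetical install path


def scheduler_is_up(host=SCHED_HOST, port=SCHED_PORT, timeout=5.0):
    """Return True if something accepts TCP connections on the scheduler port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111) as well as timeouts.
        return False


def check_and_restart(is_up=scheduler_is_up,
                      restart=lambda: subprocess.call(RESTART_CMD)):
    """Restart the scheduler if the probe fails; return True if a restart was attempted."""
    if is_up():
        return False
    restart()
    return True
```

A script run from cron would call check_and_restart() once per
invocation. The probe and the restart action are passed in as
parameters so that either can be swapped out, for example for a process
table lookup instead of a port probe.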
On Thu, 2006-01-12 at 17:43 -0800, Mr Tony Ling wrote:
> I have a 128-node cluster running torque 1.2.0p6. Every time a
> user submits a batch of jobs, the torque scheduler terminates
> itself and the following error appears in the log file. Users then
> can't submit any more jobs until the torque scheduler is
> restarted.
> PBS_Server;Connection refused (111) in contact_sched, Could not
> contact Scheduler - port 15004
> 01/12/2006 09:58:46;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
> I have had to write a cron job that checks the health of the torque
> scheduler process and starts it again if it is dead.
> Could anyone please help me with this? Thanks.