[torqueusers] torque 4.0.2
Coyle, James J [ITACD]
jjc at iastate.edu
Fri Jun 15 13:39:32 MDT 2012
Things to check:
1) A firewall between the compute nodes and the head node that does not have the Torque ports open to the compute nodes.
2) Wrong name in /var/spool/torque/server_name
3) The cluster is on an internal 172.16 network and the head node has two Ethernet connections:
a 172.16 internal IP address on eth1 for use by the cluster (named metis)
and a routable IP address on eth0 for access to the outside world.
For 3, I have fixed this by using
172.16.100.1 metis external.name.iastate.edu external.name
I also set the hostname to metis with /usr/bin/system-config-network
The name metis goes into /var/spool/torque/server_name
on the head node (metis) and on all compute nodes.
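
Checks 2 and 3 can be scripted and run on the head node and on each compute node. What follows is only a rough sketch, not part of Torque: it assumes the host line above lives in /etc/hosts (the message does not say so explicitly), and the function names and the default arguments (metis, the 172.16. prefix) are made up for illustration.

```python
from pathlib import Path

# Path taken from the message above; adjust if Torque was built with another spool dir.
SERVER_NAME_FILE = "/var/spool/torque/server_name"


def read_server_name(path=SERVER_NAME_FILE):
    """Return the first non-empty line of the server_name file, stripped."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            return line.strip()
    return ""


def hosts_entries(hosts_text):
    """Map every name and alias in hosts-file text to its IP address."""
    names = {}
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:
            ip, *hostnames = fields
            for name in hostnames:
                names.setdefault(name, ip)  # keep the first entry seen for a name
    return names


def check_node(expected="metis", internal_prefix="172.16.",
               server_name_path=SERVER_NAME_FILE, hosts_path="/etc/hosts"):
    """Return a list of human-readable problems found on this node."""
    problems = []
    try:
        name = read_server_name(server_name_path)
        if name != expected:
            problems.append(f"server_name is {name!r}, expected {expected!r}")
    except OSError as exc:
        problems.append(f"cannot read {server_name_path}: {exc}")
    try:
        ip = hosts_entries(Path(hosts_path).read_text()).get(expected)
        if ip is None:
            problems.append(f"{expected!r} has no entry in {hosts_path}")
        elif not ip.startswith(internal_prefix):
            problems.append(f"{expected!r} maps to {ip}, not the internal network")
    except OSError as exc:
        problems.append(f"cannot read {hosts_path}: {exc}")
    return problems
```

Calling check_node() on a node returns an empty list when the server name and the host entry agree with the setup described above; anything else in the list is worth a look.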
James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Friday, June 15, 2012 10:46 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] torque 4.0.2
I don't think that pbs_sched is the way to go for a basic setup - I recommend Maui. I think pbs_sched takes some more work before it will actually start scheduling (perhaps someone else with more experience with pbs_sched can offer some quick setup steps?) but once you get Maui talking to pbs_server it will run jobs for you. I recommend you go that way.
On Fri, Jun 15, 2012 at 6:56 AM, Delphine Ramalingom <delphine.ramalingom at univ-reunion.fr> wrote:
I've installed torque 4.0.2, but jobs stay in the queue unless I run qrun on them.
I've installed the default pbs_sched.
momctl reports that no local jobs were detected, which is wrong.
Do you have any idea what the problem is? Thanks.
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
29.metis                  ExampleJob       dramalin               0 Q
32.metis                  ExampleJob       dramalin               0 Q
# momctl -h metis.univ.run -d 0
Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807
Server: metis.univ.run (10.90.0.12:15001)
Last Msg From Server: 281 seconds (DeleteJob)
Last Msg To Server: 41 seconds
MOM active: 1947 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
NOTE: no local jobs detected
# momctl -p 15002 -h metis.univ.run -d 3
ERROR: query 'diag3' failed on metis.univ.run (errno=0 - Success :
0 - Success)
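
Since the query against port 15002 fails, one thing that may be worth ruling out is the firewall case from the checklist at the top of this thread. The sketch below only tests whether a TCP connection can be opened. The port numbers are the ones that appear in the output above (15001 for the server, 15002 for the MOM); the function names are made up for illustration, and an open port does not prove that the daemon behind it is healthy.

```python
import socket

# Ports as they appear in the momctl output above; a site may use others.
TORQUE_PORTS = {"pbs_server": 15001, "pbs_mom": 15002}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (dropped packets) and DNS failures.
        return False


def report(host, ports=TORQUE_PORTS):
    """Return {service: reachable} for each named port on host."""
    return {service: port_open(host, port) for service, port in ports.items()}
```

Running report("metis.univ.run") from a compute node, and the same call against a compute node's name from the head node, shows which direction is blocked, if any.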
On 13/06/12 20:09, David Beer wrote:
> This is an issue that is fixed in subsequent releases of 4.0.0. Please
> download 4.0.2:
> and the problem will be resolved.
David Beer | Software Engineer