[torqueusers] torque 4.0.2

Coyle, James J [ITACD] jjc at iastate.edu
Fri Jun 15 13:39:32 MDT 2012


Things to check:

1)      A firewall between the compute nodes and the head node that does not have the Torque ports open to the compute nodes.

2)      The wrong name in /var/spool/torque/server_name.

3)      The cluster is on an internal 172.16 network and the head node has two Ethernet connections:

a 172.16 internal IP address on eth1 for use by the cluster (named metis)

and a routable IP address on eth0 for access to the outside world.

For 3, I have fixed this by using    metis  external.name.iastate.edu  external.name

I also set the hostname to metis with /usr/bin/system-config-network.

The name metis goes into /var/spool/torque/server_name
on the head node (metis) and on all compute nodes.
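
A quick way to verify item 2 on each machine is sketched below. This is only an illustration, not part of Torque: the function name check_server_name is made up for this example, and the expected name (metis) is the one used on my cluster, so substitute your own.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Torque): compare the name stored
# in a server_name file with the name the head node is expected to use.
#   $1 = path to the server_name file
#   $2 = expected head node name (e.g. metis)
check_server_name() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    # server_name holds a single host name; take the first line and
    # strip any stray whitespace before comparing.
    actual=$(head -n 1 "$file" | tr -d '[:space:]')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file contains $expected"
        return 0
    fi
    echo "MISMATCH: $file contains '$actual', expected '$expected'"
    return 1
}
```

Run it as  check_server_name /var/spool/torque/server_name metis  on the head node and on every compute node; any MISMATCH line points at a node that needs its server_name file corrected.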

James Coyle, PhD
High Performance Computing Group
 Iowa State Univ.
web: http://jjc.public.iastate.edu/<http://www.public.iastate.edu/~jjc>

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Friday, June 15, 2012 10:46 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] torque 4.0.2

I don't think that pbs_sched is the way to go for a basic setup - I recommend Maui. I think pbs_sched takes some more work before it will actually start scheduling (perhaps someone else with more experience with pbs_sched can offer some quick setup steps?) but once you get Maui talking to pbs_server it will run jobs for you. I recommend you go that way.
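
One thing that may be worth checking with either scheduler is whether the server's scheduling attribute is turned on, since pbs_server does not ask a scheduler to run jobs while it is false. The sketch below is an assumption about what might be wrong here, not a confirmed diagnosis; the function name scheduling_enabled is made up for illustration, and it only inspects text in the format printed by qmgr -c "print server".

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `qmgr -c "print server"` on
# stdin and succeeds only if the scheduling attribute is set to True.
scheduling_enabled() {
    grep -Eiq '^set server scheduling *= *true'
}

# Example use on the head node, as root (left commented out because it
# needs a running pbs_server):
#   qmgr -c "print server" | scheduling_enabled \
#       || qmgr -c "set server scheduling = true"
```

If the check fails, setting the attribute as shown in the comment and then resubmitting a test job is a cheap experiment before digging deeper.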


On Fri, Jun 15, 2012 at 6:56 AM, Delphine Ramalingom <delphine.ramalingom at univ-reunion.fr> wrote:
Dear David,

I've installed torque 4.0.2, but jobs stay in the queue unless I run qrun
as root.
I've installed the default pbs_sched.
momctl reports that no local jobs were detected, which is wrong.

Do you have any idea what the problem is? Thanks.

# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
29.metis                   ExampleJob       dramalin               0 Q
32.metis                   ExampleJob       dramalin               0 Q

# momctl -h metis.univ.run -d 0

Host: metis.univ.run/metis.univ.run   Version: 4.0.2   PID: 2807
Server[0]: metis.univ.run (<>)
  Last Msg From Server:   281 seconds (DeleteJob)
  Last Msg To Server:     41 seconds
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             1947 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
NOTE:  no local jobs detected

diagnostics complete

# momctl -p 15002 -h metis.univ.run -d 3
ERROR:    query[0] 'diag3' failed on metis.univ.run (errno=0 - Success :
0 - Success)


Le 13/06/12 20:09, David Beer a écrit :
> Delphine,
> This is an issue that is fixed in the releases after 4.0.0. Please
> download 4.0.2:
> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz
> and the problem will be resolved.
> David


David Beer | Software Engineer
Adaptive Computing
