[torqueusers] torque 4.0.2
Gus Correa
gus at ldeo.columbia.edu
Fri Jun 15 14:19:18 MDT 2012
On 06/15/2012 03:33 PM, Andrus, Brian Contractor wrote:
> Delphine,
>
> Check your queues and ensure they are enabled and started. Eg:
> qmgr -c 'set queue tiny enabled = True'
> qmgr -c 'set queue tiny started = True'
>
>
> Also on your jobs that all have the same $PBS_TASKNUM, you need to submit them as array jobs (eg #PBS -t 10)
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
... and to enable scheduling:
qmgr -c 'set server scheduling = True'
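To confirm that all three flags actually took effect, you can inspect the output of qmgr -c 'print server'. The following is only a sketch: check_scheduling is a made-up helper name, and it assumes the usual "create queue NAME" / "set ... = True" line format of that output.

```shell
# Hypothetical helper: reads `qmgr -c 'print server'` output on stdin and
# reports a server that is not scheduling and any queue that is not both
# enabled and started. An attribute that was never set does not appear in
# the output at all, so a missing line is reported too.
check_scheduling() {
    cfg=$(cat)
    status=0
    printf '%s\n' "$cfg" | grep -q '^set server scheduling = True[[:space:]]*$' \
        || { echo "server: scheduling is not True"; status=1; }
    # Queue names come from the "create queue NAME" lines.
    for q in $(printf '%s\n' "$cfg" | sed -n 's/^create queue \([^ ]*\)[[:space:]]*$/\1/p'); do
        for attr in enabled started; do
            printf '%s\n' "$cfg" | grep -q "^set queue $q $attr = True[[:space:]]*\$" \
                || { echo "queue $q: $attr is not True"; status=1; }
        done
    done
    return $status
}

# On the head node (assumes qmgr is in your PATH):
#   qmgr -c 'print server' | check_scheduling && echo "scheduling flags look fine"
```

If it prints nothing and returns 0, the flags are set and the problem lies elsewhere.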
***
Can the server name in mom_priv/config be resolved by
the compute nodes?
It is typically listed in /etc/hosts and associated with your cluster's
private subnet. For example:
mom_priv/config:
$pbsserver headnode
/etc/hosts:
192.168.1.1 headnode
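A quick way to test this on a compute node is sketched below. The function names are made up for illustration, the config path is taken from the HomeDirectory that momctl reports later in this thread (adjust it to your install), and getent is assumed to be available, as it is on most Linux systems.

```shell
# Hypothetical helper: print the host named by the $pbsserver directive
# in the MOM config file given as the first argument.
pbsserver_name() {
    awk '$1 == "$pbsserver" { print $2; exit }' "$1"
}

# Succeeds if the name resolves through the system resolver
# (/etc/hosts, DNS, or whatever nsswitch.conf lists).
resolves() {
    getent hosts "$1" >/dev/null
}

# On a compute node:
#   name=$(pbsserver_name /var/spool/torque/mom_priv/config)
#   if resolves "$name"; then echo "$name resolves"; else echo "$name does NOT resolve"; fi
```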
***
Did you run 'pbsnodes' to see which nodes/moms respond?
Did you check the server and mom logs for possible error messages?
Did you check /var/log/messages for errors?
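For the pbsnodes check, 'pbsnodes -l' lists the nodes that are down, offline, or unknown. If you would rather filter the full 'pbsnodes -a' listing, something like the sketch below might do. It assumes the usual layout (node name at the start of a line, indented "state = ..." underneath), and unhealthy_nodes is a made-up name.

```shell
# Hypothetical helper: reads `pbsnodes -a` output on stdin and prints every
# node whose state field mentions down, offline, or unknown.
unhealthy_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }    # an unindented line starts a new node record
        $1 == "state" && $2 == "=" && $3 ~ /down|offline|unknown/ {
            print node ": " $3
        }
    '
}

# On the head node:
#   pbsnodes -a | unhealthy_nodes
```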
I hope this helps,
Gus Correa
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom
> Sent: Friday, June 15, 2012 5:57 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] torque 4.0.2
>
> Dear David,
>
> I've installed torque 4.0.2, but jobs stay in the queue unless I run qrun as root.
> I've installed the default pbs_sched.
> momctl reports that no local jobs were detected, which is wrong.
>
> Do you have any idea what the problem is? Thanks.
>
> # qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 29.metis                  ExampleJob       dramalin               0 Q batch
> 32.metis                  ExampleJob       dramalin               0 Q batch
>
>
> # momctl -h metis.univ.run -d 0
>
> Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807
> Server[0]: metis.univ.run (10.90.0.12:15001)
> Last Msg From Server: 281 seconds (DeleteJob)
> Last Msg To Server: 41 seconds
> HomeDirectory: /var/spool/torque/mom_priv
> MOM active: 1947 seconds
> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
> NOTE: no local jobs detected
>
> diagnostics complete
>
> # momctl -p 15002 -h metis.univ.run -d 3
> ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success)
>
> delphine
>
>
> On 13/06/12 20:09, David Beer wrote:
>> Delphine,
>>
>> This is an issue that is fixed in subsequent releases of 4.0.0. Please
>> download 4.0.2:
>> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz
>> and the problem will be resolved.
>>
>> David
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers