[torqueusers] torque 4.0.2

Gus Correa gus at ldeo.columbia.edu
Fri Jun 15 14:19:18 MDT 2012


On 06/15/2012 03:33 PM, Andrus, Brian Contractor wrote:
> Delphine,
>
> Check your queues and ensure they are enabled and started. E.g.:
> 	qmgr -c 'set queue tiny enabled = True'
> 	qmgr -c 'set queue tiny started = True'
>
>
> Also, your jobs that all have the same $PBS_TASKNUM need to be submitted as array jobs (e.g. #PBS -t 10).
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>

... and to enable scheduling:

qmgr -c 'set server scheduling = True'
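To confirm that these settings took effect, you can scan the output of
qmgr -c 'print server'. The following is only a minimal sketch: it assumes
the usual "set server ..." / "set queue ..." line format of that output, and
check_torque_settings is a hypothetical helper name, not part of Torque.

```shell
# Read the output of "qmgr -c 'print server'" on stdin and summarize whether
# scheduling is on and whether the named queue is enabled and started.
# check_torque_settings is a hypothetical helper, not a Torque command.
check_torque_settings() {
    awk -v q="$1" '
        $1 == "set" && $2 == "server" && $3 == "scheduling"        { sched = $5 }
        $1 == "set" && $2 == "queue" && $3 == q && $4 == "enabled" { en = $6 }
        $1 == "set" && $2 == "queue" && $3 == q && $4 == "started" { st = $6 }
        END {
            # Attributes that never appeared are reported as "unset".
            printf "scheduling=%s enabled=%s started=%s\n",
                (sched != "" ? sched : "unset"),
                (en != "" ? en : "unset"),
                (st != "" ? st : "unset")
        }
    '
}

# Usage on the server host (queue name "batch" taken from the qstat output):
#   qmgr -c 'print server' | check_torque_settings batch
```

Any field that is not reported as True points at the matching qmgr
'set' command above.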

***

Can the compute nodes resolve the server name given in
mom_priv/config?
Typically the name is listed in /etc/hosts and mapped to an address
on your cluster's private subnet. For example:

mom_priv/config:
$pbsserver	headnode

/etc/hosts:
192.168.1.1  headnode
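
A quick way to cross-check the two files on a compute node is sketched
below. It is only an illustration: pbsserver_name and hosts_lookup are
hypothetical helper names, and the file paths in the usage comment are
assumptions that depend on where Torque is installed.

```shell
# Print the server name set by the "$pbsserver" line of a mom config file.
# (Hypothetical helper; $1 is the path to the config file.)
pbsserver_name() {
    awk '$1 == "$pbsserver" { print $2; exit }' "$1"
}

# Print the address that a hosts-format file ($1) maps the name ($2) to.
# Prints nothing if the name is not listed.
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # ignore comments
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$1"
}

# Usage (paths are assumptions; adjust to your installation):
#   server=$(pbsserver_name /var/spool/torque/mom_priv/config)
#   hosts_lookup /etc/hosts "$server"
```

An empty result means the name is missing from /etc/hosts. On Linux,
"getent hosts headnode" asks the system resolver directly, which also
covers DNS.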

***
Did you run 'pbsnodes' to see which nodes/moms respond?
Did you check the server and mom logs for possible error messages?
Did you check /var/log/messages for errors?
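
These checks can be scripted roughly as follows. This is a sketch under
assumptions: recent_errors is a hypothetical helper, the search pattern is
only a guess at useful keywords, and the log paths assume the Torque home
is /var/spool/torque (as in the momctl output) with date-named log files.

```shell
# Print the last lines (default 20) of a log file that mention an error.
# $1 is the log file, $2 an optional line count. Hypothetical helper.
recent_errors() {
    grep -iE 'error|fail|cannot|refused' "$1" | tail -n "${2:-20}"
}

# Usage (paths are assumptions; adjust to your installation):
#   pbsnodes -a
#   recent_errors /var/spool/torque/server_logs/$(date +%Y%m%d)
#   recent_errors /var/spool/torque/mom_logs/$(date +%Y%m%d)
#   recent_errors /var/log/messages
```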

I hope this helps,
Gus Correa


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom
> Sent: Friday, June 15, 2012 5:57 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] torque 4.0.2
>
> Dear David,
>
> I've installed torque 4.0.2, but jobs stay queued unless I run qrun as root.
> I've installed the default pbs_sched.
> momctl reports that no local jobs were detected, which is wrong.
>
> Do you have any idea what the problem is? Thanks.
>
> # qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 29.metis                   ExampleJob       dramalin               0 Q batch
> 32.metis                   ExampleJob       dramalin               0 Q batch
>
>
> # momctl -h metis.univ.run -d 0
>
> Host: metis.univ.run/metis.univ.run   Version: 4.0.2   PID: 2807
> Server[0]: metis.univ.run (10.90.0.12:15001)
>     Last Msg From Server:   281 seconds (DeleteJob)
>     Last Msg To Server:     41 seconds
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             1947 seconds
> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> NOTE:  no local jobs detected
>
> diagnostics complete
>
> # momctl -p 15002 -h metis.univ.run -d 3
> ERROR:    query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success)
>
> delphine
>
>
> On 13/06/12 20:09, David Beer wrote:
>> Delphine,
>>
>> This is an issue that is fixed in subsequent releases of 4.0.0. Please
>> download 4.0.2:
>> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz
>> and the problem will be resolved.
>>
>> David
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


