[torqueusers] torque 4.0.2

Delphine Ramalingom delphine.ramalingom at univ-reunion.fr
Mon Jun 18 04:28:02 MDT 2012


Thanks for your suggestions.

I think the problem is that I'm on a workstation: a single machine running
all three daemons (pbs_server, pbs_mom and pbs_sched).

Delphine

On 16/06/12 00:19, Gus Correa wrote:
> On 06/15/2012 03:33 PM, Andrus, Brian Contractor wrote:
>> Delphine,
>>
>> Check your queues and ensure they are enabled and started. Eg:
>> 	qmgr -c 'set queue tiny enabled = True'
>> 	qmgr -c 'set queue tiny started = True'
>>
>>
>> Also, for your jobs that all have the same $PBS_TASKNUM, you need to submit them as array jobs (e.g. #PBS -t 10)
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>>
> ... and to enable scheduling:
>
> qmgr -c 'set server scheduling = True'
>
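The three settings suggested above (queue enabled, queue started, server scheduling on) can be collected into one small helper. This is only a sketch: the function name `torque_enable_cmds` is made up for illustration, and the queue name defaults to `batch` because that is the queue shown in the qstat output further down in this thread. It only prints the commands, so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: print the qmgr commands discussed in this thread.
# Nothing is executed here; the output is meant to be reviewed first.
torque_enable_cmds() {
    queue="${1:-batch}"   # assumption: default taken from the qstat output in this thread
    printf "qmgr -c 'set queue %s enabled = True'\n" "$queue"
    printf "qmgr -c 'set queue %s started = True'\n" "$queue"
    printf "qmgr -c 'set server scheduling = True'\n"
}

torque_enable_cmds batch
```

To apply the settings rather than print them, the output could be piped to `sh` by a user with Torque manager rights, for example `torque_enable_cmds batch | sh`.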
> ***
>
> Can the server name in mom_priv/config be resolved by
> the compute nodes?
> Typically it is listed in /etc/hosts and associated with your
> cluster's private subnet. For example:
>
> mom_priv/config:
> $pbsserver	headnode
>
> /etc/hosts:
> 192.168.1.1  headnode
>
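The name-resolution question above can be checked mechanically. The sketch below assumes the `$pbsserver <name>` line format shown in the example and looks for that name in a hosts file; the function name `check_pbsserver_in_hosts` is hypothetical, and the check covers the hosts file only (it does not query DNS).

```shell
#!/bin/sh
# Hypothetical helper: check that the $pbsserver name from a mom_priv/config
# file has an entry in a hosts file. Both paths are arguments, so the check
# can be tried on copies of the real files.
check_pbsserver_in_hosts() {
    config="$1"
    hosts="${2:-/etc/hosts}"
    # Take the name from the first "$pbsserver <name>" line.
    server=$(awk '$1 == "$pbsserver" { print $2; exit }' "$config")
    if [ -z "$server" ]; then
        echo "no \$pbsserver line in $config" >&2
        return 2
    fi
    # Strip comments, then compare the name with every hostname field (2..NF).
    if awk -v name="$server" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }' "$hosts"
    then
        echo "$server: found in $hosts"
    else
        echo "$server: not found in $hosts" >&2
        return 1
    fi
}
```

With the home directory reported by momctl in this thread, a call might look like `check_pbsserver_in_hosts /var/spool/torque/mom_priv/config`, which falls back to /etc/hosts for the second argument.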
> ***
> Did you run 'pbsnodes' to see which nodes/moms respond?
> Did you check the server and mom logs for possible error messages?
> Did you check /var/log/messages for errors?
>
> I hope this helps,
> Gus Correa
>
>
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom
>> Sent: Friday, June 15, 2012 5:57 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] torque 4.0.2
>>
>> Dear David,
>>
>> I've installed torque 4.0.2, but jobs stay in the queue unless I run qrun as root.
>> I've installed the default pbs_sched.
>> momctl reports that no local jobs are detected, which is wrong...
>>
>> Do you have any idea what the problem is? Thanks.
>>
>> # qstat
>> Job id                    Name             User            Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 29.metis                   ExampleJob       dramalin               0 Q batch
>> 32.metis                   ExampleJob       dramalin               0 Q batch
>>
>>
>> # momctl -h metis.univ.run -d 0
>>
>> Host: metis.univ.run/metis.univ.run   Version: 4.0.2   PID: 2807
>> Server[0]: metis.univ.run (10.90.0.12:15001)
>>      Last Msg From Server:   281 seconds (DeleteJob)
>>      Last Msg To Server:     41 seconds
>> HomeDirectory:          /var/spool/torque/mom_priv
>> MOM active:             1947 seconds
>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> NOTE:  no local jobs detected
>>
>> diagnostics complete
>>
>> # momctl -p 15002 -h metis.univ.run -d 3
>> ERROR:    query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success)
>>
>> delphine
>>
>>
>> On 13/06/12 20:09, David Beer wrote:
>>> Delphine,
>>>
>>> This issue is fixed in the releases after 4.0.0. Please download 4.0.2:
>>> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz
>>> and the problem will be resolved.
>>>
>>> David
>>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


