[torqueusers] wrong pbs server name

Gus Correa gus at ldeo.columbia.edu
Thu May 21 11:50:12 MDT 2009


Hi Samir

Besides Jerry's suggestion about name resolution, here are a few more things to check:

1) Make sure the server has scheduling turned on
( ... well, sometimes it is not ...):

qmgr -c 'list server scheduling' (to check, and if it is not:)

qmgr -c 'set server scheduling = True'

2) Make sure the scheduler daemon is running.
Assuming you are using the standard pbs_sched:

service pbs_sched status

If it is not running:

service pbs_sched start

If you use the Maui scheduler instead, make sure pbs_sched is NOT running:

service pbs_sched stop

and that the Maui scheduler is running:

service maui status (to check and if not up:)

service maui start

3) Can your nodes resolve the server name?
I.e., does "ping rufian.perrera.local" work from a node?
If not, you may have to add it to /etc/hosts on each node
(assuming YellowDog for PPC keeps the hosts file there, as RHEL,
CentOS, and Fedora do).

4) Make sure your nodes are listed in the head node's
$PBS_HOME/server_priv/nodes file, each with the right
np=number-of-processors.

5) On the nodes, also check which $pbsserver is set in
$PBS_HOME/mom_priv/config.  It should name your server.

6) Searching the system logs for errors may help, both on the nodes and on
the head node.  On my systems they are in /var/log/messages;
I don't know where YellowDog puts them.
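
Checks 1) and 2) could be scripted roughly as follows. This is only a
sketch: the function names scheduling_enabled and running_schedulers are
made up for illustration, it assumes qmgr prints a line of the form
"scheduling = True", and it looks for the daemons with pgrep instead of
the service command.

```shell
#!/bin/sh
# Sketch only; function names are hypothetical, not part of Torque.

# Read the output of `qmgr -c 'list server scheduling'` on stdin and
# succeed if it contains a line of the form "scheduling = True"
# (assumed output format).
scheduling_enabled() {
    grep -Eq '^[[:space:]]*scheduling[[:space:]]*=[[:space:]]*True[[:space:]]*$'
}

# Print the name of each scheduler daemon that is currently running.
running_schedulers() {
    for daemon in pbs_sched maui; do
        if pgrep -x "$daemon" >/dev/null 2>&1; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Only run the checks where qmgr is actually installed and on the PATH.
if command -v qmgr >/dev/null 2>&1; then
    if qmgr -c 'list server scheduling' | scheduling_enabled; then
        echo "scheduling is on"
    else
        echo "scheduling is off; try: qmgr -c 'set server scheduling = True'"
    fi
    count=$(running_schedulers | wc -l)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one scheduler daemon, found $count"
    fi
fi
```

Run it on the head node as a user allowed to query the server with qmgr.
It only reports what it finds and does not change any setting.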

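For checks 3) to 5), plus Jerry's question about server_name, something
like the sketch below may save some typing. The helper names (is_loopback,
mom_server, list_nodes) are hypothetical, the PBS_HOME default is an
assumption (adjust it to your install), getent is assumed to be available
as on most Linux systems, and a node without np= is treated as np=1.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical, not part of Torque.
# The default below is an assumption: point PBS_HOME at your spool directory.
PBS_HOME="${PBS_HOME:-/var/spool/torque}"

# Succeed if the given address is a loopback address (127.x.x.x or ::1).
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the value of $pbsserver from a pbs_mom config file.
mom_server() {
    awk '$1 == "$pbsserver" { print $2; exit }' "$1"
}

# Print "name np" for each node in a nodes file, skipping comments and
# blank lines; np is taken as 1 when no np= entry is present.
list_nodes() {
    awk '!/^[ \t]*(#|$)/ {
        np = 1
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=/) np = substr($i, 4)
        print $1, np
    }' "$1"
}

# Only run the checks where the files exist.
if [ -r "$PBS_HOME/server_name" ]; then
    name=$(cat "$PBS_HOME/server_name")
    addr=$(getent hosts "$name" | awk '{ print $1; exit }')
    if [ -z "$addr" ]; then
        echo "$name does not resolve"
    elif is_loopback "$addr"; then
        echo "$name resolves to loopback address $addr"
    fi
    if [ -r "$PBS_HOME/mom_priv/config" ]; then
        mom=$(mom_server "$PBS_HOME/mom_priv/config")
        [ "$mom" = "$name" ] || echo "mom_priv/config has \$pbsserver '$mom', server_name has '$name'"
    fi
fi
if [ -r "$PBS_HOME/server_priv/nodes" ]; then
    list_nodes "$PBS_HOME/server_priv/nodes"
fi
```

On a one-machine setup like Samir's, all three files live on the same host,
so a single run covers the head node and the compute node at once.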

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Jerry Smith wrote:
> Samir,
> 
> What do you have in $PBS_HOME/{server_name,default_server}?
> 
> It should be the name that resolves to the address of the Ethernet
> interface that pbs should be listening on.
> 
> --Jerry
> 
> 
> 
> Samir Gartner wrote:
>> Ok I finally installed torque under yellowdog/ppc but now I have 
>> another problem. I set up my pbs server as rufian.perrera.local but 
>> when I submit a job it shows up under localhost.localdomain and it 
>> stays in the queued state forever. And if I try to qdel the job it can't 
>> reach the server and the connection times out. Any ideas what could 
>> be wrong?
>> I'm not trying to set up anything complicated; it is just one machine 
>> that works as server and client.
>>
>> This is the shell output:
>>
>> [root at rufian bin]# /opt/pbs/bin/qstat -a
>>
>> rufian.perrera.local:
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 7.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>> 8.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>> 9.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>> 10.localhost.loc     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>> [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>> Connection timed out
>> qdel: cannot connect to server localhost.localdomain (errno=110) Connection timed out
>> You have new mail in /var/spool/mail/root
>> [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>> qdel: Unknown Job Id 7.rufian.perrera.local
>> [root at rufian bin]# su - samir
>> [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>> Connection timed out
>> qdel: cannot connect to server localhost.localdomain (errno=110) Connection timed out
>> [samir at rufian ~]$
>>
> 


