[torqueusers] wrong pbs server name
Gus Correa
gus at ldeo.columbia.edu
Thu May 21 11:50:12 MDT 2009
Hi Samir
Besides Jerry's suggestion about name resolution:
1) Make sure the server has scheduling turned on
(sometimes it is not). To check:
qmgr -c 'list server scheduling'
If it is off, turn it on:
qmgr -c 'set server scheduling = True'
2) Make sure the scheduler daemon is running.
Assuming you are using the standard pbs_sched:
service pbs_sched status
If it is not running:
service pbs_sched start
If you use the Maui scheduler instead, make sure pbs_sched is NOT running:
service pbs_sched stop
and that Maui itself is running:
service maui status
If it is not up:
service maui start
3) Can your nodes resolve the server name?
I.e., does "ping rufian.perrera.local" work from a node?
If not, you may have to add it to /etc/hosts on each node
(assuming Yellow Dog for PPC keeps the hosts file there, as RHEL,
CentOS, and Fedora do).
4) Make sure your nodes are listed in the head node's
$PBS_HOME/server_priv/nodes file, each with the right
np=number-of-processors.
5) On the nodes, also check which $pbsserver is set in
$PBS_HOME/mom_priv/config.
6) Searching the system logs for errors may help, both on the nodes and
on the head node. Here they go to /var/log/messages.
I don't know where Yellow Dog puts them.
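
Checks 4 and 5 above, together with Jerry's question about the server_name file, could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and the /var/spool/torque fallback for PBS_HOME is an assumption that may not match your install. It just reads the three files and warns when the names look wrong.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of TORQUE.
# Assumption: PBS_HOME falls back to /var/spool/torque if unset.
PBS_HOME="${PBS_HOME:-/var/spool/torque}"

# Print the first line of <pbs_home>/server_name.
server_name_from_file() {
    head -n 1 "$1/server_name" 2>/dev/null
}

# Print the value on the first "$pbsserver" line of <pbs_home>/mom_priv/config.
mom_pbsserver() {
    awk '$1 == "$pbsserver" { print $2; exit }' "$1/mom_priv/config" 2>/dev/null
}

# Print "nodename np" for each entry in <pbs_home>/server_priv/nodes,
# or "nodename missing" when the entry has no np= attribute.
nodes_with_np() {
    awk '!/^#/ && NF {
             np = "missing"
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^np=/) np = substr($i, 4)
             print $1, np
         }' "$1/server_priv/nodes" 2>/dev/null
}

# Summarize the three files and warn about suspicious server names.
check_pbs_names() {
    home="${1:-$PBS_HOME}"
    srv=$(server_name_from_file "$home")
    mom=$(mom_pbsserver "$home")
    echo "server_name:    ${srv:-<not set>}"
    echo "mom \$pbsserver: ${mom:-<not set>}"
    case "$srv" in
        ""|localhost*)
            echo "WARNING: server_name is empty or a localhost name" ;;
    esac
    if [ "$srv" != "$mom" ]; then
        echo "WARNING: server_name and \$pbsserver differ"
    fi
    echo "nodes (name np):"
    nodes_with_np "$home"
}
```

After loading these functions, run "check_pbs_names" as root on the head node (mom_priv/config is usually not world-readable), or "check_pbs_names /some/other/pbs_home" to point it at a different directory. A name of localhost.localdomain in the output would be consistent with the job IDs Samir is seeing.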
I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Jerry Smith wrote:
> Samir,
>
> What do you have in $PBS_HOME/{server_name,default_server}?
>
> It should be what resolves as the ethernet address that pbs should be
> listening on.
>
> --Jerry
>
>
>
> Samir Gartner wrote:
>> Ok, I finally installed torque under yellowdog/ppc but now I have
>> another problem. I set up my pbs server as rufian.perrera.local but
>> when I submit a job it shows up under localhost.localdomain and it
>> stays in the queued state forever. And if I try to qdel the job, it
>> can't reach the server and the connection times out. Any ideas what
>> could be wrong?
>> I'm not trying to set up anything complicated; it is just one machine
>> that works as server and client.
>>
>> This is the shell output:
>>
>> [root@rufian bin]# /opt/pbs/bin/qstat -a
>>
>> rufian.perrera.local:
>>
>>                                                                          Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 7.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> 8.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> 9.localhost.loca     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> 10.localhost.loc     samir    batch    STDIN                --     1  --     -- 01:00 Q    --
>> [root@rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>> Connection timed out
>> qdel: cannot connect to server localhost.localdomain (errno=110)
>> Connection timed out
>> You have new mail in /var/spool/mail/root
>> [root@rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>> qdel: Unknown Job Id 7.rufian.perrera.local
>> [root@rufian bin]# su - samir
>> [samir@rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>> Connection timed out
>> qdel: cannot connect to server localhost.localdomain (errno=110)
>> Connection timed out
>> [samir@rufian ~]$
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers