[torqueusers] PBS not starting

Vadivelan Ranjith achillesvelan at yahoo.co.in
Thu Jan 18 15:59:49 MST 2007


Hi,
Thanks for the reply.
/etc/pbs.conf currently contains the following:
----------------------------------------------------------------------------------------------------
velan@galaxy:/etc$ cat pbs.conf
PBS_EXEC=/usr/pbs
PBS_HOME=/var/spool/torque
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
velan@galaxy:/etc$
----------------------------------------------------------------------------------------------------
Torque is installed in /usr/local/torque, and there are PBS_HOME directories in two places:
1. /var/spool/torque (contains sched_*, server_* and mom_*)
2. /usr/spool/PBS (contains maui, sched_*, server_*, ...). There is no mom directory here.
I changed /etc/pbs.conf to:
----------------------------------------------------------------------------------------------------
PBS_EXEC=/usr/local
PBS_HOME=/usr/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
----------------------------------------------------------------------------------------------------
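
With two candidate PBS_HOME trees on the machine, one way to decide which one /etc/pbs.conf should point at is to list which daemon directories each tree actually contains. The following is only a rough sketch: check_pbs_home is a hypothetical helper, mom_priv is the name that appears in the error output, and server_priv and sched_priv are assumed from the server_* and sched_* entries mentioned above.

```shell
#!/bin/sh
# Report which expected daemon directories exist under a candidate PBS_HOME.
# check_pbs_home is a hypothetical helper; the subdirectory names are assumptions.
check_pbs_home() {
    home=$1
    for sub in server_priv sched_priv mom_priv; do
        if [ -d "$home/$sub" ]; then
            echo "$home/$sub: present"
        else
            echo "$home/$sub: missing"
        fi
    done
}

# The two locations mentioned in this message.
check_pbs_home /var/spool/torque
check_pbs_home /usr/spool/PBS
```

A tree that reports mom_priv as missing would be expected to produce the "cannot change directory to home" error if pbs_mom were started against it.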
Then I ran /sbin/service pbs restart:
----------------------------------------------------------------------------------------------------
root@galaxy:/etc# /sbin/service pbs restart
Restarting PBS
Stopping PBS
This is secondary server, killing process.
PBS server - was pid: 7925
PBS sched - was pid: 7941
Starting PBS
PBS server
PBS sched
root@galaxy:/etc# /sbin/service pbs status
pbs_server is pid 8210
pbs_sched is pid 8226

----------------------------------------------------------------------------------------------------

Now pbs_server is running, though I don't know whether what I did was correct.
I then tried to start maui, but it gave the same error as before:
----------------------------------------------------------------------------------------------------
root@galaxy:/etc# /usr/local/maui-3.2.6p16/sbin/maui restart
ERROR:    cannot open user interface socket on port 42559
----------------------------------------------------------------------------------------------------

I ran ps aux | grep maui:
----------------------------------------------------------------------------------------------------
root@galaxy:/etc# ps aux | grep maui
root      6759  0.0  0.4 31712 20200 ?       S    03:19   0:00 /usr/local/maui-3.2.6p16/sbin/maui start
root      8310  0.0  0.0  4812  648 pts/1    S+   04:26   0:00 grep maui
----------------------------------------------------------------------------------------------------
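
The listing above shows a maui process (pid 6759) that has been running since 03:19. That would be consistent with a second instance failing to open port 42559, although this is an inference from the output and not something confirmed in this thread. Below is a minimal sketch for pulling the PIDs of already-running maui daemons out of ps aux output, so they can be inspected before another start is attempted. maui_pids is a hypothetical helper, and it assumes the usual ps aux layout in which the command is the 11th field.

```shell
#!/bin/sh
# Print the PID of every maui daemon found in `ps aux` output read from stdin.
# maui_pids is a hypothetical helper. It matches only when the command itself
# is maui (with or without a path), so the "grep maui" line is skipped.
maui_pids() {
    awk '$11 ~ /(^|\/)maui$/ { print $2 }'
}

# Usage on a live system:
#   ps aux | maui_pids
```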

Parallel jobs are running now, but I am still not able to start maui. Is that because I disabled the queue? Thank you for your valuable help.

Thanks
Velan


David Chin <david.w.h.chin at gmail.com> wrote:

Looks like a few things possibly happening:

1. there may be an old pbs_server process running.
    Check with "ps -elf | grep pbs_"
2. also a pbs_sched process.
3. does the directory /usr/spool/PBS/mom_priv
    exist? Or did the new version put the PBS
    directories elsewhere?
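
The three checks above can be done in one pass. This is only a rough sketch: count_pbs is a hypothetical helper, the daemon names pbs_server, pbs_sched and pbs_mom are assumed, and the mom_priv path is the one taken from the error message quoted below.

```shell
#!/bin/sh
# Count processes per PBS daemon name. Input: one command name (optionally
# with a path) per line on stdin. count_pbs is a hypothetical helper.
count_pbs() {
    awk '
        { name = $1; sub(/.*\//, "", name) }          # strip any leading path
        name ~ /^pbs_(server|sched|mom)$/ { n[name]++ }
        END { for (d in n) print d, n[d] }
    ' | sort
}

# Checks 1 and 2, on a live system:
#   ps -e -o comm= | count_pbs
# More than one pbs_server or pbs_sched would suggest an old process is
# still running.

# Check 3: does the mom directory named in the error exist?
if [ -d /usr/spool/PBS/mom_priv ]; then
    echo "mom_priv exists"
else
    echo "mom_priv missing"
fi
```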

Cheers,
   Dave

On 1/18/07, Vadivelan Ranjith  wrote:
> Hi Friends,
> We used PBS and upgraded to torque-2.1.0p0. Jobs and queues were all working
> fine. Two days ago I stopped the queue and shut down the machine for
> renovation. Today I booted the frontend and all compute nodes, and I was
> shocked to see error messages. When I started PBS on the frontend, it gave
> the following output:
> ----------------------------------------------------------------------------------------------------------------------
> root@galaxy:/usr/local# /sbin/service pbs restart
> Restarting PBS
> Stopping PBS
> Starting PBS
> PBS_Server: Resource temporarily unavailable (11) in PBS_Server, pbs_server:
> another server running
>
> pbs_server: another server running
> PBS server
> cannot change directory to home '/usr/spool/PBS/mom_priv': No such file or
> directory
> PBS mom
> pbs_sched: Address already in use (98) in main, bind
> PBS sched
> root@galaxy:/usr/local#
> ----------------------------------------------------------------------------------------------------------------------
>
>
> I am not able to start maui either. It gives the following error:
> ----------------------------------------------------------------------------------------------------------------------
> root@galaxy:/usr/local# /usr/local/maui-3.2.6p16/sbin/maui start
> ERROR:    cannot open user interface socket on port 42559
> ----------------------------------------------------------------------------------------------------------------------
>
>
>
> I submitted jobs and forced them to run. Jobs with one processor run
> fine, but with two processors I get the following error:
>
> ----------------------------------------------------------------------------------------------------------------------
>
> mpdboot_node02.cluster2.iitb.ac.in (handle_mpd_output 359):
> failed to ping mpd on node01; recvd output={}
>
> mpiexec_node02.cluster2.iitb.ac.in: cannot connect to local
> mpd (/tmp/mpd2.console_velan); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_velan); possible
> causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> ----------------------------------------------------------------------------------------------------------------------
>
> My job file contains the following:
> ----------------------------------------------------------------------------------------------------------------------
> #!/bin/bash
>
> #PBS -l nodes=2:ppn=1
>
> cd $HOME/2DSIM/1proc
>
> n=`/usr/local/bin/pbs2mpich2hosts.py $PBS_NODEFILE hosts`
>
> /usr/local/bin/mpdboot -n $n -f hosts -r rsh --mpd=/usr/local/bin/mpd
> /usr/local/bin/mpiexec -n 1 /home/aero/velan/2DSIM/1proc/pg170x91.exe
> /usr/local/bin/mpdallexit
> rm -f hosts
> ----------------------------------------------------------------------------------------------------------------------
>
> Can anybody please help me? I did not configure this machine myself and I am
> new to this. Thank you very much for your kind help.
>
> Regards
> Velan
>
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>


-- 
Email: david.w.h.chin at gmail.com    dwchin at lroc.harvard.edu
Public key: http://gallatin.physics.lsa.umich.edu/~dwchin/crypto.html
      pub   1024D/1C557DDF 2006-07-21 [expires: 2007-07-21]
      Key fingerprint = 4EEB A409 5010 3679 4EA7  D420 4E52 202A 1C55 7DDF


 				