[torqueusers] PBS not starting

David Chin david.w.h.chin at gmail.com
Thu Jan 18 14:27:14 MST 2007


Looks like a few things may be happening:

1. An old pbs_server process may still be running.
    Check with "ps -elf | grep pbs_".
2. An old pbs_sched process may also still be running;
    the "Address already in use" error on bind points to that.
3. Does the directory /usr/spool/PBS/mom_priv
    exist? Or did the new version put the PBS
    directories elsewhere?
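
A minimal sketch of those three checks, assuming a Linux frontend with bash.
The function names are made up for illustration, and /usr/spool/PBS is just
the path taken from the error message quoted below; pass a different
directory as the first argument if your install lives elsewhere.

```shell
#!/bin/bash
# Illustrative helpers only; these are not part of TORQUE.

# List running processes whose command line contains "pbs_"
# (pbs_server, pbs_sched, pbs_mom). Writing the pattern as [p]bs_
# keeps grep from matching its own entry in the process list.
list_pbs_procs() {
    ps -elf | grep '[p]bs_'
}

# Report whether the given directory exists.
check_pbs_dir() {
    local dir="$1"
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}

# Assumed default: the PBS home named in the pbs_mom error message.
pbs_home="${1:-/usr/spool/PBS}"

list_pbs_procs || echo "no pbs_ processes found"
check_pbs_dir "$pbs_home/mom_priv" || echo "look for the PBS directories elsewhere"
```

If the first check lists leftover pbs_server or pbs_sched processes, stop
them before running the init script again.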

Cheers,
   Dave

On 1/18/07, Vadivelan Ranjith <achillesvelan at yahoo.co.in> wrote:
> Hi Friends,
> We used PBS and upgraded to torque-2.1.0p0. Jobs and queues were all working
> fine. Two days ago I stopped the queue and shut down the machine for
> renovation. Today I booted the frontend and all compute nodes. When I
> started PBS on the frontend, I was shocked to see the following error
> message:
> ----------------------------------------------------------------------------------------------------------------------
> root@galaxy:/usr/local# /sbin/service pbs restart
> Restarting PBS
> Stopping PBS
> Starting PBS
> PBS_Server: Resource temporarily unavailable (11) in PBS_Server, pbs_server:
> another server running
>
> pbs_server: another server running
> PBS server
> cannot change directory to home '/usr/spool/PBS/mom_priv': No such file or
> directory
> PBS mom
> pbs_sched: Address already in use (98) in main, bind
> PBS sched
> root@galaxy:/usr/local#
> ----------------------------------------------------------------------------------------------------------------------
>
>
> I am also not able to start maui. It gave the following error:
> ----------------------------------------------------------------------------------------------------------------------
> root@galaxy:/usr/local# /usr/local/maui-3.2.6p16/sbin/maui start
> ERROR:    cannot open user interface socket on port 42559
> ----------------------------------------------------------------------------------------------------------------------
>
>
>
> I submitted jobs (I forced them to run). Jobs with one processor run
> fine. If I request two processors, I get the following error:
>
> ----------------------------------------------------------------------------------------------------------------------
>
> mpdboot_node02.cluster2.iitb.ac.in (handle_mpd_output 359):
> failed to ping mpd on node01; recvd output={}
>
> mpiexec_node02.cluster2.iitb.ac.in: cannot connect to local
> mpd (/tmp/mpd2.console_velan); possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_velan); possible
> causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> ----------------------------------------------------------------------------------------------------------------------
>
> My job file contains the following:
> ----------------------------------------------------------------------------------------------------------------------
> #!/bin/bash
>
> #PBS -l nodes=2:ppn=1
>
> cd $HOME/2DSIM/1proc
>
> n=`/usr/local/bin/pbs2mpich2hosts.py $PBS_NODEFILE hosts`
>
> /usr/local/bin/mpdboot -n $n -f hosts -r rsh --mpd=/usr/local/bin/mpd
> /usr/local/bin/mpiexec -n 1 /home/aero/velan/2DSIM/1proc/pg170x91.exe
> /usr/local/bin/mpdallexit
> rm -f hosts
> ----------------------------------------------------------------------------------------------------------------------
>
> Can anybody please help me? I did not configure this machine myself and I
> am new to this. Thank you very much for your kind help.
>
> Regards
> Velan


-- 
Email: david.w.h.chin at gmail.com    dwchin at lroc.harvard.edu
Public key: http://gallatin.physics.lsa.umich.edu/~dwchin/crypto.html
      pub   1024D/1C557DDF 2006-07-21 [expires: 2007-07-21]
      Key fingerprint = 4EEB A409 5010 3679 4EA7  D420 4E52 202A 1C55 7DDF

