[torqueusers] Jobs don't start!

Lorenzo Campo lorenzo118 at interfree.it
Wed Oct 12 03:58:08 MDT 2005


Hi,
I recently installed Torque 1.2.0p6 on my Linux cluster (16 Pentium 4 
3.0 GHz machines running Fedora Core 3), following *EXACTLY* what is 
written in the QuickStart Guide. Everything works until I try to run a 
multi-processor job: it never starts and stays in the queue indefinitely. 
If I type qstat -f I get the following output:

Job Id: 2.medusa.dicea.unifi.it
     Job_Name = script
     Job_Owner = fcapa at medusa.dicea.unifi.it
     job_state = Q
     queue = batch
     server = medusa.dicea.unifi.it
     Checkpoint = u
     ctime = Wed Oct 12 11:28:45 2005
     Error_Path = medusa.dicea.unifi.it:/home/fcapa/script.err
     exec_host = medusa001.dicea.unifi.it/0+medusa000.dicea.unifi.it/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = a
     mtime = Wed Oct 12 11:41:48 2005
     Output_Path = medusa.dicea.unifi.it:/home/fcapa/script.out
     Priority = 0
     qtime = Wed Oct 12 11:28:45 2005
     Rerunable = True
     Resource_List.nodect = 2
     Resource_List.nodes = 2
     Resource_List.walltime = 00:05:00
     Variable_List = PBS_O_HOME=/home/fcapa,PBS_O_LANG=en_US.UTF-8,
         PBS_O_LOGNAME=fcapa,
         PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
         in:/home/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/
         opt/env-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/o
         pt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/op
         t/pbs/bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,
         PBS_O_MAIL=/var/spool/mail/fcapa,PBS_O_SHELL=/bin/bash,
         PBS_O_HOST=medusa.dicea.unifi.it,PBS_O_WORKDIR=/home/fcapa,
         MODULE_VERSION_STACK=3.1.6,
         MANPATH=:/opt/modules/default/man:/opt/kernel_picker/man:/opt/env-swit
         cher/man:/opt/mpich-1.2.5.10-ch_p4-gcc/man:/usr/share/man:/usr/man:/usr
         /local/share/man:/usr/local/man:/usr/X11R6/man:/opt/pvm3/man,
         HOSTNAME=medusa.dicea.unifi.it,PVM_RSH=ssh,
         _MODULESBEGINENV_=/home/fcapa/.modulesbeginenv,TERM=xterm,
         SHELL=/bin/bash,HISTSIZE=1000,SSH_CLIENT=::ffff:150.217.9.147 3199 22,
         MODULE_SHELL=sh,OLDPWD=/home/fcapa,SSH_TTY=/dev/pts/6,
         MODULE_OSCAR_USER=fcapa,USER=fcapa,
         LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:
         cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00
         ;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00
         ;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;
         31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*
         .cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35
         :*.png=00;35:*.tif=00;35:,LD_LIBRARY_PATH=:/home/lcampo/rams/RAMS/test,
         ENV=/home/fcapa/.bashrc,OSCAR_HOME=/opt/oscar,PVM_ROOT=/opt/pvm3,
         PVM_ARCH=LINUX,MODULE_VERSION=3.1.6,MAIL=/var/spool/mail/fcapa,
         PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/ho
         me/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/opt/en
         v-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/opt/pvm
         3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/opt/pbs/
         bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,INPUTRC=/etc/inputrc,
         PWD=/home/fcapa,
         _LMFILES_=/opt/modules/oscar-modulefiles/kernel_picker/1.4.1.3:/opt/en
         v-switcher/share/env-switcher/mpi/mpich-ch_p4-gcc-1.2.5.10:/opt/modules
         /oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/defau
         lt-manpath/1.0.1:/opt/modules/oscar-modulefiles/pvm/3.4.4+11:/opt/modul
         es/modulefiles/oscar-modules/1.0.5,LANG=en_US.UTF-8,
         MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-mod
         ulefiles:/opt/modules/version:/opt/modules/$MODULE_VERSION/modulefiles:
         /opt/modules/modulefiles:,
         LOADEDMODULES=kernel_picker/1.4.1.3:mpi/mpich-ch_p4-gcc-1.2.5.10:switc
         her/1.0.13:default-manpath/1.0.1:pvm/3.4.4+11:oscar-modules/1.0.5,
         SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=1,
         HOME=/home/fcapa,LOGNAME=fcapa,
         SSH_CONNECTION=::ffff:150.217.9.147 3199 ::ffff:150.217.9.65 22,
         MODULESHOME=/opt/modules/3.1.6,LESSOPEN=|/usr/bin/lesspipe.sh %s,
         G_BROKEN_FILENAMES=1,_=/usr/local/bin/qsub,PBS_O_QUEUE=batch
     comment = Not Running - PBS Error: Execution server rejected request
         MSG=send failed, STARTING
     etime = Wed Oct 12 11:28:45 2005
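
Since the listing above is dominated by Variable_List, a small filter can pull out only the attributes that matter here. The following is a minimal sketch and not part of Torque: qstat_fields is a hypothetical helper name, and it assumes that every attribute starts on a line of the form "name = value" and that any other line continues the previous value.

```shell
#!/bin/sh
# Print selected attributes from `qstat -f` output read on stdin,
# re-joining values that were wrapped over several lines.
# Hypothetical helper; usage: qstat -f JOBID | qstat_fields job_state comment
qstat_fields() {
    awk -v want="$*" '
        # Print the attribute collected so far if it was requested.
        function emit() {
            if (name in keep) print name " = " value
            name = ""
        }
        BEGIN {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++) keep[w[i]] = 1
        }
        # A line of the form "    name = value" starts a new attribute.
        /^[ \t]*[A-Za-z_][A-Za-z_.0-9]* = / {
            emit()
            line = $0
            sub(/^[ \t]+/, "", line)
            name = line;  sub(/ = .*/, "", name)
            value = line; sub(/^[^=]* = /, "", value)
            next
        }
        # Any other line is treated as a continuation of the current value.
        name != "" {
            line = $0
            sub(/^[ \t]+/, "", line)
            value = value line
        }
        END { emit() }
    '
}
```

Run as `qstat -f 2.medusa.dicea.unifi.it | qstat_fields job_state exec_host comment`, this should print just those three attributes, with the wrapped comment joined back onto one line.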


I couldn't find the error message ("comment = Not Running - PBS Error: 
Execution server rejected request MSG=send failed, STARTING") anywhere, 
neither in the guides nor in the forums. I submitted the job with 
"qsub script", where script is:

#PBS -l nodes=2,walltime=00:05:00
#PBS -e script.err
#PBS -o script.out
#PBS -V
mpirun -np 2 ./hello


If I submit a single-processor job, it gets exactly the same error comment, 
but after a few seconds it runs normally and produces correct results. 
pbs_mom is installed and running on each node (for now I'm using only 
three compute nodes, including the server), and pbs_server and pbs_sched 
are running normally. I copied the following config file:

$clienthost     192.168.65.65    # note: IP address of host running pbs_server
$logevent       255
$restricted     192.168.65.65    # note: IP address of host running pbs_server
$usecp medusa.dicea.unifi.it:/home /home

into the mom_priv directory on each node, together with the correct 
server_name file. The server_priv directory on the master contains the 
correct nodes file. I checked the spool and undelivered directories and 
they are empty. So, what is wrong with this configuration?
Any ideas on how to solve this would be much appreciated!
Thank you
Lorenzo Campo