[torqueusers] Jobs don't start!

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Wed Oct 12 05:11:22 MDT 2005


Hi Lorenzo,

I think the reason you did not find the error message
	Execution server rejected request MSG=send failed

in the guides or on the forum is that it is a new error message format.

You should also find it, together with a time stamp, in the
server_logs directory on the host running pbs_server.
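
For example, something like this should show the server-side entries
for your job (just a sketch; I am assuming the default server home
/var/spool/torque and the date-named log file, so adjust the paths to
your installation, which may live somewhere else, e.g. under /opt/pbs):

	# path assumes the default TORQUE server home; adjust for your install
	grep '2.medusa' /var/spool/torque/server_logs/20051012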

If your machines are time-synchronized, you will probably also find a
corresponding error message in the mom_logs directory on the compute
node. (According to the qstat output, the best guess would be the node
medusa001.) You may be out of luck finding the job number in the
mom_logs file, but you might find a ';req_reject;' line that says more
about why the pbs_mom did not accept the job. You should also look at
the syslog output, presumably in the file /var/log/messages, which may
contain some explanation as well.
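
On medusa001, something like the following might turn up the rejection
and any related syslog entries (again a sketch; I am assuming the
default mom spool directory /var/spool/torque and the date-named log
file, so adjust the paths to your installation):

	# paths assume the default TORQUE spool; adjust for your install
	grep ';req_reject;' /var/spool/torque/mom_logs/20051012
	grep pbs_mom /var/log/messages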

I got the 'Execution server rejected request MSG=send failed' message
when a node had disk problems.

I have also tried to help Zwika Galant with a preemption problem
(reported on the mauiusers list on September 28th), which has so far
boiled down to the pbs_server giving the same message as above while
the pbs_mom says
	pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from PBS_Server at creambo
when Maui tries to start the job for the second time (the first
attempt produces no errors).

We see the same error message in all three cases, but I do not know
whether there are further connections between them. Anyway, I wanted
to tell you that the error also appears in two other installations,
and I am curious to see how your case will be explained.

It is very interesting that your one-processor jobs start only on the
second attempt. What does the pbs_mom (and perhaps syslog) say about the
first attempt?
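
One way to catch it would be to watch the mom log on medusa001 while
you resubmit the job from the server (same assumption about the spool
path as above):

	# run on medusa001 while resubmitting the job; path is an assumption
	tail -f /var/spool/torque/mom_logs/20051012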

-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se
   +46 706 49 55 35
   +46 13 28 26 24


lorenzo118 at interfree.it said:
> I recently installed torque 1.2.0p6 on my Linux cluster (16 Pentium 4
> 3.0 GHz nodes with Fedora Core 3), following *EXACTLY* what is written
> in the QuickStart Guide. Everything is OK until I try to run a
> multi-processor job: it never starts, it just stays in the queue. If I
> type qstat -f I obtain the following output:
> 
> Job Id: 2.medusa.dicea.unifi.it
>      Job_Name = script
>      Job_Owner = fcapa at medusa.dicea.unifi.it
>      job_state = Q
>      queue = batch
>      server = medusa.dicea.unifi.it
>      Checkpoint = u
>      ctime = Wed Oct 12 11:28:45 2005
>      Error_Path = medusa.dicea.unifi.it:/home/fcapa/script.err
>      exec_host = medusa001.dicea.unifi.it/0+medusa000.dicea.unifi.it/0
>      Hold_Types = n
>      Join_Path = n
>      Keep_Files = n
>      Mail_Points = a
>      mtime = Wed Oct 12 11:41:48 2005
>      Output_Path = medusa.dicea.unifi.it:/home/fcapa/script.out
>      Priority = 0
>      qtime = Wed Oct 12 11:28:45 2005
>      Rerunable = True
>      Resource_List.nodect = 2
>      Resource_List.nodes = 2
>      Resource_List.walltime = 00:05:00
>      Variable_List = PBS_O_HOME=/home/fcapa,PBS_O_LANG=en_US.UTF-8,
>          PBS_O_LOGNAME=fcapa,
>          PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
>          in:/home/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/
>          opt/env-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/o
>          pt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/op
>          t/pbs/bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,
>          PBS_O_MAIL=/var/spool/mail/fcapa,PBS_O_SHELL=/bin/bash,
>          PBS_O_HOST=medusa.dicea.unifi.it,PBS_O_WORKDIR=/home/fcapa,
>          MODULE_VERSION_STACK=3.1.6,
>          MANPATH=:/opt/modules/default/man:/opt/kernel_picker/man:/opt/env-swit
>          cher/man:/opt/mpich-1.2.5.10-ch_p4-gcc/man:/usr/share/man:/usr/man:/usr
>          /local/share/man:/usr/local/man:/usr/X11R6/man:/opt/pvm3/man,
>          HOSTNAME=medusa.dicea.unifi.it,PVM_RSH=ssh,
>          _MODULESBEGINENV_=/home/fcapa/.modulesbeginenv,TERM=xterm,
>          SHELL=/bin/bash,HISTSIZE=1000,SSH_CLIENT=::ffff:150.217.9.147 3199 22,
>          MODULE_SHELL=sh,OLDPWD=/home/fcapa,SSH_TTY=/dev/pts/6,
>          MODULE_OSCAR_USER=fcapa,USER=fcapa,
>          LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:
>          cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00
>          ;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00
>          ;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;
>          31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*
>          .cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35
>          :*.png=00;35:*.tif=00;35:,LD_LIBRARY_PATH=:/home/lcampo/rams/RAMS/test,
>          ENV=/home/fcapa/.bashrc,OSCAR_HOME=/opt/oscar,PVM_ROOT=/opt/pvm3,
>          PVM_ARCH=LINUX,MODULE_VERSION=3.1.6,MAIL=/var/spool/mail/fcapa,
>          PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/ho
>          me/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/opt/en
>          v-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/opt/pvm
>          3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/opt/pbs/
>          bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,INPUTRC=/etc/inputrc,
>          PWD=/home/fcapa,
>          _LMFILES_=/opt/modules/oscar-modulefiles/kernel_picker/1.4.1.3:/opt/en
>          v-switcher/share/env-switcher/mpi/mpich-ch_p4-gcc-1.2.5.10:/opt/modules
>          /oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/defau
>          lt-manpath/1.0.1:/opt/modules/oscar-modulefiles/pvm/3.4.4+11:/opt/modul
>          es/modulefiles/oscar-modules/1.0.5,LANG=en_US.UTF-8,
>          MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-mod
>          ulefiles:/opt/modules/version:/opt/modules/$MODULE_VERSION/modulefiles:
>          /opt/modules/modulefiles:,
>          LOADEDMODULES=kernel_picker/1.4.1.3:mpi/mpich-ch_p4-gcc-1.2.5.10:switc
>          her/1.0.13:default-manpath/1.0.1:pvm/3.4.4+11:oscar-modules/1.0.5,
>          SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=1,
>          HOME=/home/fcapa,LOGNAME=fcapa,
>          SSH_CONNECTION=::ffff:150.217.9.147 3199 ::ffff:150.217.9.65 22,
>          MODULESHOME=/opt/modules/3.1.6,LESSOPEN=|/usr/bin/lesspipe.sh %s,
>          G_BROKEN_FILENAMES=1,_=/usr/local/bin/qsub,PBS_O_QUEUE=batch
>      comment = Not Running - PBS Error: Execution server rejected request
>          MSG=send failed, STARTING
>      etime = Wed Oct 12 11:28:45 2005
> 
> 
> I couldn't find the error message ("comment = Not Running - PBS Error:
> Execution server rejected request MSG=send failed, STARTING") anywhere,
> neither in the guides nor on the forum. I typed "qsub script" to submit
> the job, where script is:
> 
> #PBS -l nodes=2,walltime=00:05:00
> #PBS -e script.err
> #PBS -o script.out
> #PBS -V
> mpirun -np 2 ./hello
> 
> 
> If I submit a single-processor job it gives exactly the same error
> comment, but after a few seconds it runs normally and produces correct
> results. I installed and started pbs_mom normally on each node (for now
> I'm using only three compute nodes, including the server); pbs_server
> and pbs_sched are running normally. I copied the following config file:
> 
> $clienthost     192.168.65.65    # note: IP address of host running pbs_server
> $logevent       255
> $restricted     192.168.65.65    # note: IP address of host running pbs_server
> $usecp medusa.dicea.unifi.it:/home /home
> 
> into the mom_priv directory on each node, along with the right
> server_name file on each node. In the server_priv directory of the
> master there is the correct nodes file. I checked the spool and
> undelivered directories and they are empty. So, what's wrong with this
> configuration?
> Please give me some ideas to solve this situation!
> Thank you
> Lorenzo Campo