[torqueusers] Jobs don't start!

Lorenzo Campo lorenzo118 at interfree.it
Wed Oct 12 06:51:29 MDT 2005


Hi Lennart,
thank you for the answer. In fact I made only one attempt with the
single-processor job, but while I waited for it to start I checked several
times with qstat -f. The first few times I saw the error comment, the last
few I didn't, and the job ran with no problems.
I checked, as you suggested, the mom_logs file on medusa001 and I didn't
find any 'req_reject' line. The whole file is:

10/12/2005 11:19:47;0002;   pbs_mom;Svr;Log;Log opened
10/12/2005 11:19:47;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
10/12/2005 11:19:47;0002;   pbs_mom;Svr;pbs_mom;Is down
10/12/2005 11:19:47;0002;   pbs_mom;Svr;Log;Log closed
10/12/2005 11:24:57;0002;   pbs_mom;Svr;Log;Log opened
10/12/2005 11:24:57;0002;   pbs_mom;Svr;restricted;192.168.65.65
10/12/2005 11:24:57;0002;   pbs_mom;Svr;usecp;medusa.dicea.unifi.it:/home /home
10/12/2005 11:24:57;0002;   pbs_mom;n/a;initialize;independent
10/12/2005 11:24:57;0002;   pbs_mom;Svr;pbs_mom;Is up
10/12/2005 11:24:57;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:26:41;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:26:41;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:26:53;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:26:53;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:50:41;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:50:41;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:50:49;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:50:49;0002;   pbs_mom;n/a;mom_main;hello sent to server


while on medusa000 (which is also the Torque server) the same file is:

10/12/2005 11:26:23;0002;   pbs_mom;Svr;Log;Log opened
10/12/2005 11:26:23;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
10/12/2005 11:26:23;0002;   pbs_mom;Svr;pbs_mom;Is down
10/12/2005 11:26:23;0002;   pbs_mom;Svr;Log;Log closed
10/12/2005 11:26:32;0002;   pbs_mom;Svr;Log;Log opened
10/12/2005 11:26:32;0002;   pbs_mom;Svr;restricted;192.168.65.65
10/12/2005 11:26:32;0002;   pbs_mom;Svr;usecp;medusa.dicea.unifi.it:/home /home
10/12/2005 11:26:32;0002;   pbs_mom;n/a;initialize;independent
10/12/2005 11:26:32;0002;   pbs_mom;Svr;pbs_mom;Is up
10/12/2005 11:26:32;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:26:41;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:26:41;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:26:53;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:26:53;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:1023
10/12/2005 11:26:53;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:31:27;0001;   pbs_mom;Job;TMomFinalizeJob3;job 3.medusa.dicea.unifi.it started, pid = 19233
10/12/2005 11:31:28;0080;   pbs_mom;Job;3.medusa.dicea.unifi.it;scan_for_terminated: job 3.medusa.dicea.unifi.it task 1 terminated, sid 19233
10/12/2005 11:31:28;0008;   pbs_mom;Job;3.medusa.dicea.unifi.it;Terminated
10/12/2005 11:50:41;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:50:41;0002;   pbs_mom;n/a;mom_main;hello sent to server
10/12/2005 11:50:49;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:50:49;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:1023
10/12/2005 11:50:49;0002;   pbs_mom;n/a;mom_main;hello sent to server
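
For short logs like these, the check for 'req_reject' records and a tally of
the "End of File" sources can be scripted. What follows is only a minimal
sketch and not part of Torque: the function names (parse_mom_log,
find_rejects, eof_sources) are made up for illustration, and the field layout
(timestamp;event code;daemon;class;name;message) is an assumption read off
the excerpts in this thread. It also assumes each record sits on a single
line, so mail-wrapped lines have to be rejoined before parsing. In the im_eof
lines the number 15001 follows an IP address and a colon, so the sketch treats
it as a port number, separate from any "Reject reply code" that a req_reject
record would carry.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log excerpts in this thread:
#   timestamp;event code;daemon;object class;object name;message
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}$")
ADDR = re.compile(r"from addr (\d+\.\d+\.\d+\.\d+):(\d+)")


def parse_mom_log(text):
    """Split pbs_mom log text into one dict per complete record."""
    records = []
    for line in text.splitlines():
        fields = line.split(";", 5)
        if len(fields) != 6 or not TIMESTAMP.match(fields[0]):
            continue  # skip blank, wrapped, or otherwise incomplete lines
        stamp, event, daemon, obj_class, name, message = (f.strip() for f in fields)
        records.append({
            "time": stamp, "event": event, "daemon": daemon,
            "class": obj_class, "name": name, "message": message,
        })
    return records


def find_rejects(records):
    """Return the records whose object name is 'req_reject'."""
    return [r for r in records if r["name"] == "req_reject"]


def eof_sources(records):
    """Count im_eof records per (ip, port) pair they were reported from."""
    counts = Counter()
    for r in records:
        match = ADDR.search(r["message"])
        if r["message"].startswith("im_eof") and match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Three records copied from the medusa000 log above.
sample = """\
10/12/2005 11:26:53;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:15001
10/12/2005 11:26:53;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.65.65:1023
10/12/2005 11:26:53;0002;   pbs_mom;n/a;mom_main;hello sent to server
"""

records = parse_mom_log(sample)
print("req_reject records:", find_rejects(records))
for (ip, port), n in sorted(eof_sources(records).items()):
    print(f"im_eof from {ip} port {port}: {n}")
```

Running the same functions over the full mom_logs file of each node would show
at a glance whether any req_reject record exists and which address and port
pairs the im_eof messages come from.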

I submitted the same script (requesting one or two processors) no more than
4-5 times, so the files are very short. I noticed this "End of File" message
and the code 15001 (which should mean 'Unknown Job Id', if I understood
correctly), but I can't tell whether it is an error, a warning, or just
normal, given that the following line says 'hello sent to server'. So I
can't work out what exactly is blocking the job from running.
Any idea?
Lorenzo Campo



At 13.11 12/10/2005, you wrote:
>Hi Lorenzo,
>
>I think that the reason you did not find the error message
>         Execution server rejected request MSG=send failed
>
>in guides or forum, is that it is a new error message format.
>
>You should find it also in the server_logs directory on the
>host running pbs_server, together with a time stamp.
>
>If your machines are time synchronized, you would probably
>also find a corresponding error message in the mom_logs
>directory on the compute node. (According to the qstat
>output, the best guess would be on the node medusa001.)
>You might be out of luck finding the job number in the
>mom_logs file, but you might find a ';req_reject;' line
>telling something more about why the pbs_mom did not accept
>the job. You should also look into the syslog output,
>presumably in file /var/log/messages, where you also may find
>some explanation.
>
>I got the 'Execution server rejected request MSG=send failed' message
>when a node had disk problems.
>
>I have also tried to help Zwika Galant with a preemption problem (that
>was reported on the mauiusers list September 28th), that until now has
>boiled down to a problem where the pbs_server gives the same message
>as above and the pbs_mom says
>         pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from PBS_Server at creambo
>when Maui tries to start the job for the second time (the first time there
>are no errors).
>
>We have the same error log in all three cases, but I do not know if there
>are more connections between them. Anyway I wanted to tell you that the
>error code appears also in two other installations and I am curious about
>how your case will be explained.
>
>It is very interesting that your one-processor jobs start only on the
>second attempt. What does the pbs_mom (and perhaps syslog) say about the
>first attempt?
>
>-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
>    National Supercomputer Centre in Linkoping, Sweden
>    http://www.nsc.liu.se
>    +46 706 49 55 35
>    +46 13 28 26 24
>
>
>lorenzo118 at interfree.it said:
> > I recently installed torque 1.2.0p6 on my Linux cluster (16 Pentium 4 3.0
> > GHz machines with Fedora Core 3) following *EXACTLY* what is written in the
> > QuickStart Guide. Everything is OK until I try to run a multi-processor
> > job: it never starts and always remains in the queue. If I type qstat -f I
> > get the following output:
> >
> > Job Id: 2.medusa.dicea.unifi.it
> >      Job_Name = script
> >      Job_Owner = fcapa at medusa.dicea.unifi.it
> >      job_state = Q
> >      queue = batch
> >      server = medusa.dicea.unifi.it
> >      Checkpoint = u
> >      ctime = Wed Oct 12 11:28:45 2005
> >      Error_Path = medusa.dicea.unifi.it:/home/fcapa/script.err
> >      exec_host = medusa001.dicea.unifi.it/0+medusa000.dicea.unifi.it/0
> >      Hold_Types = n
> >      Join_Path = n
> >      Keep_Files = n
> >      Mail_Points = a
> >      mtime = Wed Oct 12 11:41:48 2005
> >      Output_Path = medusa.dicea.unifi.it:/home/fcapa/script.out
> >      Priority = 0
> >      qtime = Wed Oct 12 11:28:45 2005
> >      Rerunable = True
> >      Resource_List.nodect = 2
> >      Resource_List.nodes = 2
> >      Resource_List.walltime = 00:05:00
> >      Variable_List = PBS_O_HOME=/home/fcapa,PBS_O_LANG=en_US.UTF-8,
> >          PBS_O_LOGNAME=fcapa,
> > 
> PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
> > 
> in:/home/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/
> > 
> opt/env-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/o
> > 
> pt/pvm3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/op
> >          t/pbs/bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,
> >          PBS_O_MAIL=/var/spool/mail/fcapa,PBS_O_SHELL=/bin/bash,
> >          PBS_O_HOST=medusa.dicea.unifi.it,PBS_O_WORKDIR=/home/fcapa,
> >          MODULE_VERSION_STACK=3.1.6,
> > 
> MANPATH=:/opt/modules/default/man:/opt/kernel_picker/man:/opt/env-swit
> > 
> cher/man:/opt/mpich-1.2.5.10-ch_p4-gcc/man:/usr/share/man:/usr/man:/usr
> >          /local/share/man:/usr/local/man:/usr/X11R6/man:/opt/pvm3/man,
> >          HOSTNAME=medusa.dicea.unifi.it,PVM_RSH=ssh,
> >          _MODULESBEGINENV_=/home/fcapa/.modulesbeginenv,TERM=xterm,
> >          SHELL=/bin/bash,HISTSIZE=1000,SSH_CLIENT=::ffff:150.217.9.147 
> 3199 22,
> >          MODULE_SHELL=sh,OLDPWD=/home/fcapa,SSH_TTY=/dev/pts/6,
> >          MODULE_OSCAR_USER=fcapa,USER=fcapa,
> > 
> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:
> > 
> cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00
> > 
> ;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00
> > 
> ;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;
> > 
> 31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*
> > 
> .cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35
> > 
> :*.png=00;35:*.tif=00;35:,LD_LIBRARY_PATH=:/home/lcampo/rams/RAMS/test,
> >          ENV=/home/fcapa/.bashrc,OSCAR_HOME=/opt/oscar,PVM_ROOT=/opt/pvm3,
> >          PVM_ARCH=LINUX,MODULE_VERSION=3.1.6,MAIL=/var/spool/mail/fcapa,
> > 
> PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/ho
> > 
> me/intel_fc_80/lib:/home/intel_fc_80/bin:/opt/kernel_picker/bin:/opt/en
> > 
> v-switcher/bin:/opt/mpich-1.2.5.10-ch_p4-gcc/bin:/opt/pvm3/lib:/opt/pvm
> > 
> 3/lib/LINUX:/opt/pvm3/bin/LINUX:/usr/local/apitest:/opt/c3-4/:/opt/pbs/
> >          bin:/opt/pbs/lib/xpbs/bin:/home/fcapa/bin,INPUTRC=/etc/inputrc,
> >          PWD=/home/fcapa,
> > 
> _LMFILES_=/opt/modules/oscar-modulefiles/kernel_picker/1.4.1.3:/opt/en
> > 
> v-switcher/share/env-switcher/mpi/mpich-ch_p4-gcc-1.2.5.10:/opt/modules
> > 
> /oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/defau
> > 
> lt-manpath/1.0.1:/opt/modules/oscar-modulefiles/pvm/3.4.4+11:/opt/modul
> >          es/modulefiles/oscar-modules/1.0.5,LANG=en_US.UTF-8,
> > 
> MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-mod
> > 
> ulefiles:/opt/modules/version:/opt/modules/$MODULE_VERSION/modulefiles:
> >          /opt/modules/modulefiles:,
> > 
> LOADEDMODULES=kernel_picker/1.4.1.3:mpi/mpich-ch_p4-gcc-1.2.5.10:switc
> >          her/1.0.13:default-manpath/1.0.1:pvm/3.4.4+11:oscar-modules/1.0.5,
> >          SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=1,
> >          HOME=/home/fcapa,LOGNAME=fcapa,
> >          SSH_CONNECTION=::ffff:150.217.9.147 3199 ::ffff:150.217.9.65 22,
> >          MODULESHOME=/opt/modules/3.1.6,LESSOPEN=|/usr/bin/lesspipe.sh %s,
> >          G_BROKEN_FILENAMES=1,_=/usr/local/bin/qsub,PBS_O_QUEUE=batch
> >      comment = Not Running - PBS Error: Execution server rejected request
> >          MSG=send failed, STARTING
> >      etime = Wed Oct 12 11:28:45 2005
> >
> >
> > I couldn't find the error message ("comment = Not Running - PBS Error:
> > Execution server rejected request MSG=send failed, STARTING") anywhere,
> > neither in the guides nor in the forums. I typed "qsub script" to submit
> > the job, where script is:
> >
> > #PBS -l nodes=2,walltime=00:05:00
> > #PBS -e script.err
> > #PBS -o script.out
> > #PBS -V
> > mpirun -np 2 ./hello
> >
> >
> > If I submit a single-processor job it gives exactly the same error comment,
> > but after some seconds it runs normally and produces the right results. I
> > installed and started pbs_mom normally on each node (for now I'm using only
> > three compute nodes, including the server), pbs_server and pbs_sched are
> > running normally, and I copied the following config file:
> >
> > $clienthost     192.168.65.65                      # note: IP address of host running pbs_server
> > $logevent       255
> > $restricted     192.168.65.65                     # note: IP address of host running pbs_server
> > $usecp medusa.dicea.unifi.it:/home /home
> >
> > into the mom_priv directory on each node, together with the right
> > server_name file on each node. The server_priv directory on the master
> > contains the correct nodes file. I checked the spool and undelivered
> > directories and they are empty. So, what's wrong with this configuration?
> > Please give me some idea to solve this situation!
> > Thank you
> > Lorenzo Campo



