[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6

Troy Baer troy at osc.edu
Thu Sep 15 13:24:39 MDT 2005


On Thu, 2005-09-15 at 21:00 +0200, Ole Holm Nielsen wrote:
> Garrick Staples wrote:
> > Can you repeat that with a single-node, single-proc job please?
> > 
> > How is the job requested?  Any special limits like mem, vmem, file,
> > etc.?  Is -d or -D used?
> 
> I don't do anything special.
> 
> Here is the result, and on 1 node pbsdsh works:
> 
> # qsub -I -l nodes=1:d510
> qsub: waiting for job 154.ymer.fysik.dtu.dk to start
> qsub: job 154.ymer.fysik.dtu.dk ready
> 
> [ohnielse at n469 ~]$ pbsdsh -v hostname
> pbsdsh: spawned task 0
> pbsdsh: waiting on 1 spawned and 0 obits
> n469.dcsc.fysik.dtu.dk
> spawn event returned: 0
> pbsdsh: sending obit for task 2
> pbsdsh: waiting on 0 spawned and 1 obits
> obit event returned: 0
> pbsdsh: task 0 exit status 0
> 
> However, on 3 nodes it still fails:
> 
> # qsub -I -l nodes=3:d510
> qsub: waiting for job 155.ymer.fysik.dtu.dk to start
> qsub: job 155.ymer.fysik.dtu.dk ready
> 
> [ohnielse at n469 ~]$ pbsdsh -v hostname
> pbsdsh: spawned task 0
> pbsdsh: spawned task 1
> pbsdsh: spawned task 2
> pbsdsh: waiting on 3 spawned and 0 obits
> spawn event returned: 0
> error 17000 on spawn
> pbsdsh: waiting on 2 spawned and 0 obits
> spawn event returned: 2
> error 15010 on spawn
> pbsdsh: waiting on 1 spawned and 0 obits
> spawn event returned: 1
> error 15010 on spawn

Both of those error codes mean "system error", which AFAICT actually
means "I have no idea what just happened".

$ grep 17000 /usr/local/pbs/include/*.h
#define    TM_ESYSTEM              17000

$ grep 15010 /usr/local/pbs/include/*.h
#define PBSE_SYSTEM 15010    /* system error occurred */
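
If you end up looking up more of these codes, a small helper saves retyping the grep. This is only a sketch: the function name pbs_errname is made up for illustration, and the default directory is just the include path from my grep above, so adjust it to wherever your headers live.

```shell
# Hypothetical helper (not part of TORQUE): print the macro name that a
# numeric PBS/TM error code is #defined as.
# Usage: pbs_errname CODE [INCLUDE_DIR]
pbs_errname() {
    code="$1"
    # Default taken from the grep above; an assumption about your install.
    dir="${2:-/usr/local/pbs/include}"
    # Lines look like "#define NAME VALUE ...", so awk sees the name in
    # field 2 and the value in field 3.
    grep -h '#define' "$dir"/*.h | awk -v c="$code" '$3 == c { print $2 }'
}
```

For example, "pbs_errname 17000" looks the code up in the default directory, and "pbs_errname 15010 /some/other/include" looks in a directory you name.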

You might want to fire up tcpdump or something like that on one of the
subordinate nodes and watch what goes over the network when you run one
of these pbsdsh commands.
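
A rough sketch of how I would narrow the capture, assuming your pbs_mom daemons use the stock ports (15002 for the mom service, 15003 for the mom resource manager). Those port numbers and the mom_filter name are my assumptions, not something taken from your setup, so check them against your site's configuration first.

```shell
# Assumption: stock PBS/TORQUE ports. 15002 = pbs_mom service,
# 15003 = pbs_mom resource manager. Change these if your site overrides them.
MOM_PORTS="15002 15003"

# Hypothetical helper: build a pcap filter expression such as
# "port 15002 or port 15003" (matches both TCP and UDP on those ports).
mom_filter() {
    filter=""
    for p in $MOM_PORTS; do
        filter="${filter:+$filter or }port $p"
    done
    printf '%s\n' "$filter"
}

# Run as root on a subordinate node, then start pbsdsh from the job's
# first node and watch what arrives:
#   tcpdump -n -i any "$(mom_filter)"
```

Filtering on the port alone, rather than on tcp or udp specifically, keeps the capture from missing traffic if the moms talk to each other over a protocol you did not expect.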

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701


