[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Sep 15 11:01:18 MDT 2005


Garrick Staples wrote:
>> Question:  Is Torque's LAM-MPI "tm" boot schema supposed to be
>> working correctly with torque-1.2.0p6 ?  I'd love to get it to
>> work because of the performance improvements promised in the
>> LAM-MPI documentation.
>  
> It absolutely should be working.  Can you try something really simple
> like 'pbsdsh hostname' in your job?  Optionally, 'pbsdsh -v hostname'.
> If it is failing, check the mom logs with an increased loglevel.

The result is very interesting, showing obvious errors:

$ pbsdsh -v hostname
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawned task 2
pbsdsh: waiting on 3 spawned and 0 obits
spawn event returned: 0
error 17000 on spawn
pbsdsh: waiting on 2 spawned and 0 obits
spawn event returned: 1
error 15010 on spawn
pbsdsh: waiting on 1 spawned and 0 obits
spawn event returned: 2
error 15010 on spawn
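
For anyone comparing runs, here is a minimal sketch that pairs each spawn event in the output above with the error code reported on the following line. It assumes only the line layout visible in this one transcript; the function name `spawn_errors` is made up for illustration, and the script does not try to interpret what the codes mean.

```python
import re

# Verbatim 'pbsdsh -v hostname' output from the job above.
PBSDSH_OUTPUT = """\
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawned task 2
pbsdsh: waiting on 3 spawned and 0 obits
spawn event returned: 0
error 17000 on spawn
pbsdsh: waiting on 2 spawned and 0 obits
spawn event returned: 1
error 15010 on spawn
pbsdsh: waiting on 1 spawned and 0 obits
spawn event returned: 2
error 15010 on spawn
"""


def spawn_errors(text):
    """Map each spawn event number to the error code printed right after it."""
    errors = {}
    event = None
    for line in text.splitlines():
        m = re.match(r"spawn event returned: (\d+)", line)
        if m:
            event = int(m.group(1))
            continue
        m = re.match(r"error (\d+) on spawn", line)
        if m and event is not None:
            errors[event] = int(m.group(1))
            event = None
    return errors


if __name__ == "__main__":
    print(spawn_errors(PBSDSH_OUTPUT))
```

On this transcript all three spawn events end up with an error code attached, the first one differing from the other two.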

I also tried pbsdsh 'echo $PATH', as seen in the logs below,
with the same failure.  I suppose these errors mean that
the problem lies in Torque itself rather than in LAM-MPI.

The master node's mom_logs contain:

09/15/2005 18:45:18;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=14
09/15/2005 18:45:19;0001;   pbs_mom;Job;TMomFinalizeJob3;job 152.ymer.fysik.dtu.dk started, pid = 24224
09/15/2005 18:45:19;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:19;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;Job Modified at request of PBS_Server at ymer.dcsc.fysik.dtu.dk
09/15/2005 18:45:38;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:46:08;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:48:38;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, echo $PATH job exec failure, after files staged, no retry
09/15/2005 18:48:51;0080;   pbs_mom;Job;152.ymer.fysik.dtu.dk;scan_for_terminated: job 152.ymer.fysik.dtu.dk task 1 terminated, sid 24224
09/15/2005 18:48:51;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;Terminated
09/15/2005 18:49:00;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=12

The mom_logs on one of the slave nodes contain:

09/15/2005 18:45:18;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;JOIN JOB as node 1
09/15/2005 18:45:45;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:45:45;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job
  '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:46:14;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:46:14;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job
  '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:48:46;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, echo $PATH job exec failure, after files staged, no retry
09/15/2005 18:48:46;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job
  '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:48:51;0100;   pbs_mom;Job;152.ymer.fysik.dtu.dk;kill_job received
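
To pull the failed-spawn entries out of a longer mom log, something like the sketch below may help. It assumes each record is a single line of six semicolon-separated fields (timestamp, event code, daemon, class, object, message), which is only inferred from the excerpts above; the field names and the helpers `parse_record` and `failures` are invented for illustration, and the wrapped SPAWN_TASK record has been rejoined onto one line in the sample.

```python
# A few records copied from the slave node's log above, with mail-wrapped
# lines rejoined so that each record sits on one line (an assumption).
SAMPLE = [
    "09/15/2005 18:45:18;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;JOIN JOB as node 1",
    "09/15/2005 18:45:45;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, "
    "hostname job exec failure, after files staged, no retry",
    "09/15/2005 18:45:45;0008;   pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request "
    "'SPAWN_TASK' from 10.1.130.219:1023 for job '152.ymer.fysik.dtu.dk' (cannot start task)",
    "09/15/2005 18:48:46;0001;   pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, "
    "echo $PATH job exec failure, after files staged, no retry",
    "09/15/2005 18:48:51;0100;   pbs_mom;Job;152.ymer.fysik.dtu.dk;kill_job received",
]


def parse_record(line):
    """Split one log line into named fields; the message may itself contain ';'."""
    timestamp, code, daemon, cls, obj, message = line.split(";", 5)
    return {
        "timestamp": timestamp,
        "code": code,
        "daemon": daemon.strip(),
        "class": cls,
        "object": obj,
        "message": message.strip(),
    }


def failures(lines):
    """Return the records that report a task that could not be started."""
    records = (parse_record(line) for line in lines)
    return [
        r for r in records
        if r["message"].startswith("task not started")
        or "cannot start task" in r["message"]
    ]


if __name__ == "__main__":
    for rec in failures(SAMPLE):
        print(rec["timestamp"], rec["code"], rec["message"])
```

Splitting with a maximum of five separators keeps any semicolons inside the message text intact.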


Question: What do these errors mean, and what might be wrong?

Could you also remind me how to increase the MOM log level,
and how to make pbs_mom reread its config file?

Thanks a lot,
Ole


