[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Sep 15 11:01:18 MDT 2005
Garrick Staples wrote:
>> Question: Is Torque's LAM-MPI "tm" boot schema supposed to be
>> working correctly with torque-1.2.0p6? I'd love to get it to
>> work because of the performance improvements promised in the
>> LAM-MPI documentation.
>
> It absolutely should be working. Can you try something really simple
> like 'pbsdsh hostname' in your job? Optionally, 'pbsdsh -v hostname'.
> If it is failing, check the mom logs with an increased loglevel.
The result is very interesting, showing obvious errors:
$ pbsdsh -v hostname
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawned task 2
pbsdsh: waiting on 3 spawned and 0 obits
spawn event returned: 0
error 17000 on spawn
pbsdsh: waiting on 2 spawned and 0 obits
spawn event returned: 1
error 15010 on spawn
pbsdsh: waiting on 1 spawned and 0 obits
spawn event returned: 2
error 15010 on spawn
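
To make the failure pattern easier to read, here is a minimal Python sketch that pairs each "spawn event returned" line with the "error ... on spawn" line that follows it. It assumes the number after "spawn event returned:" identifies the spawn event; the function name spawn_errors is hypothetical and not part of Torque.

```python
import re

# Output copied from the 'pbsdsh -v hostname' run above.
PBSDSH_OUTPUT = """\
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawned task 2
pbsdsh: waiting on 3 spawned and 0 obits
spawn event returned: 0
error 17000 on spawn
pbsdsh: waiting on 2 spawned and 0 obits
spawn event returned: 1
error 15010 on spawn
pbsdsh: waiting on 1 spawned and 0 obits
spawn event returned: 2
error 15010 on spawn
"""


def spawn_errors(text):
    """Map each spawn event index to the error code reported right after it."""
    errors = {}
    event = None
    for line in text.splitlines():
        m = re.match(r"spawn event returned: (\d+)", line)
        if m:
            event = int(m.group(1))
            continue
        m = re.match(r"error (\d+) on spawn", line)
        if m and event is not None:
            errors[event] = int(m.group(1))
            event = None
    return errors


print(spawn_errors(PBSDSH_OUTPUT))  # {0: 17000, 1: 15010, 2: 15010}
```

All three spawn events fail: the first one with error 17000, the other two with error 15010.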
I also tried pbsdsh 'echo $PATH', as seen in the logs below,
with the same bad result. I suppose these errors mean that
the problem is not related to LAM-MPI, but lies in Torque itself.
The master node's mom_logs contain:
09/15/2005 18:45:18;0100; pbs_mom;Req;;Type QueueJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100; pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100; pbs_mom;Req;;Type Commit request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:18;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=14
09/15/2005 18:45:19;0001; pbs_mom;Job;TMomFinalizeJob3;job 152.ymer.fysik.dtu.dk started, pid = 24224
09/15/2005 18:45:19;0100; pbs_mom;Req;;Type ModifyJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=11
09/15/2005 18:45:19;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;Job Modified at request of PBS_Server at ymer.dcsc.fysik.dtu.dk
09/15/2005 18:45:38;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:46:08;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:48:38;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, echo $PATH job exec failure, after files staged, no retry
09/15/2005 18:48:51;0080; pbs_mom;Job;152.ymer.fysik.dtu.dk;scan_for_terminated: job 152.ymer.fysik.dtu.dk task 1 terminated, sid 24224
09/15/2005 18:48:51;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;Terminated
09/15/2005 18:49:00;0100; pbs_mom;Req;;Type DeleteJob request received from PBS_Server at ymer.dcsc.fysik.dtu.dk, sock=12
The mom_logs on one of the slave nodes contain:
09/15/2005 18:45:18;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;JOIN JOB as node 1
09/15/2005 18:45:45;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:45:45;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:46:14;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, hostname job exec failure, after files staged, no retry
09/15/2005 18:46:14;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:48:46;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not started, echo $PATH job exec failure, after files staged, no retry
09/15/2005 18:48:46;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR: received request 'SPAWN_TASK' from 10.1.130.219:1023 for job '152.ymer.fysik.dtu.dk' (cannot start task)
09/15/2005 18:48:51;0100; pbs_mom;Job;152.ymer.fysik.dtu.dk;kill_job received
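
For anyone who wants to sift through such mom_logs mechanically, here is a rough Python sketch that rejoins entries a mail client may have wrapped across lines and splits each entry on its semicolons. The field names (stamp, code, daemon, kind, obj, message) are my own guesses from the layout of the lines above, not names taken from the Torque documentation, and unwrap and parse are hypothetical helpers.

```python
import re

# A few entries from the slave node's log, wrapped as a mail client might wrap them.
LOG = """\
09/15/2005 18:45:18;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;JOIN JOB
as node 1
09/15/2005 18:45:45;0001; pbs_mom;Job;152.ymer.fysik.dtu.dk;task not
started, hostname job exec failure, after files staged, no retry
09/15/2005 18:45:45;0008; pbs_mom;Job;152.ymer.fysik.dtu.dk;ERROR:
received request 'SPAWN_TASK' from 10.1.130.219:1023 for job
'152.ymer.fysik.dtu.dk' (cannot start task)
"""

# An entry starts with a timestamp such as "09/15/2005 18:45:18;".
STAMP = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def unwrap(text):
    """Join continuation lines onto the entry that precedes them."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def parse(entry):
    """Split one entry into six fields; the message may itself contain ';'."""
    names = ("stamp", "code", "daemon", "kind", "obj", "message")
    fields = [f.strip() for f in entry.split(";", 5)]
    return dict(zip(names, fields))


for entry in unwrap(LOG):
    rec = parse(entry)
    print(rec["stamp"], rec["code"], rec["message"])
```

Filtering the parsed records on the message text then makes it easy to line up the "task not started" entries on the master and slave nodes by timestamp.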
Question: What do these errors mean, and what might be wrong?
Could you also remind me how to increase the MOM log level,
and how to make pbs_mom reread its config file?
Thanks a lot,
Ole