[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine
Prakash Velayutham
velayups at email.uc.edu
Mon Dec 5 15:12:08 MST 2005
Garrick Staples wrote:
> On Wed, Nov 30, 2005 at 10:50:42AM -0500, Prakash Velayutham alleged:
>
>> Hi All,
>>
>> I have a cluster with 18 nodes. I have Torque-1.2.0p5 with MPICH-1.2.7
>> and mpiexec-0.8 running.
>> When I run an MPI application, after a random amount of time, I see that
>> the job gets killed.
>> The error given is "p0_7360: p4_error: interrupt SIGx: 15"
>>
>> When I replace Torque with OpenPBS-2.3.16, with everything else remaining
>> the same, the job completes just fine. Of course, I recompiled
>> mpiexec to use OpenPBS.
>>
>> Any thoughts please?
>>
>
> Sig 15 is SIGTERM, which means the job is over some sort of limit.
> There should be a corresponding log entry in the MOM log and an email to
> the user. Also check your scheduler log.
Hi,
That does not seem to give me a clue. Here is the server log for one job that
died that way.
Please note that no resource maximums are set on the server for the
queue the job is running on.
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type AuthenticateUser request
received from rjain at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type QueueJob request received
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type JobScript request received
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ReadyToCommit request
received from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type Commit request received
from rjain at ribosome.cchmc.org, sock=9
11/08/2005
11:23:32;0100;PBS_Server;Job;50645.ribosome.cchmc.org;enqueuing into
users, state 1 hop 1
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job
Queued at request of rjain at ribosome.cchmc.org, owner =
rjain at ribosome.cchmc.org, job name = test, queue = users
11/08/2005 11:23:32;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler
sent command new
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusNode request
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusQueue request
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ModifyJob request received
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job
Modified at request of Scheduler at ribosome.cchmc.org
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type RunJob request received
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run
at request of Scheduler at ribosome.cchmc.org
11/08/2005 11:23:33;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler
sent command recyc
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type AuthenticateUser request
received from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=10
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type StatusJob request received
from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=9
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type StatusJob request received
from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=9
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type AuthenticateUser request
received from rjain at ribosome.cchmc.org, sock=10
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type StatusServer request
received from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type StatusJob request received
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
11/08/2005
11:24:48;0010;PBS_Server;Job;50645.ribosome.cchmc.org;Exit_status=0
resources_used.cput=00:00:00 resources_used.mem=28460kb
resources_used.vmem=131892kb resources_used.walltime=00:01:16
11/08/2005
11:24:48;0100;PBS_Server;Job;50645.ribosome.cchmc.org;dequeuing from
users, state EXITING
11/08/2005 11:24:48;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler
sent command term
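For anyone who wants to pull the exit record out of a server log like the one above, here is a minimal sketch in Python. It assumes each record has six semicolon-separated fields (timestamp, event code, daemon, class, object id, message), which is inferred only from the excerpt in this post. The names PbsRecord, parse_record and usage_fields are made up for illustration and are not part of Torque.

```python
from typing import NamedTuple


class PbsRecord(NamedTuple):
    # Field names are hypothetical labels for the six columns seen above.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_id: str
    message: str


def parse_record(line: str) -> PbsRecord:
    # Split on the first five semicolons only; the message may contain more.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"not a PBS log record: {line!r}")
    return PbsRecord(*(p.strip() for p in parts))


def usage_fields(message: str) -> dict:
    # Collect the whitespace-separated key=value tokens of an exit record.
    return dict(tok.split("=", 1) for tok in message.split() if "=" in tok)


# The exit record for job 50645, with the mail wrapping undone.
line = ("11/08/2005 11:24:48;0010;PBS_Server;Job;50645.ribosome.cchmc.org;"
        "Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=28460kb "
        "resources_used.vmem=131892kb resources_used.walltime=00:01:16")
rec = parse_record(line)
fields = usage_fields(rec.message)
print(rec.obj_id, fields["Exit_status"], fields["resources_used.walltime"])
```

The lines have to be unwrapped before parsing, since the mail client broke each record across two or three lines.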
Here is the mom log:
11/08/2005 11:22:30;0001; pbs_mom;Job;TMomFinalizeJob3;job
50645.ribosome.cchmc.org started, pid = 2806
11/08/2005 11:22:31;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2,
sid 2866, cmd /bin/sh
11/08/2005 11:23:37;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2
with sig 9
11/08/2005 11:23:42;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2878 task 2
with sig 9
11/08/2005 11:23:46;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2904 task 2
with sig 9
11/08/2005 11:23:46;0080;
pbs_mom;Job;50645.ribosome.cchmc.org;scan_for_terminated: job
50645.ribosome.cchmc.org task 2 terminated, sid 2866
11/08/2005 11:23:46;0080;
pbs_mom;Job;50645.ribosome.cchmc.org;scan_for_terminated: job
50645.ribosome.cchmc.org task 1 terminated, sid 2806
11/08/2005 11:23:46;0008; pbs_mom;Job;50645.ribosome.cchmc.org;Terminated
Does not seem to help.
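In case it is useful for comparing the signals, here is a rough Python sketch that lists the kill_task entries from a MOM log excerpt and names each signal number. The regular expression is based only on the wording of the lines pasted above, and find_kills and signal_name are hypothetical helper names, not anything shipped with Torque.

```python
import re
import signal

# Pattern taken from the kill_task lines in the excerpt above (an assumption).
KILL_RE = re.compile(r"kill_task: killing pid (\d+) task (\d+) with sig (\d+)")


def signal_name(number: int) -> str:
    # Map a signal number to its symbolic name on the local platform.
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


def find_kills(log_text: str):
    # Collapse all whitespace so records wrapped by the mailer match again.
    flat = " ".join(log_text.split())
    return [(int(pid), int(task), int(sig))
            for pid, task, sig in KILL_RE.findall(flat)]


sample = """\
11/08/2005 11:23:37;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2
with sig 9
11/08/2005 11:23:42;0008;
pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2878 task 2
with sig 9
"""
for pid, task, sig in find_kills(sample):
    print(pid, task, signal_name(sig))
```

On Linux, signal 9 is SIGKILL and signal 15 is SIGTERM, so the MOM entries above record a different signal from the one in the p4_error message.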
In the syslog, just this one line repeats itself.
Nov 8 11:23:02 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:07 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:09 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:17 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:19 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:24 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:27 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:29 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:31 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:32 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:51 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:53 tyrosine kernel: eth1: freeing mc frame.
Nov 8 11:23:55 tyrosine kernel: eth1: freeing mc frame.
Any help?
Thanks,
Prakash