[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Prakash Velayutham velayups at email.uc.edu
Mon Dec 5 15:12:08 MST 2005


Garrick Staples wrote:
> On Wed, Nov 30, 2005 at 10:50:42AM -0500, Prakash Velayutham alleged:
>   
>> Hi All,
>>
>> I have a cluster with 18 nodes. I have Torque-1.2.0p5 with MPICH-1.2.7 
>> and mpiexec-0.8 running.
>> When I run an MPI application, after a random amount of time, I see that 
>> the job gets killed.
>> The error given is "p0_7360:  p4_error: interrupt SIGx: 15"
>>
>> When I replace Torque with OpenPBS-2.3.16, and everything else remaining 
>> the same, the job completes just fine. Of course, I recompiled 
>> mpiexec to use OpenPBS.
>>
>> Any thoughts please?
>>     
>
> Sig 15 is SIGTERM, which means the job is over some sort of limit.
> There should be a corresponding log entry in the MOM log and an email to
> the user.  Also check your scheduler log.
Hi,

That does not seem to give me a clue. Here is the server log for one job 
that died that way. Please note that no resource maximums are set on the 
server for the queue the job is running in.

11/08/2005 11:23:32;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from rjain at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type QueueJob request received 
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type JobScript request received 
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ReadyToCommit request 
received from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type Commit request received 
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:32;0100;PBS_Server;Job;50645.ribosome.cchmc.org;enqueuing 
into users, state 1 hop 1
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job 
Queued at request of rjain at ribosome.cchmc.org, owner = 
rjain at ribosome.cchmc.org, job name = test, queue = users
11/08/2005 11:23:32;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler 
sent command new
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusServer request 
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusNode request 
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type StatusQueue request 
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type SelStat request received 
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type ModifyJob request received 
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job 
Modified at request of Scheduler at ribosome.cchmc.org
11/08/2005 11:23:32;0100;PBS_Server;Req;;Type RunJob request received 
from Scheduler at ribosome.cchmc.org, sock=10
11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run 
at request of Scheduler at ribosome.cchmc.org
11/08/2005 11:23:33;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler 
sent command recyc
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=10
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type StatusJob request received 
from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=9
11/08/2005 11:23:33;0100;PBS_Server;Req;;Type StatusJob request received 
from rjain at tyrosine-mpi.bmicluster1.cchmc.org, sock=9
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from rjain at ribosome.cchmc.org, sock=10
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type StatusServer request 
received from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:23:34;0100;PBS_Server;Req;;Type StatusJob request received 
from rjain at ribosome.cchmc.org, sock=9
11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request 
received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
11/08/2005 11:24:48;0010;PBS_Server;Job;50645.ribosome.cchmc.org;Exit_status=0 
resources_used.cput=00:00:00 resources_used.mem=28460kb 
resources_used.vmem=131892kb resources_used.walltime=00:01:16
11/08/2005 11:24:48;0100;PBS_Server;Job;50645.ribosome.cchmc.org;dequeuing 
from users, state EXITING
11/08/2005 11:24:48;0040;PBS_Server;Svr;ribosome.cchmc.org;Scheduler 
sent command term
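
For anyone who wants to pull the numbers out of the job exit record above, here is a minimal Python sketch. The helper names (parse_exit_record, hms_to_seconds) are my own and not part of Torque; the message string is copied from the log above with the mail wrapping undone.

```python
import re


def parse_exit_record(message):
    """Split a 'key=value key=value ...' exit message into a dict of strings."""
    return dict(re.findall(r"(\S+?)=(\S+)", message))


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Message text of the JobObituary record in the server log above.
message = (
    "Exit_status=0 resources_used.cput=00:00:00 "
    "resources_used.mem=28460kb resources_used.vmem=131892kb "
    "resources_used.walltime=00:01:16"
)

record = parse_exit_record(message)
print(record["Exit_status"])
print(hms_to_seconds(record["resources_used.walltime"]))
```

For this job the record shows Exit_status=0 and a walltime of 76 seconds.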

Here is the mom log:

11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 
50645.ribosome.cchmc.org started, pid = 2806
11/08/2005 11:22:31;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;start_process: 
task started, tid 2, sid 2866, cmd /bin/sh
11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: 
killing pid 2877 task 2 with sig 9
11/08/2005 11:23:42;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: 
killing pid 2878 task 2 with sig 9
11/08/2005 11:23:46;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: 
killing pid 2904 task 2 with sig 9
11/08/2005 11:23:46;0080;   pbs_mom;Job;50645.ribosome.cchmc.org;scan_for_terminated: 
job 50645.ribosome.cchmc.org task 2 terminated, sid 2866
11/08/2005 11:23:46;0080;   pbs_mom;Job;50645.ribosome.cchmc.org;scan_for_terminated: 
job 50645.ribosome.cchmc.org task 1 terminated, sid 2806
11/08/2005 11:23:46;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;Terminated
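
A rough Python sketch for picking the kill_task events out of a MOM log like the one above follows. It assumes each record is a single unwrapped line of six semicolon-separated fields, as the excerpt suggests. The field labels in the dict are my own, not official Torque names. The signal lookup assumes a POSIX system, where signal 9 is SIGKILL.

```python
import re
import signal
from datetime import datetime

KILL_RE = re.compile(r"kill_task: killing pid (\d+) task (\d+) with sig (\d+)")


def parse_line(line):
    """Split one log record into its six semicolon-separated fields."""
    stamp, code, daemon, kind, obj, message = line.split(";", 5)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "code": code,
        "daemon": daemon.strip(),
        "type": kind,
        "id": obj,
        "message": message,
    }


def kill_events(lines):
    """Return (time, pid, task, signal name) for every kill_task record."""
    events = []
    for line in lines:
        rec = parse_line(line)
        match = KILL_RE.search(rec["message"])
        if match:
            pid, task, sig = map(int, match.groups())
            events.append((rec["time"], pid, task, signal.Signals(sig).name))
    return events


# Three records from the MOM log above, with the mail wrapping undone.
sample = [
    "11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;"
    "job 50645.ribosome.cchmc.org started, pid = 2806",
    "11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;"
    "kill_task: killing pid 2877 task 2 with sig 9",
    "11/08/2005 11:23:46;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;Terminated",
]

events = kill_events(sample)
elapsed = parse_line(sample[-1])["time"] - parse_line(sample[0])["time"]
print(events)
print(int(elapsed.total_seconds()))
```

On these records the elapsed time between "started" and "Terminated" is 76 seconds, which matches the walltime of 00:01:16 in the server log. The kills logged by the MOM use signal 9 (SIGKILL), not the signal 15 (SIGTERM) that MPICH reported.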

That does not seem to help either.

In the syslog, just this one line repeats itself.

Nov  8 11:23:02 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:07 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:09 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:17 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:19 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:24 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:27 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:29 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:31 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:32 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:51 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:53 tyrosine kernel: eth1: freeing mc frame.
Nov  8 11:23:55 tyrosine kernel: eth1: freeing mc frame.

Any help?

Thanks,
Prakash

