[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Garrick Staples garrick at usc.edu
Mon Dec 5 15:46:02 MST 2005


On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run 
> at request of Scheduler at ribosome.cchmc.org

> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request 
> received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
> 11/08/2005 

I don't see an external job delete request in the server log...


> Here is the mom log:
> 
> 11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 
> 50645.ribosome.cchmc.org started, pid = 2806
> 11/08/2005 11:22:31;0008;   
> pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, 
> sid 2866, cmd /bin/sh
> 11/08/2005 11:23:37;0008;   
> pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 
> with sig 9

Increase MOM's loglevel above 4; it should then log why kill_task is
being called.
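
Once the more verbose log is available, the entries around the kill can
be pulled out mechanically. What follows is only a rough sketch, assuming
the semicolon-separated record layout visible in the quoted logs
(timestamp;event code;daemon;class;id;message) and assuming each record
sits on one line in the real log file (the mail client wrapped the quoted
ones). The names LogRecord, parse_record and kill_events are made up for
illustration and are not part of TORQUE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_id: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into its six fields; None if it does not fit."""
    # Limit the split so semicolons inside the message text are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def kill_events(lines: Iterable[str], job_id: str) -> List[LogRecord]:
    """Return the kill_task records logged against the given job id."""
    found = []
    for line in lines:
        rec = parse_record(line)
        if rec and rec.obj_id == job_id and rec.message.startswith("kill_task"):
            found.append(rec)
    return found


# The two complete MOM records quoted above, unwrapped to one line each.
SAMPLE = [
    "11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;"
    "job 50645.ribosome.cchmc.org started, pid = 2806",
    "11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;"
    "kill_task: killing pid 2877 task 2 with sig 9",
]

if __name__ == "__main__":
    for rec in kill_events(SAMPLE, "50645.ribosome.cchmc.org"):
        print(rec.timestamp, rec.message)
```

With a higher loglevel, the records logged just before the kill_task
entry are the ones worth reading; the filter above only locates the kill
itself.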

 
> Does not seem to help.
> 
> In the syslog, just this one line repeats itself.
> 
> Nov  8 11:23:02 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:07 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:09 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:17 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:19 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:24 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:27 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:29 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:31 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:32 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:51 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:53 tyrosine kernel: eth1: freeing mc frame.
> Nov  8 11:23:55 tyrosine kernel: eth1: freeing mc frame.

You've got ethernet driver problems.  I'd recommend using e100 instead
of eepro100.
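
To confirm which driver eth1 is actually bound to before and after the
switch, something like the following may help. It is a sketch that
assumes a kernel exposing sysfs with the usual
/sys/class/net/<iface>/device/driver symlink (2.6 and later); on older
kernels that path will not exist and the function returns None. The
function name bound_driver is hypothetical.

```python
import os
from typing import Optional


def bound_driver(iface: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the name of the kernel driver bound to iface, or None."""
    link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    if not os.path.islink(link):
        # No sysfs entry: interface missing, virtual, or sysfs unavailable.
        return None
    # The symlink points at the driver's directory; its basename is the name.
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    print(bound_driver("eth1"))
```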

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California