[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine
Garrick Staples
garrick at usc.edu
Mon Dec 5 15:46:02 MST 2005
On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run at request of Scheduler at ribosome.cchmc.org
> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
> 11/08/2005
I don't see an external job delete request in this server log excerpt...
> Here is the mom log:
>
> 11/08/2005 11:22:30;0001; pbs_mom;Job;TMomFinalizeJob3;job 50645.ribosome.cchmc.org started, pid = 2806
> 11/08/2005 11:22:31;0008; pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, sid 2866, cmd /bin/sh
> 11/08/2005 11:23:37;0008; pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 with sig 9
Increase MOM's loglevel above 4; it should then log why kill_task is being
called.
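
With the higher loglevel the MOM log gets much noisier, so a small filter
can help pick out what happened to one job. The Python below is only a
sketch: it assumes the semicolon-separated record layout seen in the
excerpts above (timestamp;event code;daemon;class;object;message) and that
each record sits on one line. The names MomRecord, parse_records and
events_for_job are made up for this example and are not part of TORQUE.

```python
from typing import Iterable, List, NamedTuple


class MomRecord(NamedTuple):
    """One log record, split on the first five semicolons."""
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_records(lines: Iterable[str]) -> List[MomRecord]:
    """Parse log lines, skipping any that do not have six fields."""
    records = []
    for line in lines:
        # maxsplit=5 keeps any ';' inside the message text intact.
        parts = line.strip().split(";", 5)
        if len(parts) != 6:
            continue
        # The daemon field is padded with spaces, so strip every field.
        records.append(MomRecord(*(p.strip() for p in parts)))
    return records


def events_for_job(records: Iterable[MomRecord], job_id: str,
                   keyword: str = "") -> List[MomRecord]:
    """Return records that mention job_id and whose message contains keyword."""
    return [
        r for r in records
        if (job_id in r.obj_name or job_id in r.message)
        and keyword in r.message
    ]


# Sample lines copied from the MOM log quoted in this thread.
SAMPLE = [
    "11/08/2005 11:22:30;0001; pbs_mom;Job;TMomFinalizeJob3;"
    "job 50645.ribosome.cchmc.org started, pid = 2806",
    "11/08/2005 11:22:31;0008; pbs_mom;Job;50645.ribosome.cchmc.org;"
    "start_process: task started, tid 2, sid 2866, cmd /bin/sh",
    "11/08/2005 11:23:37;0008; pbs_mom;Job;50645.ribosome.cchmc.org;"
    "kill_task: killing pid 2877 task 2 with sig 9",
]

if __name__ == "__main__":
    recs = parse_records(SAMPLE)
    for r in events_for_job(recs, "50645.ribosome.cchmc.org", "kill_task"):
        print(r.timestamp, r.message)
```

Run against the real log file instead of SAMPLE, the lines logged just
before each kill_task record are the ones worth reading first.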
> Does not seem to help.
>
> In the syslog, just this one line repeats itself.
>
> Nov 8 11:23:02 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:07 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:09 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:11 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:17 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:19 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:24 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:27 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:29 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:31 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:32 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:43 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:51 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:53 tyrosine kernel: eth1: freeing mc frame.
> Nov 8 11:23:55 tyrosine kernel: eth1: freeing mc frame.
You've got ethernet driver problems. I'd recommend using e100 instead
of eepro100.
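
To see how often the driver is complaining, and whether it stops after
switching drivers, the kernel messages can be tallied per host and
interface. This is a rough Python sketch that assumes the classic syslog
line layout shown in the quoted excerpt; the regular expression and the
tally_kernel_messages name are illustrative only.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   Nov 8 11:23:02 tyrosine kernel: eth1: freeing mc frame.
KERNEL_LINE = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+"   # month, day, time
    r"(?P<host>\S+)\s+kernel:\s+"          # hostname, then the kernel tag
    r"(?P<iface>\w+):\s+(?P<msg>.*)$"      # interface name and message
)


def tally_kernel_messages(lines: Iterable[str]) -> Counter:
    """Count kernel messages keyed by (host, interface, message)."""
    counts = Counter()
    for line in lines:
        m = KERNEL_LINE.match(line.strip())
        if m:
            counts[(m.group("host"), m.group("iface"), m.group("msg"))] += 1
    return counts


# A few lines copied from the syslog excerpt quoted above.
SYSLOG_SAMPLE = [
    "Nov 8 11:23:02 tyrosine kernel: eth1: freeing mc frame.",
    "Nov 8 11:23:07 tyrosine kernel: eth1: freeing mc frame.",
    "Nov 8 11:23:09 tyrosine kernel: eth1: freeing mc frame.",
]

if __name__ == "__main__":
    for key, count in tally_kernel_messages(SYSLOG_SAMPLE).items():
        print(count, *key)
```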
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California