[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Garrick Staples garrick at usc.edu
Mon Apr 17 17:34:17 MDT 2006


On Mon, Apr 17, 2006 at 11:12:50AM -0400, Prakash Velayutham alleged:
> Garrick Staples wrote:
> >On Mon, Apr 10, 2006 at 05:23:25PM -0400, Prakash Velayutham alleged:
> >  
> >>Garrick Staples wrote:
> >>    
> >>>On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
> >>> 
> >>>      
> >>>>11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run at request of Scheduler at ribosome.cchmc.org
> >>>>11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
> >>>>11/08/2005 
> >>>>   
> >>>>        
> >>>Don't see an external job delete...
> >>>
> >>> 
> >>>      
> >>>>Here is the mom log:
> >>>>
> >>>>11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 50645.ribosome.cchmc.org started, pid = 2806
> >>>>11/08/2005 11:22:31;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, sid 2866, cmd /bin/sh
> >>>>11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 with sig 9
> >>>>   
> >>>>        
> >>>Increase MOM's loglevel over 4, it should log why kill_task is being
> >>>called.
> >>>      
> >>Hi Garrick,
> >>
> >>I had not gotten time to test this earlier as I had been able to get 
> >>some work done with OpenPBS + mpiexec + MPICH. But now I am back to this 
> >>as I would really like the multi-server feature of Torque-2.
> >>
> >>I have Torque-2.0.0p8, MPICH-1.2.7p1, mpiexec-0.80.
> >>
> >>I tested with mom config as below:
> >>
> >>$logevent 255
> >>$loglevel 7
> >>
> >>Here is the relevant portion from the MS log.
> >>
> >>04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;for job 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
> >>    
> >The process is exiting without an error.  
> >
> >Are you running a command in the background or something?
> >
> >What happens if you try this in an interactive job?
> 
> Hi Garrick,
> 
> I am attaching the user's email on interactive job trial. She sees the 
> same errors (about rsh, and P4) during interactive job submission also.
> 
> 
> hi Prakash,
> 
> Yes I got both the p4 and rsh errors interactively also.
> I get these errors after the completion of the first mpiexec. Later calls to 
> mpiexec do not give this error. The code that runs with the first mpiexec 
> has a lot of data being sent and received between the child and mother 
> processors. Maybe this has nothing to do with it...just wanted to let 
> you know.
> 
> Rach
> 
> 
> User also confirmed that the results are exactly the same as she gets 
> with OpenPBS/MPICH/mpiexec combination except for the errors that come 
> along with them. So I am thinking of switching to Torque and telling the 
> users to just ignore the errors. Any suggestions?

I don't know what these error messages are, but it sounds like it has
nothing to do with TORQUE or mpiexec, so I don't really have any
suggestions here.
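
For anyone retracing the quoted MOM log for a single job, here is a minimal
Python sketch that pulls out every record mentioning a given job id. It assumes
the six semicolon-separated fields seen in the log lines quoted in this thread;
the field names (timestamp, event, daemon, obj_class, obj_name, message) and
the helper names parse_line and events_for_job are made up for illustration and
are not part of TORQUE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official TORQUE terminology.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def events_for_job(lines: Iterable[str], job_id: str) -> List[LogRecord]:
    """Return records whose object name or message mentions job_id."""
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip wrapped fragments and anything unparseable
        # Some records carry the job id in the object field, others only
        # in the message text (as in the scan_for_terminated line above).
        if job_id in rec.obj_name or job_id in rec.message:
            hits.append(rec)
    return hits
```

Feeding it the lines of a MOM log file (for example via open(path)) together
with a job id such as 50645.ribosome.cchmc.org returns the matching records in
file order, which makes it easier to see what was logged just before a
kill_task entry. Lines that were wrapped by a mail client will not parse and
are skipped, so run it on the original log file rather than on quoted mail.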

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California