[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Prakash Velayutham velayups at email.uc.edu
Mon Apr 17 09:12:50 MDT 2006


Garrick Staples wrote:
> On Mon, Apr 10, 2006 at 05:23:25PM -0400, Prakash Velayutham alleged:
>   
>> Garrick Staples wrote:
>>     
>>> On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
>>>  
>>>       
>>>> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run 
>>>> at request of Scheduler at ribosome.cchmc.org
>>>>    
>>>> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request 
>>>> received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
>>>> 11/08/2005 
>>>>    
>>>>         
>>> Don't see an external job delete...
>>>
>>>  
>>>       
>>>> Here is the mom log:
>>>>
>>>> 11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 
>>>> 50645.ribosome.cchmc.org started, pid = 2806
>>>> 11/08/2005 11:22:31;0008;   
>>>> pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, 
>>>> sid 2866, cmd /bin/sh
>>>> 11/08/2005 11:23:37;0008;   
>>>> pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 
>>>> with sig 9
>>>>    
>>>>         
>>> Increase MOM's loglevel over 4, it should log why kill_task is being
>>> called.
>>>       
>> Hi Garrick,
>>
>> I did not have time to test this earlier, as I had been able to get 
>> some work done with OpenPBS + mpiexec + MPICH. But now I am back to this, 
>> as I would really like the multi-server feature of Torque-2.
>>
>> I have Torque-2.0.0p8, MPICH-1.2.7p1, mpiexec-0.80.
>>
>> I tested with mom config as below:
>>
>> $logevent 255
>> $loglevel 7
>>
>> Here is the relevant portion from the MS log.
>>
>> 04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;for job 
>> 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
>>     
> The process is exiting without an error.  
>
> Are you running a command in the background or something?
>
> What happens if you try this in an interactive job?

Hi Garrick,

Below is the user's email about her interactive job trial. She sees the 
same errors (about rsh and P4) when running the job interactively as well.


Hi Prakash,

Yes, I got both the p4 and rsh errors interactively as well.
I get these errors after the first mpiexec completes. Later calls to 
mpiexec do not give this error. The code that runs with the first mpiexec 
sends and receives a lot of data between the child and mother 
processors. Maybe this has nothing to do with it... just wanted to let 
you know.

Rach


The user also confirmed that the results are exactly the same as those she 
gets with the OpenPBS/MPICH/mpiexec combination, apart from the errors that 
accompany them. So I am thinking of switching to Torque and telling the 
users to just ignore the errors. Any suggestions?
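
For anyone tracing the same symptom, here is a minimal Python sketch for pulling the kill_task and scan_for_terminated records for one job out of a mom log. It assumes the six semicolon-separated fields visible in the excerpts quoted above (timestamp, event code, daemon, class, object name, message) and that each record sits on one physical line in the log file (the mailer wrapped them here). The names MomRecord, parse_line and job_events are made up for illustration and are not part of Torque.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MomRecord(NamedTuple):
    """One log record, split on the layout seen in the quoted excerpts."""
    timestamp: str
    event: str
    daemon: str
    kind: str
    name: str
    message: str


def parse_line(line: str) -> Optional[MomRecord]:
    # Split on the first five semicolons only, so any ';' inside the
    # message text stays part of the message.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # not a record in the assumed layout
    return MomRecord(*(p.strip() for p in parts))


def job_events(lines: Iterable[str], job_id: str) -> Iterator[MomRecord]:
    """Yield kill_task and scan_for_terminated records that mention job_id."""
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        mentions_job = job_id in rec.name or job_id in rec.message
        is_kill = rec.message.startswith("kill_task")
        is_reap = rec.name == "scan_for_terminated"
        if mentions_job and (is_kill or is_reap):
            yield rec


# Sample records copied from this thread, rejoined onto single lines.
SAMPLE = [
    "11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;"
    "job 50645.ribosome.cchmc.org started, pid = 2806",
    "11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;"
    "kill_task: killing pid 2877 task 2 with sig 9",
    "04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;"
    "for job 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0",
]

if __name__ == "__main__":
    for rec in job_events(SAMPLE, "50645.ribosome.cchmc.org"):
        print(rec.timestamp, rec.message)
```

Pointing job_events at an open log file instead of SAMPLE should work the same way, since it only needs an iterable of lines.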

TIA,
Prakash

