[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Prakash Velayutham velayups at email.uc.edu
Tue Apr 11 16:24:27 MDT 2006


Garrick Staples wrote:
> On Mon, Apr 10, 2006 at 05:23:25PM -0400, Prakash Velayutham alleged:
>   
>> Garrick Staples wrote:
>>     
>>> On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
>>>  
>>>       
>>>> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run at request of Scheduler at ribosome.cchmc.org
>>>>    
>>>> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
>>>> 11/08/2005 
>>>>    
>>>>         
>>> Don't see an external job delete...
>>>
>>>  
>>>       
>>>> Here is the mom log:
>>>>
>>>> 11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 50645.ribosome.cchmc.org started, pid = 2806
>>>> 11/08/2005 11:22:31;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, sid 2866, cmd /bin/sh
>>>> 11/08/2005 11:23:37;0008;   pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 with sig 9
>>>>    
>>>>         
>>> Increase MOM's loglevel over 4, it should log why kill_task is being
>>> called.
>>>       
>> Hi Garrick,
>>
>> I had not had time to test this earlier, as I had been able to get 
>> some work done with OpenPBS + mpiexec + MPICH. But now I am back to this, 
>> as I would really like the multi-server feature of Torque-2.
>>
>> I have Torque-2.0.0p8, MPICH-1.2.7p1, mpiexec-0.80.
>>
>> I tested with mom config as below:
>>
>> $logevent 255
>> $loglevel 7
>>
>> Here is the relevant portion from the MS log.
>>
>> 04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;for job 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
>>     
>
> The process is exiting without an error.  
>
> Are you running a command in the background or something?
>
> What happens if you try this in an interactive job?
I will try an interactive job soon and let you know how it goes.

In the meantime, according to the user, the job does complete all of its 
steps successfully, except that a message similar to "p1_20540:  
p4_error: interrupt SIGx: 15" shows up in the .OUT file. The 
corresponding .ERR file contains the following. I do not yet know 
whether the user always gets both messages (in the .ERR and .OUT 
files). I can test that too.

***********************************************************************************
Timeout in waiting for processes to exit, 1 left.  This may be due to a 
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
Timeout in waiting for processes to exit, 1 left.  This may be due to a 
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
Timeout in waiting for processes to exit, 1 left.  This may be due to a 
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
Timeout in waiting for processes to exit, 1 left.  This may be due to a 
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
Timeout in waiting for processes to exit, 1 left.  This may be due to a 
defective
rsh program (Some versions of Kerberos rsh have been observed to have this
problem).
This is not a problem with P4 or MPICH but a problem with the operating
environment.  For many applications, this problem will only slow down
process termination.
************************************************************************************
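
To check whether both messages always show up together, something along 
these lines could be run against a job's output files. This is only a 
rough sketch: the function name and the sample text are made up for 
illustration, and the two marker strings are simply copied from the 
messages quoted in this mail.

```python
# Marker strings copied from the messages quoted in this mail.
P4_MARKER = "p4_error: interrupt SIGx: 15"
TIMEOUT_MARKER = "Timeout in waiting for processes to exit"


def summarize_job_output(out_text, err_text):
    """Report which of the two messages show up in a job's output.

    out_text and err_text are the contents of the job's .OUT and .ERR
    files (hypothetical helper, not part of Torque or MPICH).
    """
    return {
        # True if the p4_error line appears anywhere in the .OUT text.
        "p4_error_in_out": P4_MARKER in out_text,
        # Number of timeout warnings found in the .ERR text.
        "timeouts_in_err": err_text.count(TIMEOUT_MARKER),
    }


if __name__ == "__main__":
    # Made-up sample text standing in for real job files.
    sample_out = "step 3 done\np1_20540:  p4_error: interrupt SIGx: 15\n"
    sample_err = "Timeout in waiting for processes to exit, 1 left.\n" * 5
    print(summarize_job_output(sample_out, sample_err))
```

Running this over the output of several jobs at different dataset sizes 
would show whether the two messages really go together.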

The above error appears consistently when the user runs the code with a 
dataset size of 1500 (whatever that means). The point to note is that the 
same code with a smaller dataset (up to about 500) does not produce these 
errors or messages.

Hmm, that is weird.

Thanks,
Prakash

