[torqueusers] Torque with MPICH kills jobs consistently, but
OpenPBS works fine
velayups at email.uc.edu
Mon Apr 17 09:12:50 MDT 2006
Garrick Staples wrote:
> On Mon, Apr 10, 2006 at 05:23:25PM -0400, Prakash Velayutham alleged:
>> Garrick Staples wrote:
>>> On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
>>>> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run
>>>> at request of Scheduler at ribosome.cchmc.org
>>>> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request
>>>> received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
>>> Don't see an external job delete...
>>>> Here is the mom log:
>>>> 11/08/2005 11:22:30;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>>> 50645.ribosome.cchmc.org started, pid = 2806
>>>> 11/08/2005 11:22:31;0008;
>>>> pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2,
>>>> sid 2866, cmd /bin/sh
>>>> 11/08/2005 11:23:37;0008;
>>>> pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2
>>>> with sig 9
>>> Increase MOM's loglevel over 4, it should log why kill_task is being
>> Hi Garrick,
>> I had not gotten time to test this earlier as I had been able to get
>> some work done with OpenPBS + mpiexec + MPICH. But now I am back to this
>> as I would really like the multi-server feature of Torque-2.
>> I have Torque-2.0.0p8, MPICH-1.2.7p1, mpiexec-0.80.
>> I tested with mom config as below:
>> $logevent 255
>> $loglevel 7
>> Here is the relevant portion from the MS log.
>> 04/10/2006 16:43:02;0008; pbs_mom;Job;scan_for_terminated;for job
>> 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
> The process is exiting without an error.
> Are you running a command in the background or something?
> What happens if you try this in an interactive job?
I am attaching the user's email about her interactive job trial. She
sees the same errors (about rsh and P4) in an interactive job as well.
Yes, I got both the p4 and rsh errors interactively as well.
I get these errors after the first mpiexec completes. Later calls to
mpiexec do not give this error. The code that runs with the first
mpiexec sends and receives a lot of data between the child and mother
processors. Maybe this has nothing to do with it...just wanted to let
The user also confirmed that the results are exactly the same as those
she gets with the OpenPBS/MPICH/mpiexec combination, apart from the
errors that come along with them. So I am thinking of switching to
Torque and telling the users to just ignore the errors. Any suggestions?
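
For anyone retracing this from the MOM logs, here is a minimal Python
sketch (not part of Torque) that pulls the kill_task and
scan_for_terminated records for one job out of a pbs_mom log. It assumes
each record sits on a single line, unlike the mail-wrapped excerpts
above, and that the fields follow the semicolon-separated layout seen in
those excerpts (timestamp;event;daemon;class;id;message). The names
MomRecord, parse_record, job_events and exit_code are made up for
illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class MomRecord(NamedTuple):
    """One pbs_mom log record, split on the layout assumed above."""
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_id: str
    message: str


def parse_record(line: str) -> Optional[MomRecord]:
    # Split into at most six fields so semicolons inside the message survive.
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        return None  # not a record in the assumed layout
    return MomRecord(*parts)


def job_events(lines: Iterable[str], job_id: str) -> Iterator[MomRecord]:
    """Yield kill_task and scan_for_terminated records mentioning job_id."""
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        # The job id shows up in the id field or in the message text.
        if job_id not in rec.obj_id and job_id not in rec.message:
            continue
        if rec.message.startswith("kill_task") or rec.obj_id == "scan_for_terminated":
            yield rec


def exit_code(rec: MomRecord) -> Optional[int]:
    """Return the exitcode=N value from a record, if it has one."""
    match = re.search(r"exitcode=(-?\d+)", rec.message)
    return int(match.group(1)) if match else None


# Records taken from this thread, joined back onto single lines.
SAMPLE = [
    "11/08/2005 11:22:30;0001; pbs_mom;Job;TMomFinalizeJob3;"
    "job 50645.ribosome.cchmc.org started, pid = 2806",
    "11/08/2005 11:23:37;0008; pbs_mom;Job;50645.ribosome.cchmc.org;"
    "kill_task: killing pid 2877 task 2 with sig 9",
    "04/10/2006 16:43:02;0008; pbs_mom;Job;scan_for_terminated;"
    "for job 51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0",
]

for rec in job_events(SAMPLE, "51431.ribosome.cchmc.org"):
    print(rec.timestamp, rec.message, exit_code(rec))
```

Seeing the exit code next to any kill_task record for the same job makes
it easier to tell a clean task exit from a kill issued by the MOM.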