[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Prakash Velayutham velayups at email.uc.edu
Mon Apr 10 15:23:25 MDT 2006


Garrick Staples wrote:
> On Mon, Dec 05, 2005 at 05:12:08PM -0500, Prakash Velayutham alleged:
>   
>> 11/08/2005 11:23:32;0008;PBS_Server;Job;50645.ribosome.cchmc.org;Job Run 
>> at request of Scheduler at ribosome.cchmc.org
>>     
>> 11/08/2005 11:24:48;0100;PBS_Server;Req;;Type JobObituary request 
>> received from pbs_mom at tyrosine.bmicluster1.cchmc.org, sock=9
>> 11/08/2005 
>>     
> Don't see an external job delete...
>
>   
>> Here is the mom log:
>>
>> 11/08/2005 11:22:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 
>> 50645.ribosome.cchmc.org started, pid = 2806
>> 11/08/2005 11:22:31;0008;   
>> pbs_mom;Job;50645.ribosome.cchmc.org;start_process: task started, tid 2, 
>> sid 2866, cmd /bin/sh
>> 11/08/2005 11:23:37;0008;   
>> pbs_mom;Job;50645.ribosome.cchmc.org;kill_task: killing pid 2877 task 2 
>> with sig 9
>>     
> Increase MOM's loglevel over 4, it should log why kill_task is being
> called.
Hi Garrick,

I did not get time to test this earlier because I was able to get some 
work done with OpenPBS + mpiexec + MPICH. But now I am back to this, as 
I would really like to use the multi-server feature of Torque-2.

I have Torque-2.0.0p8, MPICH-1.2.7p1, mpiexec-0.80.

I tested with the following MOM config:

$logevent 255
$loglevel 7

Here is the relevant portion of the MS (mother superior) MOM log:

04/10/2006 16:43:01;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
sending to server "jobs=51431.ribosome.cchmc.org"
04/10/2006 16:43:01;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to ribosome-nfs
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20506 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20508 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20534 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20536 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20538 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20539 
pid=20539 cputime=1436 (cputfactor=1.000000)
04/10/2006 16:43:02;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20539 
pid=20540 cputime=2901 (cputfactor=1.000000)
04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;for job 
51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
04/10/2006 16:43:02;0008;   pbs_mom;Job;51431.ribosome.cchmc.org;sending 
signal 9 to task
04/10/2006 16:43:07;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;kill_task: killing pid 20540 task 2 
with sig 9
04/10/2006 16:43:07;0008;   pbs_mom;Svr;task_save;saving task in 
/var/spool/torque/mom_priv/jobs/51431.ribos.TK/0000000002
04/10/2006 16:43:07;0080;   
pbs_mom;Job;51431.ribosome.cchmc.org;scan_for_terminated: job 
51431.ribosome.cchmc.org task 2 terminated, sid 20539
04/10/2006 16:43:08;0008;   pbs_mom;Svr;task_save;saving task in 
/var/spool/torque/mom_priv/jobs/51431.ribos.TK/0000000002
04/10/2006 16:43:08;0008;   pbs_mom;Job;tcp_request;tcp_request: fd 11 
addr 127.0.0.1:30187
04/10/2006 16:43:08;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;matching task located, marking 
interface closed
04/10/2006 16:43:08;0080;   pbs_mom;Svr;close_conn;closed connection to 
fd 11 - num_connections=4
04/10/2006 16:43:08;0002;   pbs_mom;n/a;mom_main;hello sent to server 
arginine
04/10/2006 16:43:08;0002;   pbs_mom;n/a;mom_main;hello sent to server 
cysteine
04/10/2006 16:43:08;0002;   pbs_mom;n/a;mom_main;hello sent to server 
glutamine
04/10/2006 16:43:09;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20506 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:09;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=20506 
pid=20508 cputime=0 (cputfactor=1.000000)
04/10/2006 16:43:09;0008;   pbs_mom;Job;scan_for_terminated;for job 
51431.ribosome.cchmc.org, task 1, pid=20506, exitcode=0
04/10/2006 16:43:09;0008;   pbs_mom;Job;51431.ribosome.cchmc.org;sending 
signal 9 to task
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;kill_task: killing pid 20508 task 1 
with sig 9
04/10/2006 16:43:09;0008;   pbs_mom;Svr;task_save;saving task in 
/var/spool/torque/mom_priv/jobs/51431.ribos.TK/0000000001
04/10/2006 16:43:09;0080;   
pbs_mom;Job;51431.ribosome.cchmc.org;scan_for_terminated: job 
51431.ribosome.cchmc.org task 1 terminated, sid 20506
04/10/2006 16:43:09;0008;   pbs_mom;Job;51431.ribosome.cchmc.org;Terminated
04/10/2006 16:43:09;0008;   pbs_mom;Req;send_sisters;sending command 
KILL_JOB for job 51431.ribosome.cchmc.org (2)
04/10/2006 16:43:09;0008;   pbs_mom;Svr;task_save;saving task in 
/var/spool/torque/mom_priv/jobs/51431.ribos.TK/0000000001
04/10/2006 16:43:09;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
04/10/2006 16:43:09;0002;   pbs_mom;Svr;im_request;connect from 
192.168.2.106:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;received request 'ALL_OKAY' from 
192.168.2.106:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;KILL_JOB acknowledgement received
04/10/2006 16:43:09;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
04/10/2006 16:43:09;0002;   pbs_mom;Svr;im_request;connect from 
192.168.2.103:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;received request 'ALL_OKAY' from 
192.168.2.103:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;KILL_JOB acknowledgement received
04/10/2006 16:43:09;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
04/10/2006 16:43:09;0002;   pbs_mom;Svr;im_request;connect from 
192.168.2.104:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;received request 'ALL_OKAY' from 
192.168.2.104:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;KILL_JOB acknowledgement received
04/10/2006 16:43:09;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
04/10/2006 16:43:09;0002;   pbs_mom;Svr;im_request;connect from 
192.168.2.105:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;received request 'ALL_OKAY' from 
192.168.2.105:15027
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;KILL_JOB acknowledgement received
04/10/2006 16:43:09;0080;   pbs_mom;Job;51431.ribosome.cchmc.org;local 
task termination detected.  killing job
04/10/2006 16:43:09;0008;   pbs_mom;Job;51431.ribosome.cchmc.org;kill_job
04/10/2006 16:43:09;0002;   pbs_mom;n/a;run_pelog;userepilog script 
'/var/spool/torque/mom_priv/epilogue.precancel' does not exist (cwd: 
/var/rw/spool/torque/mom_priv)
04/10/2006 16:43:09;0008;   
pbs_mom;Job;51431.ribosome.cchmc.org;kill_job done
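
For anyone who wants to pull the termination trail for a single job out 
of a MOM log like the one above, here is a rough sketch in Python. It 
assumes the semicolon-separated layout visible in the log (timestamp; 
event code; daemon; class; object; message) and that any line not 
starting with a timestamp is a mail-wrapped continuation of the previous 
record. The names Record, parse_records and job_records are made up for 
illustration and are not part of Torque.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# A record starts with "MM/DD/YYYY HH:MM:SS;". Any other non-empty line is
# treated as a continuation of the previous record (assumption: the mail
# client wrapped the long log lines).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


class Record(NamedTuple):
    timestamp: str
    event: str      # hex event code as printed, e.g. "0008"
    daemon: str     # e.g. "pbs_mom"
    obj_class: str  # e.g. "Job", "Svr", "n/a"
    obj_name: str   # job id or function name, as printed
    message: str


def _to_record(parts: List[str]) -> Record:
    # Re-join wrapped pieces with one space, then split into at most 6 fields.
    fields = [f.strip() for f in " ".join(parts).split(";", 5)]
    fields += [""] * (6 - len(fields))  # pad truncated records
    return Record(*fields)


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per log entry, undoing mail line wrapping."""
    buf: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            if buf:
                yield _to_record(buf)
            buf = [line]
        elif buf:
            buf.append(line)
    if buf:
        yield _to_record(buf)


def job_records(records: Iterable[Record], job_id: str) -> List[Record]:
    """Keep records that name the job in the object or message field."""
    return [r for r in records if job_id in r.obj_name or job_id in r.message]


# Sample copied from the log above.
SAMPLE = """\
04/10/2006 16:43:02;0008;   pbs_mom;Job;scan_for_terminated;for job
51431.ribosome.cchmc.org, task 2, pid=20539, exitcode=0
04/10/2006 16:43:02;0008;   pbs_mom;Job;51431.ribosome.cchmc.org;sending
signal 9 to task
04/10/2006 16:43:07;0008;
pbs_mom;Job;51431.ribosome.cchmc.org;kill_task: killing pid 20540 task 2
with sig 9
04/10/2006 16:43:08;0002;   pbs_mom;n/a;mom_main;hello sent to server
arginine
"""

if __name__ == "__main__":
    for rec in job_records(parse_records(SAMPLE.splitlines()),
                           "51431.ribosome.cchmc.org"):
        print(rec.timestamp, "|", rec.obj_name, "|", rec.message)
```

Note that re-joining with a space cannot repair a line that was wrapped 
in the middle of a word, so paths broken across lines will still come 
out with a stray space in them.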


Any help or insight would be greatly appreciated.

TIA,
Prakash

