[torqueusers] qalter -lwalltime not propagated to slave moms

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon Nov 28 02:37:24 MST 2005


Dear All,

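at least on our cluster, it seems that walltime changes made with
qalter after a job has started are not correctly propagated to the
sister moms. As a consequence, parallel jobs started with Pete's
mpiexec get killed once the original walltime is exceeded. In the
logs below, the master mom records the modification at 10:26:47,
but the sister mom still enforces the original limit of 300 seconds.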


Kind regards,

Thomas Zeiser


==== Installed software ====================================
maui client version 3.2.6p13
torque 2.0.0p2
mpiexec-0.80

==== Job output ============================================
unrz at sfront03:> qsub -q iband -l nodes=2:ppn=2,walltime=00:05:00 -I
qsub: waiting for job 2411.sserver01 to start
qsub: job 2411.sserver01 ready
unrz at snode160:> mpirun -np 4 ../bin/lesocc-ib1.8
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
[1] Abort: [snode160:1] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81
 at line 1804 in file viacheck.c
mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 192.168.80.160, rank 1, killing all.
forrtl: error (78): process killed (SIGTERM)
mpiexec: Warning: tasks 0,2-3 exited with status 1.
mpiexec: Warning: task 1 exited with status 252.

==== Changes made as root ==================================
sserver01:~# qstat -au unrz
sserver01: 
Job ID             Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
------------------ -------- -------- ---------- ------ --- --- ------ ----- - -----
2411.sserver01       unrz  iband    STDIN         --      2  --    --  00:05 R 00:01

sserver01:~# qalter -lwalltime=1:00:00 2411
(executed at 10:26)

sserver01:~#  qstat -au unrz
sserver01: 
Job ID             Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
------------------ -------- -------- ---------- ------ --- --- ------ ----- - -----
2411.sserver01       unrz  iband    STDIN         --      2  --    --  01:00 R 00:02

==== Master node ===========================================
11/28/2005 10:23:52;0008;   pbs_mom;Job;2411.sserver01;Job Modified at request of PBS_Server at sserver01
11/28/2005 10:23:53;0001;   pbs_mom;Job;TMomFinalizeJob3;job 2411.sserver01 started, pid = 31166
11/28/2005 10:25:20;0008;   pbs_mom;Job;2411.sserver01;start_process: task started, tid 2, sid 31305, cmd /bin/sh
11/28/2005 10:25:20;0008;   pbs_mom;Job;2411.sserver01;start_process: task started, tid 3, sid 31306, cmd /bin/sh
11/28/2005 10:26:47;0008;   pbs_mom;Job;2411.sserver01;Job Modified at request of PBS_Server at sserver01
11/28/2005 10:29:25;0008;   pbs_mom;Job;2411.sserver01;kill_task: killing pid 31305 task 2 with sig 9
11/28/2005 10:29:25;0008;   pbs_mom;Job;2411.sserver01;kill_task: not killing pid 0 with sig 9

==== Sister node ===========================================
11/28/2005 10:23:53;0008;   pbs_mom;Job;2411.sserver01;JOIN JOB as node 1
11/28/2005 10:25:20;0008;   pbs_mom;Job;2411.sserver01;start_process: task started, tid 4, sid 7448, cmd /bin/sh
11/28/2005 10:25:20;0008;   pbs_mom;Job;2411.sserver01;start_process: task started, tid 5, sid 7453, cmd /bin/sh
11/28/2005 10:29:20;0008;   pbs_mom;Job;2411.sserver01;walltime 328 exceeded limit 300
11/28/2005 10:29:20;0008;   pbs_mom;Job;2411.sserver01;kill_task: killing pid 7448 task 4 with sig 15
11/28/2005 10:29:20;0008;   pbs_mom;Job;2411.sserver01;kill_task: killing pid 7453 task 5 with sig 15
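
==== Sketch: checking a mom log for a stale limit ===========
For anyone who wants to check their own sister mom logs for the same
symptom, here is a minimal Python sketch. It assumes the log line
format shown in the excerpts above; the helper names
(walltime_to_seconds, stale_limits) are made up for illustration and
are not part of torque. It simply flags "walltime ... exceeded limit"
messages whose enforced limit is lower than the walltime requested
with qalter.

```python
import re

# Matches the message seen in the sister mom log above, e.g.
# "...;2411.sserver01;walltime 328 exceeded limit 300"
# (format assumed from that excerpt only).
EXCEEDED = re.compile(
    r";(?P<job>[^;]+);walltime (?P<used>\d+) exceeded limit (?P<limit>\d+)$"
)


def walltime_to_seconds(spec):
    """Convert a walltime string such as '1:00:00' or '00:05:00' to seconds."""
    seconds = 0
    for part in spec.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def stale_limits(log_lines, requested):
    """Yield (job, used, limit) where the enforced limit is below `requested`."""
    wanted = walltime_to_seconds(requested)
    for line in log_lines:
        match = EXCEEDED.search(line.strip())
        if match and int(match.group("limit")) < wanted:
            yield match.group("job"), int(match.group("used")), int(match.group("limit"))


# Two lines copied from the sister node log above.
sister_log = [
    "11/28/2005 10:23:53;0008;   pbs_mom;Job;2411.sserver01;JOIN JOB as node 1",
    "11/28/2005 10:29:20;0008;   pbs_mom;Job;2411.sserver01;walltime 328 exceeded limit 300",
]

for job, used, limit in stale_limits(sister_log, "1:00:00"):
    print(f"{job}: killed after {used}s, enforced limit {limit}s, "
          f"requested {walltime_to_seconds('1:00:00')}s")
```

With the qalter value of 1:00:00 (3600 seconds), the sister log line
is reported because the limit it enforced was still 300 seconds.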

-- 
Dipl.-Ing. Thomas ZEISER
Regionales Rechenzentrum Erlangen / GERMANY

