[torqueusers] jobs termination pbsdsh
Delphine Ramalingom
delphine.ramalingom at univ-reunion.fr
Thu Nov 22 21:42:11 MST 2012
Hi everybody,
Some of our jobs (not all) terminate before the requested walltime
when we use pbsdsh, and we get a message in the /var/spool/mail file
saying that these jobs have exceeded the wallclock time.
Is there a reason for this that I am not aware of? Can you help me?
We are using:
- Maui - version 3.3
- Torque - version 4.0.2
Here is the tracejob output for one of these jobs:
Job: 446.metis.univ.run
11/22/2012 13:51:33 S enqueuing into pbsdsh, state 1 hop 1
11/22/2012 13:51:33 S Job Queued at request of smahajan at metis.univ.run,
owner = smahajan at metis.univ.run, job name =
test_20_40, queue = pbsdsh
11/22/2012 13:51:33 A queue=pbsdsh
11/22/2012 13:51:35 S Job Run at request of root at metis.univ.run
11/22/2012 13:51:35 M start_process: task started, tid 2, sid 5479, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35 M start_process: task started, tid 3, sid 5487, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35 M start_process: task started, tid 4, sid 5505, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35 M start_process: task started, tid 5, sid 5519, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35 M start_process: task started, tid 6, sid 5545, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35 A user=smahajan group=DSIMB jobname=test_20_40
queue=pbsdsh ctime=1353577893 qtime=1353577893
etime=1353577893 start=1353577895
owner=smahajan at metis.univ.run
exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
Resource_List.mem=8gb
Resource_List.neednodes=1:ppn=8
Resource_List.nodect=1
Resource_List.nodes=1:ppn=8
Resource_List.walltime=24:00:00
11/22/2012 13:51:36 M start_process: task started, tid 7, sid 5594, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:36 M start_process: task started, tid 8, sid 5667, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:36 M start_process: task started, tid 9, sid 5749, cmd
/labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 19:05:07 M scan_for_terminated: job 446.metis.univ.run task 2
terminated, sid=5479
11/22/2012 19:05:07 M scan_for_terminated: job 446.metis.univ.run task 1
terminated, sid=5452
11/22/2012 19:05:07 M kill_task: killing pid 5487 task 3 gracefully
with sig
15
11/22/2012 19:05:07 M kill_task: process (pid=5487/state=Z) after sig 15
11/22/2012 19:05:07 M kill_task: killing pid 5500 task 3 gracefully
with sig
15
11/22/2012 19:05:07 M kill_task: process (pid=5500/state=Z) after sig 15
11/22/2012 19:05:07 M kill_task: killing pid 5505 task 4 gracefully
with sig
15
11/22/2012 19:05:07 M kill_task: process (pid=5505/state=R) after sig 15
11/22/2012 19:05:07 M kill_task: process (pid=5505/state=Z) after sig 15
11/22/2012 19:05:07 M kill_task: killing pid 5515 task 4 gracefully
with sig
15
11/22/2012 19:05:07 M kill_task: process (pid=5515/state=R) after sig 15
11/22/2012 19:05:08 M kill_task: killing pid 5519 task 5 gracefully
with sig
15
11/22/2012 19:05:08 M kill_task: process (pid=5519/state=S) after sig 15
11/22/2012 19:05:08 M kill_task: process (pid=5519/state=Z) after sig 15
11/22/2012 19:05:08 M kill_task: killing pid 5537 task 5 gracefully
with sig
15
11/22/2012 19:05:08 M kill_task: process (pid=5537/state=R) after sig 15
11/22/2012 19:05:08 M kill_task: killing pid 5545 task 6 gracefully
with sig
15
11/22/2012 19:05:08 M kill_task: process (pid=5545/state=R) after sig 15
11/22/2012 19:05:09 M kill_task: process (pid=5545/state=Z) after sig 15
11/22/2012 19:05:09 M kill_task: killing pid 5572 task 6 gracefully
with sig
15
11/22/2012 19:05:09 M kill_task: process (pid=5572/state=R) after sig 15
11/22/2012 19:05:09 M kill_task: killing pid 5594 task 7 gracefully
with sig
15
11/22/2012 19:05:09 M kill_task: process (pid=5594/state=S) after sig 15
11/22/2012 19:05:09 M kill_task: process (pid=5594/state=Z) after sig 15
11/22/2012 19:05:09 M kill_task: killing pid 5633 task 7 gracefully
with sig
15
11/22/2012 19:05:09 M kill_task: process (pid=5633/state=R) after sig 15
11/22/2012 19:05:09 M kill_task: killing pid 5667 task 8 gracefully
with sig
15
11/22/2012 19:05:09 M kill_task: process (pid=5667/state=S) after sig 15
11/22/2012 19:05:10 M kill_task: process (pid=5667/state=Z) after sig 15
11/22/2012 19:05:10 M kill_task: killing pid 5715 task 8 gracefully
with sig
15
11/22/2012 19:05:10 M kill_task: process (pid=5715/state=R) after sig 15
11/22/2012 19:05:10 M kill_task: killing pid 5749 task 9 gracefully
with sig
15
11/22/2012 19:05:10 M kill_task: process (pid=5749/state=S) after sig 15
11/22/2012 19:05:10 M kill_task: process (pid=5749/state=Z) after sig 15
11/22/2012 19:05:10 M kill_task: killing pid 5807 task 9 gracefully
with sig
15
11/22/2012 19:05:10 M kill_task: process (pid=5807/state=R) after sig 15
11/22/2012 19:05:11 S Not sending email: User does not want mail of this
type.
11/22/2012 19:05:11 S Exit_status=2
11/22/2012 19:05:11 S dequeuing from pbsdsh, state COMPLETE
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 3
terminated, sid=5487
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 4
terminated, sid=5505
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 5
terminated, sid=5519
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 6
terminated, sid=5545
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 7
terminated, sid=5594
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 8
terminated, sid=5667
11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 9
terminated, sid=5749
11/22/2012 19:05:11 M obit sent to server
11/22/2012 19:05:11 S on_job_exit valid pjob: 0x7f9a7c0ae6e0
(substate=50)
11/22/2012 19:05:11 M removed job script
11/22/2012 19:05:11 A user=smahajan group=DSIMB jobname=test_20_40
queue=pbsdsh ctime=1353577893 qtime=1353577893
etime=1353577893 start=1353577895
owner=smahajan at metis.univ.run
exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
Resource_List.mem=8gb
Resource_List.neednodes=1:ppn=8
Resource_List.nodect=1
Resource_List.nodes=1:ppn=8
Resource_List.walltime=24:00:00 session=5452
end=1353596711 Exit_status=2
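
As a quick sanity check on the accounting record above, the elapsed run time can be compared with the requested walltime. This is only a minimal sketch: the start, end and Resource_List.walltime values are copied from the record for job 446, while the helper names (hms_to_seconds, seconds_to_hms) are made up for illustration and are not part of Torque or Maui.

```python
# Compare the elapsed run time of job 446 with its requested walltime.
# The three values below are copied from the accounting ("A") record;
# the helper functions are hypothetical, written only for this check.

def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string (as in Resource_List.walltime) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total: int) -> str:
    """Convert a number of seconds to an HH:MM:SS string."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


start = 1353577895      # start= field (epoch seconds)
end = 1353596711        # end= field (epoch seconds)
walltime = "24:00:00"   # Resource_List.walltime

elapsed = end - start
limit = hms_to_seconds(walltime)

print("elapsed :", seconds_to_hms(elapsed))
print("limit   :", seconds_to_hms(limit))
print("exceeded:", elapsed > limit)
```

For this job the elapsed time comes out at 05:13:36, which matches the log timestamps (13:51:35 to 19:05:11) and is far below the 24:00:00 limit, so the "exceeded wallclock time" message does not agree with what the accounting record shows.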
Regards,
Delphine