[torqueusers] jobs termination pbsdsh

Delphine Ramalingom delphine.ramalingom at univ-reunion.fr
Thu Nov 22 21:42:11 MST 2012


Hi everybody,

Some of our jobs (not all) terminate before the requested walltime
when we use pbsdsh, and we receive a message in the /var/spool/mail file
saying that these jobs exceeded the wallclock time.
Is there a reason for this that I am not aware of? Can you help me?

We use:
- Maui version 3.3
- Torque version 4.0.2

Here is the tracejob output for one of these jobs:

Job: 446.metis.univ.run

11/22/2012 13:51:33  S    enqueuing into pbsdsh, state 1 hop 1
11/22/2012 13:51:33  S    Job Queued at request of smahajan at metis.univ.run,
                           owner = smahajan at metis.univ.run, job name =
                           test_20_40, queue = pbsdsh
11/22/2012 13:51:33  A    queue=pbsdsh
11/22/2012 13:51:35  S    Job Run at request of root at metis.univ.run
11/22/2012 13:51:35  M    start_process: task started, tid 2, sid 5479, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35  M    start_process: task started, tid 3, sid 5487, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35  M    start_process: task started, tid 4, sid 5505, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35  M    start_process: task started, tid 5, sid 5519, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35  M    start_process: task started, tid 6, sid 5545, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:35  A    user=smahajan group=DSIMB jobname=test_20_40
                           queue=pbsdsh ctime=1353577893 qtime=1353577893
                           etime=1353577893 start=1353577895
                           owner=smahajan at metis.univ.run
                           exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
                           Resource_List.mem=8gb Resource_List.neednodes=1:ppn=8
                           Resource_List.nodect=1 Resource_List.nodes=1:ppn=8
                           Resource_List.walltime=24:00:00
11/22/2012 13:51:36  M    start_process: task started, tid 7, sid 5594, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:36  M    start_process: task started, tid 8, sid 5667, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 13:51:36  M    start_process: task started, tid 9, sid 5749, cmd
                           /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
11/22/2012 19:05:07  M    scan_for_terminated: job 446.metis.univ.run task 2
                           terminated, sid=5479
11/22/2012 19:05:07  M    scan_for_terminated: job 446.metis.univ.run task 1
                           terminated, sid=5452
11/22/2012 19:05:07  M    kill_task: killing pid 5487 task 3 gracefully with sig 15
11/22/2012 19:05:07  M    kill_task: process (pid=5487/state=Z) after sig 15
11/22/2012 19:05:07  M    kill_task: killing pid 5500 task 3 gracefully with sig 15
11/22/2012 19:05:07  M    kill_task: process (pid=5500/state=Z) after sig 15
11/22/2012 19:05:07  M    kill_task: killing pid 5505 task 4 gracefully with sig 15
11/22/2012 19:05:07  M    kill_task: process (pid=5505/state=R) after sig 15
11/22/2012 19:05:07  M    kill_task: process (pid=5505/state=Z) after sig 15
11/22/2012 19:05:07  M    kill_task: killing pid 5515 task 4 gracefully with sig 15
11/22/2012 19:05:07  M    kill_task: process (pid=5515/state=R) after sig 15
11/22/2012 19:05:08  M    kill_task: killing pid 5519 task 5 gracefully with sig 15
11/22/2012 19:05:08  M    kill_task: process (pid=5519/state=S) after sig 15
11/22/2012 19:05:08  M    kill_task: process (pid=5519/state=Z) after sig 15
11/22/2012 19:05:08  M    kill_task: killing pid 5537 task 5 gracefully with sig 15
11/22/2012 19:05:08  M    kill_task: process (pid=5537/state=R) after sig 15
11/22/2012 19:05:08  M    kill_task: killing pid 5545 task 6 gracefully with sig 15
11/22/2012 19:05:08  M    kill_task: process (pid=5545/state=R) after sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5545/state=Z) after sig 15
11/22/2012 19:05:09  M    kill_task: killing pid 5572 task 6 gracefully with sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5572/state=R) after sig 15
11/22/2012 19:05:09  M    kill_task: killing pid 5594 task 7 gracefully with sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5594/state=S) after sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5594/state=Z) after sig 15
11/22/2012 19:05:09  M    kill_task: killing pid 5633 task 7 gracefully with sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5633/state=R) after sig 15
11/22/2012 19:05:09  M    kill_task: killing pid 5667 task 8 gracefully with sig 15
11/22/2012 19:05:09  M    kill_task: process (pid=5667/state=S) after sig 15
11/22/2012 19:05:10  M    kill_task: process (pid=5667/state=Z) after sig 15
11/22/2012 19:05:10  M    kill_task: killing pid 5715 task 8 gracefully with sig 15
11/22/2012 19:05:10  M    kill_task: process (pid=5715/state=R) after sig 15
11/22/2012 19:05:10  M    kill_task: killing pid 5749 task 9 gracefully with sig 15
11/22/2012 19:05:10  M    kill_task: process (pid=5749/state=S) after sig 15
11/22/2012 19:05:10  M    kill_task: process (pid=5749/state=Z) after sig 15
11/22/2012 19:05:10  M    kill_task: killing pid 5807 task 9 gracefully with sig 15
11/22/2012 19:05:10  M    kill_task: process (pid=5807/state=R) after sig 15
11/22/2012 19:05:11  S    Not sending email: User does not want mail of this
                           type.
11/22/2012 19:05:11  S    Exit_status=2
11/22/2012 19:05:11  S    dequeuing from pbsdsh, state COMPLETE
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 3
                           terminated, sid=5487
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 4
                           terminated, sid=5505
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 5
                           terminated, sid=5519
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 6
                           terminated, sid=5545
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 7
                           terminated, sid=5594
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 8
                           terminated, sid=5667
11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 9
                           terminated, sid=5749
11/22/2012 19:05:11  M    obit sent to server
11/22/2012 19:05:11  S    on_job_exit valid pjob: 0x7f9a7c0ae6e0 (substate=50)
11/22/2012 19:05:11  M    removed job script
11/22/2012 19:05:11  A    user=smahajan group=DSIMB jobname=test_20_40
                           queue=pbsdsh ctime=1353577893 qtime=1353577893
                           etime=1353577893 start=1353577895
                           owner=smahajan at metis.univ.run
                           exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
                           Resource_List.mem=8gb Resource_List.neednodes=1:ppn=8
                           Resource_List.nodect=1 Resource_List.nodes=1:ppn=8
                           Resource_List.walltime=24:00:00 session=5452
                           end=1353596711 Exit_status=2
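
As a sanity check on the timestamps in the accounting record above, the
following minimal Python sketch computes how long job 446 actually ran and
compares it with the requested walltime. It is not part of Torque or Maui;
the helper names (parse_fields, hms_to_seconds, seconds_to_hms) are made up
for illustration, and the record string is copied by hand from the tracejob
output.

```python
# Fields copied by hand from the final accounting ("A") record of job 446.
record = (
    "start=1353577895 end=1353596711 "
    "Resource_List.walltime=24:00:00 Exit_status=2"
)


def parse_fields(text):
    """Split whitespace-separated key=value tokens into a dict (hypothetical helper)."""
    return dict(tok.split("=", 1) for tok in text.split() if "=" in tok)


def hms_to_seconds(hms):
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Format a number of seconds as HH:MM:SS."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


fields = parse_fields(record)
elapsed = int(fields["end"]) - int(fields["start"])   # epoch seconds
limit = hms_to_seconds(fields["Resource_List.walltime"])

print(f"elapsed {seconds_to_hms(elapsed)} of {fields['Resource_List.walltime']} requested")
print("within walltime" if elapsed < limit else "walltime exceeded")
```

For this record the elapsed time is 05:13:36, which agrees with the
13:51:35 and 19:05:11 timestamps in the log and is well below the
24:00:00 that was requested.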

Regards,
Delphine

