[torqueusers] jobs termination pbsdsh
Ken Nielson
knielson at adaptivecomputing.com
Wed Nov 28 12:07:31 MST 2012
We know of this bug.
Regards
Ken
On Thu, Nov 22, 2012 at 9:42 PM, Delphine Ramalingom <delphine.ramalingom at univ-reunion.fr> wrote:
> Hi everybody,
>
> Some of our jobs (not all) terminate before the requested walltime
> when we use pbsdsh, and we get a message in the /var/spool/mail file
> saying that these jobs have exceeded the wallclock time.
> Is there a reason for this that I am not aware of? Can you help me?
>
> We use:
> - Maui - version 3.3.
> - Torque - version 4.0.2
>
> When I run tracejob, I get:
>
> Job: 446.metis.univ.run
>
> 11/22/2012 13:51:33 S enqueuing into pbsdsh, state 1 hop 1
> 11/22/2012 13:51:33 S Job Queued at request of smahajan at metis.univ.run, owner = smahajan at metis.univ.run, job name = test_20_40, queue = pbsdsh
> 11/22/2012 13:51:33 A queue=pbsdsh
> 11/22/2012 13:51:35 S Job Run at request of root at metis.univ.run
> 11/22/2012 13:51:35 M start_process: task started, tid 2, sid 5479, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35 M start_process: task started, tid 3, sid 5487, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35 M start_process: task started, tid 4, sid 5505, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35 M start_process: task started, tid 5, sid 5519, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35 M start_process: task started, tid 6, sid 5545, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35 A user=smahajan group=DSIMB jobname=test_20_40
>     queue=pbsdsh ctime=1353577893 qtime=1353577893
>     etime=1353577893 start=1353577895
>     owner=smahajan at metis.univ.run
>     exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
>     Resource_List.mem=8gb
>     Resource_List.neednodes=1:ppn=8
>     Resource_List.nodect=1
>     Resource_List.nodes=1:ppn=8
>     Resource_List.walltime=24:00:00
> 11/22/2012 13:51:36 M start_process: task started, tid 7, sid 5594, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:36 M start_process: task started, tid 8, sid 5667, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:36 M start_process: task started, tid 9, sid 5749, cmd /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 19:05:07 M scan_for_terminated: job 446.metis.univ.run task 2 terminated, sid=5479
> 11/22/2012 19:05:07 M scan_for_terminated: job 446.metis.univ.run task 1 terminated, sid=5452
> 11/22/2012 19:05:07 M kill_task: killing pid 5487 task 3 gracefully with sig 15
> 11/22/2012 19:05:07 M kill_task: process (pid=5487/state=Z) after sig 15
> 11/22/2012 19:05:07 M kill_task: killing pid 5500 task 3 gracefully with sig 15
> 11/22/2012 19:05:07 M kill_task: process (pid=5500/state=Z) after sig 15
> 11/22/2012 19:05:07 M kill_task: killing pid 5505 task 4 gracefully with sig 15
> 11/22/2012 19:05:07 M kill_task: process (pid=5505/state=R) after sig 15
> 11/22/2012 19:05:07 M kill_task: process (pid=5505/state=Z) after sig 15
> 11/22/2012 19:05:07 M kill_task: killing pid 5515 task 4 gracefully with sig 15
> 11/22/2012 19:05:07 M kill_task: process (pid=5515/state=R) after sig 15
> 11/22/2012 19:05:08 M kill_task: killing pid 5519 task 5 gracefully with sig 15
> 11/22/2012 19:05:08 M kill_task: process (pid=5519/state=S) after sig 15
> 11/22/2012 19:05:08 M kill_task: process (pid=5519/state=Z) after sig 15
> 11/22/2012 19:05:08 M kill_task: killing pid 5537 task 5 gracefully with sig 15
> 11/22/2012 19:05:08 M kill_task: process (pid=5537/state=R) after sig 15
> 11/22/2012 19:05:08 M kill_task: killing pid 5545 task 6 gracefully with sig 15
> 11/22/2012 19:05:08 M kill_task: process (pid=5545/state=R) after sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5545/state=Z) after sig 15
> 11/22/2012 19:05:09 M kill_task: killing pid 5572 task 6 gracefully with sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5572/state=R) after sig 15
> 11/22/2012 19:05:09 M kill_task: killing pid 5594 task 7 gracefully with sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5594/state=S) after sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5594/state=Z) after sig 15
> 11/22/2012 19:05:09 M kill_task: killing pid 5633 task 7 gracefully with sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5633/state=R) after sig 15
> 11/22/2012 19:05:09 M kill_task: killing pid 5667 task 8 gracefully with sig 15
> 11/22/2012 19:05:09 M kill_task: process (pid=5667/state=S) after sig 15
> 11/22/2012 19:05:10 M kill_task: process (pid=5667/state=Z) after sig 15
> 11/22/2012 19:05:10 M kill_task: killing pid 5715 task 8 gracefully with sig 15
> 11/22/2012 19:05:10 M kill_task: process (pid=5715/state=R) after sig 15
> 11/22/2012 19:05:10 M kill_task: killing pid 5749 task 9 gracefully with sig 15
> 11/22/2012 19:05:10 M kill_task: process (pid=5749/state=S) after sig 15
> 11/22/2012 19:05:10 M kill_task: process (pid=5749/state=Z) after sig 15
> 11/22/2012 19:05:10 M kill_task: killing pid 5807 task 9 gracefully with sig 15
> 11/22/2012 19:05:10 M kill_task: process (pid=5807/state=R) after sig 15
> 11/22/2012 19:05:11 S Not sending email: User does not want mail of this type.
> 11/22/2012 19:05:11 S Exit_status=2
> 11/22/2012 19:05:11 S dequeuing from pbsdsh, state COMPLETE
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 3 terminated, sid=5487
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 4 terminated, sid=5505
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 5 terminated, sid=5519
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 6 terminated, sid=5545
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 7 terminated, sid=5594
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 8 terminated, sid=5667
> 11/22/2012 19:05:11 M scan_for_terminated: job 446.metis.univ.run task 9 terminated, sid=5749
> 11/22/2012 19:05:11 M obit sent to server
> 11/22/2012 19:05:11 S on_job_exit valid pjob: 0x7f9a7c0ae6e0 (substate=50)
> 11/22/2012 19:05:11 M removed job script
> 11/22/2012 19:05:11 A user=smahajan group=DSIMB jobname=test_20_40
>     queue=pbsdsh ctime=1353577893 qtime=1353577893
>     etime=1353577893 start=1353577895
>     owner=smahajan at metis.univ.run
>     exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
>     Resource_List.mem=8gb
>     Resource_List.neednodes=1:ppn=8
>     Resource_List.nodect=1
>     Resource_List.nodes=1:ppn=8
>     Resource_List.walltime=24:00:00 session=5452
>     end=1353596711 Exit_status=2
>
> Regards,
> Delphine
>
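The final accounting record in the quoted tracejob output carries both the start= and end= epoch timestamps and the requested Resource_List.walltime, so the actual run time can be checked against the request with simple arithmetic. The following is a minimal sketch of that check; the helper names (parse_hms, format_hms, elapsed_seconds, requested_walltime) are illustrative assumptions and not part of TORQUE or tracejob, and RECORD holds only the relevant fields copied from the log above.

```python
import re

# Relevant fields copied from the end-of-job accounting record for job 446.
RECORD = (
    "start=1353577895 Resource_List.walltime=24:00:00 "
    "session=5452 end=1353596711 Exit_status=2"
)


def parse_hms(text):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def elapsed_seconds(record):
    """Return end minus start, both read as epoch seconds from the record."""
    start = int(re.search(r"\bstart=(\d+)", record).group(1))
    end = int(re.search(r"\bend=(\d+)", record).group(1))
    return end - start


def requested_walltime(record):
    """Return the requested walltime in seconds."""
    match = re.search(r"Resource_List\.walltime=(\d+:\d+:\d+)", record)
    return parse_hms(match.group(1))


elapsed = elapsed_seconds(RECORD)
requested = requested_walltime(RECORD)
print(f"elapsed {format_hms(elapsed)} of requested {format_hms(requested)}")
```

For job 446 this gives an elapsed time of 05:13:36 against a requested 24:00:00, which agrees with the 13:51:35 and 19:05:11 timestamps in the log: the job was ended well before its walltime limit.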