[torqueusers] jobs termination pbsdsh

Ken Nielson knielson at adaptivecomputing.com
Wed Nov 28 12:07:31 MST 2012


We know of this bug.
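
For anyone comparing their own jobs against this report: the accounting records quoted below carry start and end timestamps, so the actual run time can be checked against the requested walltime. A minimal Python sketch, using the values from the tracejob output for job 446 (the helper name parse_walltime is made up for illustration and is not part of Torque):

```python
from datetime import timedelta


def parse_walltime(text):
    """Convert an HH:MM:SS walltime string into a timedelta (hypothetical helper)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Values copied from the accounting ("A") records in the quoted tracejob output.
start = 1353577895                      # start=... (epoch seconds)
end = 1353596711                        # end=...   (epoch seconds)
requested = parse_walltime("24:00:00")  # Resource_List.walltime

elapsed = timedelta(seconds=end - start)

print("elapsed:", elapsed)
print("requested seconds:", int(requested.total_seconds()))
print("exceeded walltime:", elapsed > requested)
```

With these values the elapsed time is 5:13:36 against a 24-hour request, so the job ended well short of its walltime even though the mail message says the limit was exceeded.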

Regards

Ken

On Thu, Nov 22, 2012 at 9:42 PM, Delphine Ramalingom <
delphine.ramalingom at univ-reunion.fr> wrote:

> Hi everybody,
>
> Some of our jobs (not all) terminate before the requested walltime
> when we use pbsdsh, and we get a message in the /var/spool/mail file
> saying that these jobs have exceeded the wallclock time.
> Is there a reason for this that I am not aware of? Can you help me?
>
> We use:
> - Maui - version 3.3.
> - Torque - version 4.0.2
>
> When I run tracejob, I get:
>
> Job: 446.metis.univ.run
>
> 11/22/2012 13:51:33  S    enqueuing into pbsdsh, state 1 hop 1
> 11/22/2012 13:51:33  S    Job Queued at request of smahajan at metis.univ.run,
>                            owner = smahajan at metis.univ.run, job name =
>                            test_20_40, queue = pbsdsh
> 11/22/2012 13:51:33  A    queue=pbsdsh
> 11/22/2012 13:51:35  S    Job Run at request of root at metis.univ.run
> 11/22/2012 13:51:35  M    start_process: task started, tid 2, sid 5479, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35  M    start_process: task started, tid 3, sid 5487, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35  M    start_process: task started, tid 4, sid 5505, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35  M    start_process: task started, tid 5, sid 5519, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35  M    start_process: task started, tid 6, sid 5545, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:35  A    user=smahajan group=DSIMB jobname=test_20_40
>                            queue=pbsdsh ctime=1353577893 qtime=1353577893
>                            etime=1353577893 start=1353577895
>                            owner=smahajan at metis.univ.run
>                            exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
>                            Resource_List.mem=8gb
>                            Resource_List.neednodes=1:ppn=8
>                            Resource_List.nodect=1
>                            Resource_List.nodes=1:ppn=8
>                            Resource_List.walltime=24:00:00
> 11/22/2012 13:51:36  M    start_process: task started, tid 7, sid 5594, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:36  M    start_process: task started, tid 8, sid 5667, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 13:51:36  M    start_process: task started, tid 9, sid 5749, cmd
>                            /labos/dsimb/smahajan/fold_pred/PROGRAMS/global_pbsdsh
> 11/22/2012 19:05:07  M    scan_for_terminated: job 446.metis.univ.run task 2
>                            terminated, sid=5479
> 11/22/2012 19:05:07  M    scan_for_terminated: job 446.metis.univ.run task 1
>                            terminated, sid=5452
> 11/22/2012 19:05:07  M    kill_task: killing pid 5487 task 3 gracefully with sig 15
> 11/22/2012 19:05:07  M    kill_task: process (pid=5487/state=Z) after sig 15
> 11/22/2012 19:05:07  M    kill_task: killing pid 5500 task 3 gracefully with sig 15
> 11/22/2012 19:05:07  M    kill_task: process (pid=5500/state=Z) after sig 15
> 11/22/2012 19:05:07  M    kill_task: killing pid 5505 task 4 gracefully with sig 15
> 11/22/2012 19:05:07  M    kill_task: process (pid=5505/state=R) after sig 15
> 11/22/2012 19:05:07  M    kill_task: process (pid=5505/state=Z) after sig 15
> 11/22/2012 19:05:07  M    kill_task: killing pid 5515 task 4 gracefully with sig 15
> 11/22/2012 19:05:07  M    kill_task: process (pid=5515/state=R) after sig 15
> 11/22/2012 19:05:08  M    kill_task: killing pid 5519 task 5 gracefully with sig 15
> 11/22/2012 19:05:08  M    kill_task: process (pid=5519/state=S) after sig 15
> 11/22/2012 19:05:08  M    kill_task: process (pid=5519/state=Z) after sig 15
> 11/22/2012 19:05:08  M    kill_task: killing pid 5537 task 5 gracefully with sig 15
> 11/22/2012 19:05:08  M    kill_task: process (pid=5537/state=R) after sig 15
> 11/22/2012 19:05:08  M    kill_task: killing pid 5545 task 6 gracefully with sig 15
> 11/22/2012 19:05:08  M    kill_task: process (pid=5545/state=R) after sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5545/state=Z) after sig 15
> 11/22/2012 19:05:09  M    kill_task: killing pid 5572 task 6 gracefully with sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5572/state=R) after sig 15
> 11/22/2012 19:05:09  M    kill_task: killing pid 5594 task 7 gracefully with sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5594/state=S) after sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5594/state=Z) after sig 15
> 11/22/2012 19:05:09  M    kill_task: killing pid 5633 task 7 gracefully with sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5633/state=R) after sig 15
> 11/22/2012 19:05:09  M    kill_task: killing pid 5667 task 8 gracefully with sig 15
> 11/22/2012 19:05:09  M    kill_task: process (pid=5667/state=S) after sig 15
> 11/22/2012 19:05:10  M    kill_task: process (pid=5667/state=Z) after sig 15
> 11/22/2012 19:05:10  M    kill_task: killing pid 5715 task 8 gracefully with sig 15
> 11/22/2012 19:05:10  M    kill_task: process (pid=5715/state=R) after sig 15
> 11/22/2012 19:05:10  M    kill_task: killing pid 5749 task 9 gracefully with sig 15
> 11/22/2012 19:05:10  M    kill_task: process (pid=5749/state=S) after sig 15
> 11/22/2012 19:05:10  M    kill_task: process (pid=5749/state=Z) after sig 15
> 11/22/2012 19:05:10  M    kill_task: killing pid 5807 task 9 gracefully with sig 15
> 11/22/2012 19:05:10  M    kill_task: process (pid=5807/state=R) after sig 15
> 11/22/2012 19:05:11  S    Not sending email: User does not want mail of this
>                            type.
> 11/22/2012 19:05:11  S    Exit_status=2
> 11/22/2012 19:05:11  S    dequeuing from pbsdsh, state COMPLETE
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 3
>                            terminated, sid=5487
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 4
>                            terminated, sid=5505
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 5
>                            terminated, sid=5519
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 6
>                            terminated, sid=5545
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 7
>                            terminated, sid=5594
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 8
>                            terminated, sid=5667
> 11/22/2012 19:05:11  M    scan_for_terminated: job 446.metis.univ.run task 9
>                            terminated, sid=5749
> 11/22/2012 19:05:11  M    obit sent to server
> 11/22/2012 19:05:11  S    on_job_exit valid pjob: 0x7f9a7c0ae6e0 (substate=50)
> 11/22/2012 19:05:11  M    removed job script
> 11/22/2012 19:05:11  A    user=smahajan group=DSIMB jobname=test_20_40
>                            queue=pbsdsh ctime=1353577893 qtime=1353577893
>                            etime=1353577893 start=1353577895
>                            owner=smahajan at metis.univ.run
>                            exec_host=metis.univ.run/12+metis.univ.run/11+metis.univ.run/10+metis.univ.run/9+metis.univ.run/8+metis.univ.run/7+metis.univ.run/6+metis.univ.run/5
>                            Resource_List.mem=8gb
>                            Resource_List.neednodes=1:ppn=8
>                            Resource_List.nodect=1
>                            Resource_List.nodes=1:ppn=8
>                            Resource_List.walltime=24:00:00 session=5452
>                            end=1353596711 Exit_status=2
>
> Regards,
> Delphine