[torqueusers] Torque 4.1.4: Deadlock in job creation

David Beer dbeer at adaptivecomputing.com
Fri Feb 15 15:27:17 MST 2013


Jörg,

I was able to reproduce and fix the first deadlock with the following
check-ins:

8d51275..ef452c8  4.1-dev -> 4.1-dev
4b90e69..aa5f8d5  master -> master

Can you provide more details on how to reproduce the second one?

David

On Thu, Feb 7, 2013 at 2:40 PM, Joerg Blank <j.blank at fz-juelich.de> wrote:

> Hello,
>
> I found another deadlock, this time when a job gets deleted. I was not
> able to pinpoint the offending lock.
>
> Regards,
> Jörg Blank
>
>
> (gdb) info threads
>   19 Thread 13950  0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
>   18 Thread 14161  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   17 Thread 14160  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   16 Thread 14159  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   15 Thread 14158  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   14 Thread 14157  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   13 Thread 14156  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   12 Thread 14155  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   11 Thread 14154  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   10 Thread 14153  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   9 Thread 14152  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   8 Thread 14151  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   7 Thread 14150  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   6 Thread 14149  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   5 Thread 14148  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>   4 Thread 14145  0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
>   3 Thread 14144  0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
>   2 Thread 14143  0x00007f3ef99a538d in accept () at
> ../sysdeps/unix/syscall-template.S:82
> * 1 Thread 14142  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>
> (gdb) thread 17
> [Switching to thread 17 (Thread 14160)]#0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 136     in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> (gdb) bt
> #0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> #1  0x00007f3ef99a0179 in _L_lock_953 () from /lib/libpthread.so.0
> #2  0x00007f3ef999ff9b in __pthread_mutex_lock (mutex=0x5729cb0) at
> pthread_mutex_lock.c:61
> #3  0x0000000000448760 in lock_ji_mutex (pjob=0x5731100, id=Unhandled
> dwarf expression opcode 0xf3
> ) at svr_jobfunc.c:2863
> #4  0x0000000000411370 in remove_job (aj=0xa972c0, pjob=0x5731100) at
> job_func.c:2562
> #5  0x0000000000449aac in svr_dequejob (pjob=0x5731100,
> parent_queue_mutex_held=0) at svr_jobfunc.c:758
> #6  0x0000000000411e17 in svr_job_purge (pjob=0x5731100) at job_func.c:1776
> #7  0x000000000042c5c8 in handle_complete_second_time (ptask=Unhandled
> dwarf expression opcode 0xf3
> ) at req_jobobit.c:1800
> #8  0x000000000045acd2 in work_thread (a=0x7fff14ff7180) at
> u_threadpool.c:307
> #9  0x00007f3ef999d8ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #10 0x00007f3ef94fcb6d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #11 0x0000000000000000 in ?? ()
> (gdb) print *(pthread_mutex_t*)0x5729cb0
> $4 = {__data = {__lock = 2, __count = 0, __owner = 14154, __nusers = 0,
> __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>   __size = "\002\000\000\000\000\000\000\000J7", '\000' <repeats 29
> times>, __align = 2}
>
> (gdb) thread 11
> [Switching to thread 11 (Thread 14154)]#0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 136     in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> (gdb) bt
> #0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> #1  0x00007f3ef99a0179 in _L_lock_953 () from /lib/libpthread.so.0
> #2  0x00007f3ef999ff9b in __pthread_mutex_lock (mutex=0x51cde90) at
> pthread_mutex_lock.c:61
> #3  0x000000000044a864 in lock_alljobs_mutex (aj=0xa972c0, id=Unhandled
> dwarf expression opcode 0xf3
> ) at svr_jobfunc.c:3017
> #4  0x0000000000410aee in find_job_by_array (aj=0xa972c0,
> job_id=0x7f3ee41f99a0 "30979[34].glorim-1.cluster", get_subjob=1) at
> job_func.c:2140
> #5  0x0000000000410e16 in svr_find_job (jobid=0x7f3ee41f99a0
> "30979[34].glorim-1.cluster", get_subjob=1) at job_func.c:2245
> #6  0x000000000042c53a in handle_complete_second_time
> (ptask=0x7f3ee40361d0) at req_jobobit.c:1765
> #7  0x000000000045acd2 in work_thread (a=0x7fff14ff7180) at
> u_threadpool.c:307
> #8  0x00007f3ef999d8ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #9  0x00007f3ef94fcb6d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #10 0x0000000000000000 in ?? ()
> (gdb) print *(pthread_mutex_t*)0x51cde90
> $5 = {__data = {__lock = 2, __count = 0, __owner = 14160, __nusers = 1,
> __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>   __size = "\002\000\000\000\000\000\000\000P7\000\000\001", '\000'
> <repeats 26 times>, __align = 2}
>
>

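A note on the two traces quoted above, read side by side: thread 17
(LWP 14160) is blocked in lock_ji_mutex() on mutex 0x5729cb0, whose
__owner is 14154, while thread 11 (LWP 14154) is blocked in
lock_alljobs_mutex() on mutex 0x51cde90, whose __owner is 14160. Each
thread appears to hold the lock the other one is waiting for, which is
the usual signature of a lock-order inversion.

The sketch below is not the Torque code and not the fix that was checked
in. It is only a minimal C illustration of one common way to avoid this
pattern, assuming a fixed order of "container lock first, job lock
second". The names demo_locks, demo_init and
demo_lock_container_holding_job are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two kinds of lock in the backtraces. */
typedef struct {
    pthread_mutex_t alljobs_mutex; /* container lock, cf. lock_alljobs_mutex() */
    pthread_mutex_t job_mutex;     /* per-job lock, cf. lock_ji_mutex() */
} demo_locks;

/* Initialise both mutexes; returns 0 on success or a pthread error code. */
int demo_init(demo_locks *d)
{
    int rc = pthread_mutex_init(&d->alljobs_mutex, NULL);
    if (rc != 0)
        return rc;
    rc = pthread_mutex_init(&d->job_mutex, NULL);
    if (rc != 0)
        pthread_mutex_destroy(&d->alljobs_mutex);
    return rc;
}

/*
 * The caller already holds job_mutex and also needs alljobs_mutex.
 * Blocking on alljobs_mutex here would be the inverted order, so:
 *   1. try the container lock without blocking;
 *   2. if it is busy, drop the job lock and take both locks in the
 *      agreed order (container first, then job).
 * Returns true if the job lock was released along the way, in which
 * case the caller must re-check that the job is still valid.
 */
bool demo_lock_container_holding_job(demo_locks *d)
{
    int rc;

    if (pthread_mutex_trylock(&d->alljobs_mutex) == 0)
        return false; /* got it without ever letting go of the job */

    rc = pthread_mutex_unlock(&d->job_mutex);
    assert(rc == 0);
    rc = pthread_mutex_lock(&d->alljobs_mutex);
    assert(rc == 0);
    rc = pthread_mutex_lock(&d->job_mutex);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */
    return true;
}
```

When the helper returns true, the job was unlocked for a moment, so
another thread may have changed or deleted it in between. Any state read
before the call has to be looked up again before it is used.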


-- 
David Beer | Senior Software Engineer
Adaptive Computing