[torqueusers] Torque 4.1.4: Deadlock in job creation
David Beer
dbeer at adaptivecomputing.com
Fri Feb 15 15:27:17 MST 2013
Jörg,
I was able to reproduce and fix the first deadlock with the following
check-ins:
8d51275..ef452c8 4.1-dev -> 4.1-dev
4b90e69..aa5f8d5 master -> master
Can you provide more details on how to reproduce the second one?
David
On Thu, Feb 7, 2013 at 2:40 PM, Joerg Blank <j.blank at fz-juelich.de> wrote:
> Hello,
>
> I found another deadlock, this time when a job gets deleted. I was not
> able to pinpoint the offending lock.
>
> Regards,
> Jörg Blank
>
>
> (gdb) info threads
> 19 Thread 13950 0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
> 18 Thread 14161 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 17 Thread 14160 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 16 Thread 14159 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 15 Thread 14158 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 14 Thread 14157 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 13 Thread 14156 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 12 Thread 14155 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 11 Thread 14154 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 10 Thread 14153 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 9 Thread 14152 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 8 Thread 14151 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 7 Thread 14150 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 6 Thread 14149 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 5 Thread 14148 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 4 Thread 14145 0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
> 3 Thread 14144 0x00007f3ef94cdc5d in nanosleep () at
> ../sysdeps/unix/syscall-template.S:82
> 2 Thread 14143 0x00007f3ef99a538d in accept () at
> ../sysdeps/unix/syscall-template.S:82
> * 1 Thread 14142 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
>
> (gdb) thread 17
> [Switching to thread 17 (Thread 14160)]#0 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 136 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> (gdb) bt
> #0 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> #1 0x00007f3ef99a0179 in _L_lock_953 () from /lib/libpthread.so.0
> #2 0x00007f3ef999ff9b in __pthread_mutex_lock (mutex=0x5729cb0) at
> pthread_mutex_lock.c:61
> #3 0x0000000000448760 in lock_ji_mutex (pjob=0x5731100, id=Unhandled
> dwarf expression opcode 0xf3
> ) at svr_jobfunc.c:2863
> #4 0x0000000000411370 in remove_job (aj=0xa972c0, pjob=0x5731100) at
> job_func.c:2562
> #5 0x0000000000449aac in svr_dequejob (pjob=0x5731100,
> parent_queue_mutex_held=0) at svr_jobfunc.c:758
> #6 0x0000000000411e17 in svr_job_purge (pjob=0x5731100) at job_func.c:1776
> #7 0x000000000042c5c8 in handle_complete_second_time (ptask=Unhandled
> dwarf expression opcode 0xf3
> ) at req_jobobit.c:1800
> #8 0x000000000045acd2 in work_thread (a=0x7fff14ff7180) at
> u_threadpool.c:307
> #9 0x00007f3ef999d8ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #10 0x00007f3ef94fcb6d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #11 0x0000000000000000 in ?? ()
> (gdb) print *(pthread_mutex_t*)0x5729cb0
> $4 = {__data = {__lock = 2, __count = 0, __owner = 14154, __nusers = 0,
> __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = "\002\000\000\000\000\000\000\000J7", '\000' <repeats 29
> times>, __align = 2}
>
> (gdb) thread 11
> [Switching to thread 11 (Thread 14154)]#0 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> 136 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> (gdb) bt
> #0 __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
> #1 0x00007f3ef99a0179 in _L_lock_953 () from /lib/libpthread.so.0
> #2 0x00007f3ef999ff9b in __pthread_mutex_lock (mutex=0x51cde90) at
> pthread_mutex_lock.c:61
> #3 0x000000000044a864 in lock_alljobs_mutex (aj=0xa972c0, id=Unhandled
> dwarf expression opcode 0xf3
> ) at svr_jobfunc.c:3017
> #4 0x0000000000410aee in find_job_by_array (aj=0xa972c0,
> job_id=0x7f3ee41f99a0 "30979[34].glorim-1.cluster", get_subjob=1) at
> job_func.c:2140
> #5 0x0000000000410e16 in svr_find_job (jobid=0x7f3ee41f99a0
> "30979[34].glorim-1.cluster", get_subjob=1) at job_func.c:2245
> #6 0x000000000042c53a in handle_complete_second_time
> (ptask=0x7f3ee40361d0) at req_jobobit.c:1765
> #7 0x000000000045acd2 in work_thread (a=0x7fff14ff7180) at
> u_threadpool.c:307
> #8 0x00007f3ef999d8ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #9 0x00007f3ef94fcb6d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #10 0x0000000000000000 in ?? ()
> (gdb) print *(pthread_mutex_t*)0x51cde90
> $5 = {__data = {__lock = 2, __count = 0, __owner = 14160, __nusers = 1,
> __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = "\002\000\000\000\000\000\000\000P7\000\000\001", '\000'
> <repeats 26 times>, __align = 2}
>
>
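Reading the two backtraces together, the mutex values suggest a lock-order inversion. Thread 17 (LWP 14160) is blocked in lock_ji_mutex on 0x5729cb0, whose __owner is 14154. Thread 11 (LWP 14154) is blocked in lock_alljobs_mutex on 0x51cde90, whose __owner is 14160. Each thread appears to hold the lock the other is waiting for. The backtraces do not show where each lock was first acquired, so this is an interpretation of the printed fields, not a confirmed diagnosis. A minimal sketch of the check, with the thread IDs and mutex addresses copied from the gdb output (the helper name find_wait_cycle is hypothetical, not part of Torque or gdb):

```python
# Wait-for data copied from the gdb session quoted above.
# waiting_on: thread (LWP) -> address of the mutex it is blocked on
waiting_on = {
    14160: 0x5729CB0,  # thread 17, blocked in lock_ji_mutex (called from remove_job)
    14154: 0x51CDE90,  # thread 11, blocked in lock_alljobs_mutex (called from find_job_by_array)
}

# owner_of: mutex address -> __owner field printed by gdb
owner_of = {
    0x5729CB0: 14154,
    0x51CDE90: 14160,
}


def find_wait_cycle(start, waiting_on, owner_of):
    """Follow thread -> mutex -> owner edges starting at `start`.

    Returns the list of threads forming a cycle, or None if the chain
    ends at a thread that is not itself waiting on a known mutex.
    Hypothetical helper for illustration only.
    """
    path = []
    thread = start
    while thread is not None and thread not in path:
        path.append(thread)
        mutex = waiting_on.get(thread)   # None if the thread is not blocked
        thread = owner_of.get(mutex)     # None if the owner is unknown
    if thread is None:
        return None
    return path[path.index(thread):]


cycle = find_wait_cycle(14160, waiting_on, owner_of)
print(cycle)
```

With the values above, the walk goes 14160 -> 0x5729cb0 -> 14154 -> 0x51cde90 -> 14160, which closes the loop. If that reading is right, the usual remedies are to take the two locks in one consistent order everywhere, or to release the job mutex before taking the alljobs mutex.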
--
David Beer | Senior Software Engineer
Adaptive Computing