[torqueusers] Torque 4.1.4: Deadlock in job creation
Joerg Blank
j.blank at fz-juelich.de
Thu Feb 7 06:31:12 MST 2013
Hello everyone,
I recently found a deadlock in job creation. It has a chance to occur
when a large array job is submitted and in the process of being created
and a second job is submitted with a dependency on the first, before its
creation is finished.
I attached the gdb output of my analysis at the end of this email or
under http://pastie.org/6087081 without line breaks.
Regards,
Jörg Blank
(gdb) info threads
19 Thread 16091 0x00007fbdc2350c5d in nanosleep () at
../sysdeps/unix/syscall-template.S:82
18 Thread 16120 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
17 Thread 16119 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
16 Thread 16118 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
15 Thread 16117 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
14 Thread 16116 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
13 Thread 16115 0x00007fbdc2374c13 in *__GI___poll (fds=<value
optimized out>, nfds=<value optimized out>, timeout=120000) at
../sysdeps/unix/sysv/linux/poll.c:87
12 Thread 16114 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
11 Thread 16113 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
10 Thread 16112 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
9 Thread 16111 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
8 Thread 16110 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
7 Thread 16109 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
6 Thread 16108 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
5 Thread 16107 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
4 Thread 16104 0x00007fbdc2350c5d in nanosleep () at
../sysdeps/unix/syscall-template.S:82
3 Thread 16103 0x00007fbdc2350c5d in nanosleep () at
../sysdeps/unix/syscall-template.S:82
2 Thread 16102 0x00007fbdc282838d in accept () at
../sysdeps/unix/syscall-template.S:82
* 1 Thread 16099 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
Thread 15 and 8 are in a deadlock. This occurs when a array job is in
the process of being generated and a second job is submitted with a
dependency on the first job is submitted.
(gdb) thread 15
[Switching to thread 15 (Thread 16117)]#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007fbdc2823179 in _L_lock_953 () from /lib/libpthread.so.0
#2 0x00007fbdc2822f9b in __pthread_mutex_lock (mutex=0x70cd800) at
pthread_mutex_lock.c:61
#3 0x000000000040a6b4 in is_array (id=Unhandled dwarf expression opcode
0xf3
) at array_func.c:137
#4 0x0000000000425e3c in dispatch_request (sfds=65535,
request=0x7fbd94048ff0) at process_request.c:927
#5 0x000000000040eaac in issue_to_svr (servern=Unhandled dwarf
expression opcode 0xf3
) at issue_request.c:324
#6 0x000000000043a713 in send_depend_req (pjob=0x7c876d0,
pparent=0x7fbd94043160, type=12, op=1, schedhint=0, postfunc=0x439d40
<post_doq>) at req_register.c:2262
#7 0x000000000043bbb9 in depend_on_que (pattr=Unhandled dwarf
expression opcode 0xf3
) at req_register.c:1237
#8 0x000000000044a5c5 in svr_enquejob (pjob=0x7c876d0,
has_sv_qs_mutex=0, prev_job_index=65) at svr_jobfunc.c:575
#9 0x00000000004121fc in job_clone_wt (cloned_id=Unhandled dwarf
expression opcode 0xf3
) at job_func.c:1153
#10 0x000000000045acd2 in work_thread (a=0x7fff7fa38120) at
u_threadpool.c:307
#11 0x00007fbdc28208ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#12 0x00007fbdc237fb6d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#13 0x0000000000000000 in ?? ()
(gdb) print *(pthread_mutex_t*)0x70cd800
$2 = {__data = {__lock = 2, __count = 0, __owner = 16110, __nusers = 1,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000\356>\000\000\001", '\000'
<repeats 26 times>, __align = 2}
Lock is hold by thread 8
(gdb) thread 8
[Switching to thread 8 (Thread 16110)]#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007fbdc2823179 in _L_lock_953 () from /lib/libpthread.so.0
#2 0x00007fbdc2822f9b in __pthread_mutex_lock (mutex=0x7fbdb02858b0) at
pthread_mutex_lock.c:61
#3 0x000000000044a61d in lock_ai_mutex (pa=0x7fbdb02103f0, id=Unhandled
dwarf expression opcode 0xf3
) at svr_jobfunc.c:2940
#4 0x000000000040a77c in get_array (id=Unhandled dwarf expression
opcode 0xf3
) at array_func.c:164
#5 0x000000000043a342 in req_registerarray (preq=0x7fbdb82e0170) at
req_register.c:718
#6 0x0000000000425e4c in dispatch_request (sfds=65535,
request=0x7fbdb82e0170) at process_request.c:929
#7 0x000000000040eaac in issue_to_svr (servern=Unhandled dwarf
expression opcode 0xf3
) at issue_request.c:324
#8 0x000000000043a713 in send_depend_req (pjob=0x7fbdb82e32e0,
pparent=0x7fbdb80a5ed0, type=12, op=1, schedhint=0, postfunc=0x439d40
<post_doq>) at req_register.c:2262
#9 0x000000000043bbb9 in depend_on_que (pattr=Unhandled dwarf
expression opcode 0xf3
) at req_register.c:1237
#10 0x000000000044a5c5 in svr_enquejob (pjob=0x7fbdb82e32e0,
has_sv_qs_mutex=0, prev_job_index=-1) at svr_jobfunc.c:575
#11 0x0000000000438cf4 in req_commit (preq=0x7fbdb81bfb80) at
req_quejob.c:2141
#12 0x0000000000426119 in dispatch_request (sfds=11,
request=0x7fbdb81bfb80) at process_request.c:741
#13 0x000000000042687e in process_request (chan=Unhandled dwarf
expression opcode 0xf3
) at process_request.c:662
#14 0x0000000000422da8 in process_pbs_server_port (sock=11,
is_scheduler_port=0) at pbsd_main.c:410
#15 0x0000000000422e9a in start_process_pbs_server_port
(new_sock=Unhandled dwarf expression opcode 0xf3
) at pbsd_main.c:541
#16 0x000000000045acd2 in work_thread (a=0x7fff7fa38120) at
u_threadpool.c:307
#17 0x00007fbdc28208ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#18 0x00007fbdc237fb6d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x0000000000000000 in ?? ()
(gdb) print *(pthread_mutex_t*)0x7fbdb02858b0
$3 = {__data = {__lock = 2, __count = 0, __owner = 16117, __nusers = 1,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000\365>\000\000\001", '\000'
<repeats 26 times>, __align = 2}
Lock is hold by thread 15
Thread 15 holds the ai lock. I do not know where the lock is aquired,
but it is release in job_clone_wt shortly after all jobs are enqueued.
(gdb) thread 8
[Switching to thread 8 (Thread 16110)]#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) frame 4
#4 0x000000000040a77c in get_array (id=Unhandled dwarf expression
opcode 0xf3
) at array_func.c:164
164 lock_ai_mutex(pa, __func__, NULL, LOGLEVEL);
Current language: auto
The current source language is "auto; currently c".
(gdb) print pa
$4 = (job_array *) 0x7fbdb02103f0
(gdb) thread 15
[Switching to thread 15 (Thread 16117)]#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file
or directory.
in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
Current language: auto
The current source language is "auto; currently asm".
(gdb) frame 9
#9 0x00000000004121fc in job_clone_wt (cloned_id=Unhandled dwarf
expression opcode 0xf3
) at job_func.c:1153
1153 if ((rc = svr_enquejob(pjobclone, FALSE, prev_index)))
Current language: auto
The current source language is "auto; currently c".
(gdb) print pa
$5 = (job_array *) 0x7fbdb02103f0
More information about the torqueusers
mailing list