[Mauiusers] Possible Memory Corruption in maui

Dr. Stephan Raub raub at uni-duesseldorf.de
Tue Nov 8 08:40:45 MST 2011


Dear fellow maui users,

We are running Maui 3.3.1 with TORQUE 2.3.7 under RHEL 5.5
(2.6.8-194.26.1.el1) on a cluster of roughly 600 cores.

The Maui scheduler died suddenly without writing any message to its logs.
We could not find a cause, so we attached strace to the maui process while
it was still alive and captured the following:

select(1024, [9], NULL, NULL, {0, 10000}) = 0 (Timeout)

accept(6, 0x7fff9bda9210, [11230449699255222288]) = -1 EAGAIN (Resource
temporarily unavailable)

select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)

stat("/var/log/maui.log", {st_mode=S_IFREG|0640, st_size=9598210, ...}) = 0

write(3, "11/08 16:20:49 MPBSClusterQuery("..., 45) = 45

write(8, "+2+12+58+4root+0+0+0", 20)    = 20
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=8,
revents=POLLIN}])
fcntl(8, F_GETFL)                       = 0x2 (flags O_RDWR)
read(8, "+2+1+0+0+63+115+3+6gauss1+32+11+"..., 262144) = 62514
open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or
address)
writev(2, [{"*** glibc detected *** ", 23}, {"/usr/local/maui/sbin/maui",
25}, {": ", 2}, {"malloc(): memory corruption", 27}, {": 0x", 4},
{"0000000012bbff80", 16}, {" ***\n", 5}], 7) = 102
open("/usr/local/torque/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such
file or directory)
open("tls/x86_64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or
directory)
open("tls/libgcc_s.so.1", O_RDONLY)     = -1 ENOENT (No such file or
directory)
open("x86_64/libgcc_s.so.1", O_RDONLY)  = -1 ENOENT (No such file or
directory)
open("libgcc_s.so.1", O_RDONLY)         = -1 ENOENT (No such file or
directory)
open("/usr/local/torque/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such
file or directory)
open("/usr/local/maui/lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such
file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 10
fstat(10, {st_mode=S_IFREG|0644, st_size=101081, ...}) = 0
mmap(NULL, 101081, PROT_READ, MAP_PRIVATE, 10, 0) = 0x2ae371556000
close(10)                               = 0
open("/lib64/libgcc_s.so.1", O_RDONLY)  = 10
read(10,
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\36\240\0033\0\0\0"..., 832)
= 832
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,
-1, 0) = 0x2ae37156f000
munmap(0x2ae37156f000, 44634112)        = 0
munmap(0x2ae378000000, 22474752)        = 0
mprotect(0x2ae374000000, 135168, PROT_READ|PROT_WRITE) = 0
fstat(10, {st_mode=S_IFREG|0755, st_size=58400, ...}) = 0
mmap(0x3303a00000, 2151784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE,
10, 0) = 0x3303a00000
mprotect(0x3303a0d000, 2097152, PROT_NONE) = 0
mmap(0x3303c0d000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 10, 0xd000) = 0x3303c0d000
close(10)                               = 0
munmap(0x2ae371556000, 101081)          = 0
write(2, "======= Backtrace: =========\n", 29) = 29
writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3300672fae", 10}, {"]\n",
2}], 4) = 31
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_malloc", 13},
{"+0x", 3}, {"6e", 2}, {")", 1}, {"[0x", 3}, {"3300674cde", 10}, {"]\n",
2}], 9) = 51
writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1}, {"disrst",
6}, {"+0x", 3}, {"fd", 2}, {")", 1}, {"[0x", 3}, {"2ae3709c987d", 12},
{"]\n", 2}], 9) = 66
writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1},
{"decode_DIS_attrl", 16}, {"+0x", 3}, {"c4", 2}, {")", 1}, {"[0x", 3},
{"2ae3709caa64", 12}, {"]\n", 2}], 9) = 76
writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1},
{"decode_DIS_replyCmd", 19}, {"+0x", 3}, {"23d", 3}, {")", 1}, {"[0x", 3},
{"2ae3709cb8bd", 12}, {"]\n", 2}], 9) = 80
writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1},
{"PBSD_rdrpy", 10}, {"+0x", 3}, {"80", 2}, {")", 1}, {"[0x", 3},
{"2ae3709cf6d0", 12}, {"]\n", 2}], 9) = 70
writev(2, [{"/usr/local/torque/lib/libtorque."..., 36}, {"(", 1},
{"PBSD_status_get", 15}, {"+0x", 3}, {"26", 2}, {")", 1}, {"[0x", 3},
{"2ae3709d0786", 12}, {"]\n", 2}], 9) = 75
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"49f501", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"45c399", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"45e8c0", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"493054", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"49683d", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"42a266", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"40591d", 6},
{"]\n", 2}], 4) = 36
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17},
{"+0x", 3}, {"f4", 2}, {")", 1}, {"[0x", 3}, {"330061d994", 10}, {"]\n",
2}], 9) = 55
writev(2, [{"/usr/local/maui/sbin/maui", 25}, {"[0x", 3}, {"402c39", 6},
{"]\n", 2}], 4) = 36
write(2, "======= Memory map: ========\n", 29) = 29
open("/proc/self/maps", O_RDONLY)       = 10
read(10, "00400000-004fc000 r-xp 00000000 "..., 1024) = 1024
write(2, "00400000-004fc000 r-xp 00000000 "..., 1024) = 1024
read(10, "0 r-xp 00000000 08:03 18186457  "..., 1024) = 1024
write(2, "0 r-xp 00000000 08:03 18186457  "..., 1024) = 1024
read(10, "_s-4.1.2-20080825.so.1\n3304a0000"..., 1024) = 1024
write(2, "_s-4.1.2-20080825.so.1\n3304a0000"..., 1024) = 1024
read(10, "3 18186474                      "..., 1024) = 1024
write(2, "3 18186474                      "..., 1024) = 1024
read(10, "ae370be5000-2ae370be7000 rw-p 00"..., 1024) = 1024
write(2, "ae370be5000-2ae370be7000 rw-p 00"..., 1024) = 1024
read(10, "ff9bb04000-7fff9bdb4000 rw-p 7ff"..., 1024) = 159
write(2, "ff9bb04000-7fff9bdb4000 rw-p 7ff"..., 159) = 159
read(10, "", 1024)                      = 0
close(10)                               = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(17084, 17084, SIGABRT)           = 0
--- SIGABRT (Aborted) @ 0 (0) ---
Process 17084 detached

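The only error message we see is the glibc report "malloc(): memory
corruption", which appears right after an MPBSClusterQuery() entry is written
to the log. According to the backtrace, the failing malloc() is called from
disrst() in libtorque, reached via PBSD_status_get().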
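
The glibc backtrace in the trace above is hard to read because strace prints
every writev() call as a list of small iovec fragments. A minimal sketch in
Python for reassembling it, assuming the strace output has been saved as text
in the format shown above. The function name reassemble is hypothetical, only
the \n escape is decoded, and strings that strace truncated keep their "..."
marker:

```python
import re

# One iovec entry as strace prints it: {"text", length} or {"text"..., length}
IOVEC = re.compile(r'\{"((?:[^"\\]|\\.)*)"(\.\.\.)?,\s*\d+\}')
# A whole writev(2, [...], n) call; DOTALL lets it span wrapped lines
WRITEV = re.compile(r'writev\(2,\s*\[(.*?)\],\s*\d+\)', re.DOTALL)


def reassemble(strace_text):
    """Join the fragments of every writev() to stderr (fd 2) into plain text."""
    out = []
    for call in WRITEV.finditer(strace_text):
        for entry in IOVEC.finditer(call.group(1)):
            text = entry.group(1).replace("\\n", "\n")  # decode \n only
            if entry.group(2):
                text += "..."  # keep the truncation marker strace added
            out.append(text)
    return "".join(out)


if __name__ == "__main__":
    import sys
    # Usage: python reassemble.py < strace.txt
    sys.stdout.write(reassemble(sys.stdin.read()))
```

Applied to the __libc_malloc frame above, this yields the single line
/lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde].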

Is this a known problem? How can we fix this?

Thank you in advance.
Best regards.
--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | High-Performance-Computing
| | Zentrum für Informations- und Medientechnologie 
| | Heinrich-Heine-Universität Düsseldorf
| | Universitätsstr. 1 / Raum 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
| | Fax: +49-211-811-2539
---------------------------------------------------------
