[torqueusers] pbs_mom stuck in loop

David Beer dbeer at adaptivecomputing.com
Fri Jan 6 10:23:23 MST 2012


Mike,

To me it would seem that this kind of loop can only happen as a result of some kind of data corruption (probably the same corruption that is causing the crashes in the other cases). Do you have a core file from those crashes? That's probably the best way to find out what is going wrong.
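
To make that concrete: in the output you posted, the task's ti_jobtask link holds the same address in both ll_prior and ll_next, so each step of the scan resolves back to the same task. Below is a simplified sketch of that traversal pattern; the field names just mirror your gdb dump (ll_prior, ll_next, ll_struct), and it is only an illustration, not the actual scan_non_child_tasks() source.

/* Simplified sketch -- NOT the real scan_non_child_tasks() code.
 * It only shows why a self-referential link keeps the walk from
 * ever terminating. */

struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;   /* object this link is embedded in */
};

struct task_sketch {
    struct list_link  ti_jobtask;  /* link into the job's task list */
    int               ti_status;
};

static void scan_tasks(struct list_link *head)
{
    /* On a healthy list, following ll_next eventually comes back to
     * the list head, whose ll_struct is NULL, and the loop ends. */
    struct task_sketch *task =
        (struct task_sketch *)head->ll_next->ll_struct;

    while (task != NULL)
    {
        /* ... check whether the task's session is still alive ... */

        /* In the corrupted dump, ti_jobtask.ll_next points back at the
         * task's own link, so this always yields the same task again
         * and the loop never exits. */
        task = (struct task_sketch *)task->ti_jobtask.ll_next->ll_struct;
    }
}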

Looking at this further, it occurs to me that 2.5.6 had a pretty serious unprotected-data bug (a thread was added without any kind of locking around the data it shares) that could cause this exact problem. It is highly likely that upgrading will resolve the issue. I'm sorry for the late reply; I've been busy and haven't been monitoring the mailing list as closely as I usually do.
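
In case it helps to visualize the failure mode: with nothing serializing access, two threads can interleave their pointer updates on the same list and leave links pointing back into a single entry, which is consistent with the self-referential link in your dump. The sketch below is a generic illustration with made-up names (it is not the TORQUE source or the actual fix); the point is just that every update to shared list data needs to happen under a lock.

/* Generic illustration of an unprotected-data bug -- made-up names,
 * not the TORQUE code.  Two threads calling insert_after() with no
 * lock can interleave the pointer updates and corrupt prev/next. */

#include <pthread.h>

struct node {
    struct node *prev;
    struct node *next;
};

static struct node head = { &head, &head };   /* empty circular list */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void insert_after(struct node *pos, struct node *n)
{
    pthread_mutex_lock(&list_lock);   /* the "protection" that was missing */
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
    pthread_mutex_unlock(&list_lock);
}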

David

----- Original Message -----
> We have a cluster of about 64 nodes running Scyld Clusterware 5.6.3,
> which ships with Torque 2.5.6, and we are running Maui 3.3.1 as the
> scheduler on top of that. Roughly once a day we see nodes show up as
> down according to Torque even though they otherwise look normal.
> Sometimes we find that pbs_mom has crashed, but other times we find
> that it is still running and appears to be stuck in a loop.
> Yesterday I had a node that was stuck in the loop, and I was able to
> attach gdb to the process to confirm that. I found that it was stuck
> in the function scan_non_child_tasks inside mom_mach.c, and stepping
> forward confirmed that it keeps running the same loop in that
> function. I'm not a programmer, but it looks to me like the linked
> list it is attempting to traverse has the same address for both the
> next and previous tasks no matter how many times we go through the
> loop. Here is some output to demonstrate:
> 
> 3761    in mom_mach.c
> (gdb) p *task
> $2 = {ti_job = 0xa5e6130,
>   ti_jobtask = {ll_prior = 0xa5e5108, ll_next = 0xa5e5108, ll_struct = 0xa5e5100},
>   ti_fd = -1, ti_flags = 0, ti_register = 0,
>   ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130, ll_struct = 0x0},
>   ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148, ll_struct = 0x0},
>   ti_qs = {ti_parentjobid = "199592.mio.mines.edu", '\000' <repeats 1026 times>,
>     ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3,
>     ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = {0 <repeats 16 times>}}}}
> (gdb) n
> 3783    in mom_mach.c
> (gdb) n
> 3761    in mom_mach.c
> (gdb) n
> 3751    in mom_mach.c
> (gdb) n
> 3761    in mom_mach.c
> (gdb) p *task
> $3 = {ti_job = 0xa5e6130,
>   ti_jobtask = {ll_prior = 0xa5e5108, ll_next = 0xa5e5108, ll_struct = 0xa5e5100},
>   ti_fd = -1, ti_flags = 0, ti_register = 0,
>   ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130, ll_struct = 0x0},
>   ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148, ll_struct = 0x0},
>   ti_qs = {ti_parentjobid = "199592.mio.mines.edu", '\000' <repeats 1026 times>,
>     ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3,
>     ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = {0 <repeats 16 times>}}}}
> (gdb) bt full
> #0  scan_non_child_tasks () at mom_mach.c:3761
>         dent = <value optimized out>
>         task = 0xa5e5100
>         job = 0xa5e6130
>         pdir = 0xa62bd10
>         first_time = 0
> #1  0x0000000000416fe9 in main_loop () at mom_main.c:8251
>         myla = 2.4703282292062327e-323
>         tmpTime = <value optimized out>
>         id = "main_loop"
> #2  0x0000000000417221 in main (argc=5, argv=0x7fffa0d03718)
>     at mom_main.c:8406
>         rc = 0
>         tmpFD = <value optimized out>
> (gdb)
> 
> Any thoughts on how we're getting into this state and, better yet,
> how to prevent it?
> 
> Thanks,
> Mike Robbert
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


