[torquedev] pbs_mom crashing with segfault after pbs_server restart

Ling C. Ho ling at fnal.gov
Wed Mar 25 13:03:53 MDT 2009


We noticed that when pbs_server is restarted with SIGTERM, the pbs_mom processes on our worker 
nodes sometimes die with a segfault (it shows up in /var/log/messages). This usually happens right 
after log entries like these:

03/24/2009 21:26:29;0008;   pbs_mom;Job;do_rpp;got an inter-server request
03/24/2009 21:26:29;0001;   pbs_mom;Job;is_request;stream 0 version 1

I traced it to the mom_server_find_by_ip function, at the line

addr = rpp_getaddr(pms->SStream)

If rpp_getaddr returns NULL here, pbs_mom segfaults, presumably when addr is used afterwards.

With the following check added, I no longer get the segfault, even after a few hundred restarts of 
pbs_server.

       if ((addr = rpp_getaddr(pms->SStream)) == NULL) {
         return (NULL);
       }
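
To illustrate the guard outside of TORQUE, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: sketch_addr, sketch_server, sketch_getaddr and sketch_match are not TORQUE names, and sketch_getaddr only imitates the one property that matters here, namely that an address lookup can return NULL for a stream that is not open. It is not the actual mom_server_find_by_ip code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real TORQUE types. */
struct sketch_addr {
    uint32_t ip;              /* IPv4 address, host byte order */
};

struct sketch_server {
    int SStream;              /* stream index, -1 when no stream is open */
};

#define SKETCH_NSTREAMS 4

static struct sketch_addr sketch_streams[SKETCH_NSTREAMS] = {
    {0x0A000001u}, {0x0A000002u}, {0x0A000003u}, {0x0A000004u}
};

/* Imitates an address lookup that can fail: NULL for an unknown stream. */
static struct sketch_addr *sketch_getaddr(int stream)
{
    if (stream < 0 || stream >= SKETCH_NSTREAMS)
        return NULL;
    return &sketch_streams[stream];
}

/* Returns pms if its stream address equals ip, otherwise NULL. */
static struct sketch_server *sketch_match(struct sketch_server *pms,
                                          uint32_t ip)
{
    struct sketch_addr *addr;

    assert(pms != NULL);

    /* The guard: bail out before addr is dereferenced. */
    if ((addr = sketch_getaddr(pms->SStream)) == NULL)
        return NULL;

    return (addr->ip == ip) ? pms : NULL;
}
```

Without the NULL check, the addr->ip access in sketch_match would dereference a NULL pointer whenever the stream is closed, which is the kind of crash described above.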

Does this look like a potential problem?

