Looks like there is a double free() inside pbs_server. Valgrind shows the following sequence:
{{{
==4857== Invalid write of size 4
==4857==    at 0x4C23E5A: decode_DIS_replySvr (dec_rpys.c:133)
==4857==    by 0x40A34F: DIS_reply_read (dis_read.c:457)
==4857==    by 0x40B8CA: process_Dreply (issue_request.c:717)
==4857==    by 0x4C3B257: wait_request (net_server.c:507)
==4857==    by 0x41C266: main_loop (pbsd_main.c:1186)
==4857==    by 0x41CF4F: main (pbsd_main.c:1741)
==4857==  Address 0x5631618 is 1,144 bytes inside a block of size 6,416 free'd
==4857==    at 0x4A05A31: free (vg_replace_malloc.c:325)
==4857==    by 0x41F583: free_br (process_request.c:1343)
==4857==    by 0x4202AA: reply_send (reply_send.c:308)
==4857==    by 0x4202F4: reply_ack (reply_send.c:334)
==4857==    by 0x42AA1D: req_modifyjob (req_modify.c:677)
==4857==    by 0x41EF44: dispatch_request (process_request.c:848)
==4857==    by 0x41ED94: process_request (process_request.c:695)
==4857==    by 0x4C3B257: wait_request (net_server.c:507)
==4857==    by 0x41C266: main_loop (pbsd_main.c:1186)
==4857==    by 0x41CF4F: main (pbsd_main.c:1741)
}}}
So, a block was freed by free_br(), but was subsequently touched by process_Dreply() and other DIS routines.
I can also run the same code inside GDB; it goes a bit further, and the following dump of the request can be obtained:
{{{
(gdb) bt
#0  0x0000003e64830265 in raise () from /lib64/libc.so.6
#1  0x0000003e64831d10 in abort () from /lib64/libc.so.6
#2  0x0000003e6486a84b in __libc_message () from /lib64/libc.so.6
#3  0x0000003e648722ef in _int_free () from /lib64/libc.so.6
#4  0x0000003e6487273b in free () from /lib64/libc.so.6
#5  0x000000000041f584 in free_br (preq=0xa03b8e0) at process_request.c:1343
#6  0x00000000004202ab in reply_send (request=0xa03b8e0) at reply_send.c:308
#7  0x00000000004202f5 in reply_ack (preq=0xa03b8e0) at reply_send.c:334
#8  0x000000000042a126 in post_modify_req (pwt=0x9d3a580) at req_modify.c:194
#9  0x00000000004414d8 in dispatch_task (ptask=0x9d3a580) at svr_task.c:206
#10 0x000000000040b900 in process_Dreply (sock=13) at issue_request.c:727
#11 0x0000003206827258 in wait_request (waittime=1, SState=0x72fe38) at ../Libnet/net_server.c:507
#12 0x000000000041c267 in main_loop () at pbsd_main.c:1186
#13 0x000000000041cf50 in main (argc=2, argv=0x7ffff82aa868) at pbsd_main.c:1741
(gdb) fr
#6  0x00000000004202ab in reply_send (request=0xa03b8e0) at reply_send.c:308
308 free_br(request);
(gdb) print *request
$4 = {rq_link = {ll_prior = 0xa03b8e0, ll_next = 0xa03b8e0, ll_struct = 0xa03b8d0}, rq_type = 168016080, rq_perm = 0, rq_fromsvr = 0, rq_conn = 11, rq_orgconn = 11, rq_extsz = 0, rq_time = 1280900857, rq_user = "root", '\0' , rq_host = "shed.core.kiae", '\0' , rq_XXXX = 0, rq_extra = 0x0, rq_noreply = 0, rq_extend = 0x0, rq_reply = {brp_code = 0, brp_auxcode = 0, brp_choice = 1, brp_un = { brp_jid = '\0' , brp_select = 0x0, brp_status = { ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}, brp_statc = 0x0, brp_txt = {brp_txtlen = 0, brp_str = 0x0}, brp_locate = '\0' , brp_rescq = {brq_number = 0, brq_avail = 0x0, brq_alloc = 0x0, brq_resvd = 0x0, brq_down = 0x0}}}, rq_ind = {rq_authen = {rq_port = 2}, rq_connect = 2, rq_queuejob = { rq_destin = 
"\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_jid = '\0' , " Å\003\n\000\000\000\000 Å\003\n", '\0' , "±e\000\000\000\000\000\0008*µd>\000\000\0008*µd>", '\0' , rq_attr = {ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}}, rq_jobcred = {rq_type = 2, rq_size = 3330473738518933553, rq_data = 0x6972672e64656873
}, rq_jobfile = {rq_sequence = 2, rq_type = 2, rq_size = 3330473738518933553, rq_jobid = "shed.grid.kiae.ru", '\0' , " Å\003\n\000\000", rq_data = 0xa03c5a0 " Å\003\n"}, rq_rdytocommit = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_commit = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_delete = {rq_cmd = 2, rq_objtype = 2, rq_objname = "1006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}, rq_hold = {rq_orig = {rq_cmd = 2, rq_objtype = 2, rq_objname = "1006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}, rq_hpref = 0}, rq_locate = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_manager = {rq_cmd = 2, rq_objtype = 2, rq_objname = "1006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}, rq_message = {rq_file = 2, rq_jid = "\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_text = 0xa03c5a0 " Å\003\n"}, rq_modify = {rq_cmd = 2, rq_objtype = 2, rq_objname = "1006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}, rq_move = { rq_jid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_destin = "\000\000\000\000\000\000\000\000\000 Å\003\n\000\000\000\000 Å\003\n", '\0' , "±e\000\000\000\000\000\0008*µd>\000\000\0008*µd>", '\0' }, rq_register = { rq_owner = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", rq_svr = '\0' , " Å", rq_parent = "\003\n\000\000\000\000 Å\003\n", '\0' , "±e\000\000\000\000\000\0008*µd>\000\000\0008*µd>", '\0' , rq_child = '\0' , rq_dependtype = 0, rq_op = 0, rq_cost = 0}, rq_release = {rq_cmd = 2, rq_objtype = 2, rq_objname = "1006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}, rq_rerun = 
"\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_rescq = {rq_rhandle = 2, rq_num = 2, rq_list = 0x2e38383836303031}, rq_run = { rq_jid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_destin = 0x0, rq_resch = 168019360}, rq_select = { ll_prior = 0x200000002, ll_next = 0x2e38383836303031, ll_struct = 0x6972672e64656873}, rq_shutdown = 2, rq_signal = { rq_jid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_signame = "\000\000\000\000\000\000\000\000\000 Å\003\n\000\000\000"}, rq_status = { rq_id = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_attr = {ll_prior = 0x0, ll_next = 0xa03c5a0, ll_struct = 0xa03c5a0}}, rq_track = {rq_hopcount = 2, rq_jid = "\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_location = "\000\000\000\000\000 Å\003\n\000\000\000\000 Å\003\n", '\0' , "±e\000\000\000\000\000\0008*µd>\000\000\0008*µd>", '\0' , rq_state = "\000"}, rq_cpyfile = { rq_jobid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_owner = "\000\000\000\000\000\000\000\000\000 Å\003\n\000\000\000\000 Å\003\n", '\0' , rq_user = '\0' , rq_group = '\0' , rq_dir = 0, rq_pair = { ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}}, rq_returnfiles = { rq_jobid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_return_stdout = 0, rq_return_stderr = 0}, rq_jobobit = { rq_jid = "\002\000\000\000\002\000\000\0001006888.shed.grid.kiae.ru", '\0' , rq_status = 0, rq_attr = {ll_prior = 0xa03c5a0, ll_next = 0xa03c5a0, ll_struct = 0x0}}}}
}}}
The request is somewhat damaged, because it has a wrong rq_type, but maybe some bits of information can still be gathered from it.
Legal free
==========

Let me guess what was done in the first block:
{{{
==4857==    at 0x4A05A31: free (vg_replace_malloc.c:325)
==4857==    by 0x41F583: free_br (process_request.c:1343)
==4857==    by 0x4202AA: reply_send (reply_send.c:308)
==4857==    by 0x4202F4: reply_ack (reply_send.c:334)
==4857==    by 0x42AA1D: req_modifyjob (req_modify.c:677)
==4857==    by 0x41EF44: dispatch_request (process_request.c:848)
==4857==    by 0x41ED94: process_request (process_request.c:695)
==4857==    by 0x4C3B257: wait_request (net_server.c:507)
==4857==    by 0x41C266: main_loop (pbsd_main.c:1186)
==4857==    by 0x41CF4F: main (pbsd_main.c:1741)
}}}
 - wait_request() was processing the request from one of the clients; cn_func for the connection was process_request(), so the client's socket had just been added by accept_conn().
 - process_request() is called as the socket connection handler; here a new request is allocated by alloc_br() and inserted into the svr_requests list. It does the full processing of an initial request and calls dispatch_request() at the end.
 - dispatch_request() determines the request type (in our case it is ModifyJob) and calls the appropriate handler; ours is req_modifyjob().
 - req_modifyjob() does some checks and invokes modify_job(). It returns zero, so reply_ack() is called on our request.
 - reply_ack() fills some data inside request->rq_reply and calls reply_send().
 - reply_send() does its job (presumably we're handling a remote connection, but I can't definitely confirm it) and calls free_br(), which deallocates the request.
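To make the role of cn_func concrete, here is a minimal sketch of a per-socket handler table. It is not the TORQUE code: the struct layout, MAX_CONN, add_conn(), run_ready() and the two demo handlers are simplified stand-ins assumed for illustration; only the names svr_conn and cn_func come from the analysis above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONN 16   /* illustrative size, not TORQUE's */

/* Simplified stand-in for one slot of the server's connection table. */
struct connection {
    int    cn_active;            /* non-zero when the slot is in use */
    void (*cn_func)(int sock);   /* handler invoked when the socket is readable */
};

static struct connection svr_conn[MAX_CONN];

/* Demo handlers standing in for process_request() and process_Dreply(). */
static int requests_seen = 0;
static int replies_seen  = 0;
static void on_request(int sock) { (void)sock; requests_seen++; }
static void on_reply(int sock)   { (void)sock; replies_seen++; }

/* Register a handler for a socket; returns 0 on success, -1 on bad input. */
static int add_conn(int sock, void (*func)(int))
{
    if (sock < 0 || sock >= MAX_CONN || func == NULL)
        return -1;
    svr_conn[sock].cn_active = 1;
    svr_conn[sock].cn_func   = func;
    return 0;
}

/* Call the handler of every active socket listed in ready[]; return how many ran. */
static int run_ready(const int *ready, int n)
{
    int called = 0;
    for (int i = 0; i < n; i++) {
        int sock = ready[i];
        if (sock >= 0 && sock < MAX_CONN && svr_conn[sock].cn_active) {
            assert(svr_conn[sock].cn_func != NULL);
            svr_conn[sock].cn_func(sock);
            called++;
        }
    }
    return called;
}
```

The point of the sketch is that the same loop drives both stacks seen in the traces: which code runs depends only on the handler stored for the socket.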
Illegal free
============

Now I need to look at the illegal code block that touches already freed memory:
{{{
==4857== Invalid write of size 4
==4857==    at 0x4C23E5A: decode_DIS_replySvr (dec_rpys.c:133)
==4857==    by 0x40A34F: DIS_reply_read (dis_read.c:457)
==4857==    by 0x40B8CA: process_Dreply (issue_request.c:717)
==4857==    by 0x4C3B257: wait_request (net_server.c:507)
==4857==    by 0x41C266: main_loop (pbsd_main.c:1186)
==4857==    by 0x41CF4F: main (pbsd_main.c:1741)
==4857==  Address 0x5631618 is 1,144 bytes inside a block of size 6,416 free'd
}}}
 - wait_request() behaves as in the prior case, but it calls process_Dreply() as the cn_func for the connection. This connection handler is referenced from 5 places:
   * server/node_manager.c:761;
   * server/req_jobobit.c:653;
   * server/req_stat.c:798;
   * server/issue_request.c:175;
   * server/issue_request.c:281.
 - process_Dreply() looks up the request via work_task.wt_parm1. The work_task is extracted from task_list_event: it looks for a task of type WORK_Deferred_Reply whose wt_event is equal to svr_conn[sock].cn_handle. The type WORK_Deferred_Reply can be set from a single point: issue_Drequest(), server/issue_request.c:404; the same routine adds the task to the task_list_event list.

So, we need to understand which entity can call issue_Drequest() with the request that was created and subsequently destroyed in the legal free code block. The function pointer is passed as the 3rd argument to issue_Drequest() and is subsequently saved in work_task.wt_func by set_task(). GDB shows that our wt_func is post_modify_req.
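The lookup described above can be sketched roughly as follows. This is a simplified model, not the real TORQUE structures: the singly linked list, find_deferred_task(), run_deferred() and mark_called() are assumptions made for the example; the field names wt_type, wt_event, wt_func and wt_parm1 are the ones mentioned in the analysis.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; not the real TORQUE definitions. */
enum work_type { WORK_Immed, WORK_Deferred_Reply };

struct work_task {
    struct work_task *wt_next;
    enum work_type    wt_type;
    int               wt_event;                     /* connection handle the task waits on */
    void            (*wt_func)(struct work_task *); /* callback to run when the reply arrives */
    void             *wt_parm1;                     /* the request, stored as a raw pointer */
};

/* Return the first deferred-reply task waiting on `handle`, or NULL. */
static struct work_task *find_deferred_task(struct work_task *head, int handle)
{
    for (struct work_task *t = head; t != NULL; t = t->wt_next) {
        if (t->wt_type == WORK_Deferred_Reply && t->wt_event == handle)
            return t;
    }
    return NULL;
}

/* Demo callback: treats wt_parm1 as an int counter and bumps it. */
static void mark_called(struct work_task *t)
{
    *(int *)t->wt_parm1 += 1;
}

/* Run the task for `handle`; returns 1 if a callback ran, 0 if none matched. */
static int run_deferred(struct work_task *head, int handle)
{
    struct work_task *t = find_deferred_task(head, handle);
    if (t == NULL)
        return 0;
    assert(t->wt_func != NULL);
    t->wt_func(t);   /* the callback reaches the request only through wt_parm1 */
    return 1;
}
```

Nothing in such a task records whether the object behind wt_parm1 is still alive, which is why a request freed elsewhere can still be reached from here.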
Looking for issue_Drequest()
============================

Once again, we're looking for the following code path:
{{{
==4857==    at 0x4A05A31: free (vg_replace_malloc.c:325)
==4857==    by 0x41F583: free_br (process_request.c:1343)
==4857==    by 0x4202AA: reply_send (reply_send.c:308)
==4857==    by 0x4202F4: reply_ack (reply_send.c:334)
==4857==    by 0x42AA1D: req_modifyjob (req_modify.c:677)
==4857==    by 0x41EF44: dispatch_request (process_request.c:848)
==4857==    by 0x41ED94: process_request (process_request.c:695)
==4857==    by 0x4C3B257: wait_request (net_server.c:507)
==4857==    by 0x41C266: main_loop (pbsd_main.c:1186)
==4857==    by 0x41CF4F: main (pbsd_main.c:1741)
}}}
We also need the list of callers of issue_Drequest() that pass post_modify_req() as the 3rd argument. And we're lucky: the only function that indirectly calls issue_Drequest() with this handler is modify_job() (the only reference to post_modify_req() is found within this function):
 * modify_job(), server/req_modify.c:433;
 * relay_to_mom(), server/issue_request.c:195;
 * issue_Drequest().

Since we have req_modifyjob() in our call stack, it is rather natural that it is modify_job() that uses our request and pushes it to the list of work tasks. It happens at server/req_modify.c:668.
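The ownership conflict found here (the same request is handed to a deferred task and also replied to and freed immediately) can be modelled in a few lines. The sketch below uses an explicit owner count to show what "freed exactly once" would look like; this is a conventional technique swapped in for illustration, not what TORQUE does and not a proposed patch. struct request, req_alloc(), req_hold(), req_release() and demo_lifecycle() are all hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request with an explicit owner count. */
struct request {
    int refs;
};

static int frees_done = 0;   /* counts real free() calls, for the demo only */

static struct request *req_alloc(void)
{
    struct request *r = malloc(sizeof *r);
    if (r != NULL)
        r->refs = 1;         /* the creator (process_request analogue) owns it */
    return r;
}

/* A second party that stores the pointer must take a reference. */
static void req_hold(struct request *r)
{
    assert(r != NULL && r->refs > 0);
    r->refs++;
}

/* Drop one reference; the memory is freed only when the last owner lets go. */
static void req_release(struct request *r)
{
    assert(r != NULL && r->refs > 0);
    if (--r->refs == 0) {
        free(r);
        frees_done++;
    }
}

/* Replays the sequence from the traces; returns 0 if the request was freed exactly once. */
static int demo_lifecycle(void)
{
    int before = frees_done;
    struct request *r = req_alloc();
    if (r == NULL)
        return -1;

    req_hold(r);        /* deferred task keeps a pointer (issue_Drequest analogue) */
    req_release(r);     /* immediate reply path (reply_send -> free_br analogue) */
    int after_reply = frees_done - before;   /* still 0: the task owns it */

    req_release(r);     /* deferred callback finishes (post_modify_req analogue) */
    int after_task = frees_done - before;    /* 1: freed once, by the last owner */

    return (after_reply == 0 && after_task == 1) ? 0 : 1;
}
```

Without the req_hold() step the first release would free the block while the task still points at it, which matches the Valgrind report (write into a free'd block) followed by the abort inside free() seen in GDB.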