[torqueusers] sudden pbs_server & pbs_mom segfaults

Dimitris Zilaskos dzila at tassadar.physics.auth.gr
Wed May 27 05:29:56 MDT 2009


Tom Rudwick wrote:
 > I would check for the possibility of corrupted files in the server_priv
 > directory. If you can clear out and recreate the serverdb (configuration)
 > and/or job files on the affected nodes, you could confirm or rule out
 > that cause.
 >
 > Good luck,
 > Tom

Thank you for the hint. I will try it. Today I got more crashes on the 
same nodes.
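
Before clearing anything out of server_priv it is probably worth keeping a copy to compare against or roll back to. Below is a minimal sketch of a timestamped backup. The path /var/spool/pbs/server_priv is an assumption (guessed from the /var/spool/pbs/tmpdir path that shows up in the mom log), and backup_server_priv is a hypothetical helper name, not part of TORQUE. pbs_server should be stopped before running it.

```python
import shutil
import time
from pathlib import Path

# Assumed location of the server's private directory. This default is a
# guess based on the spool path seen in the logs, not a verified setting.
SERVER_PRIV = Path("/var/spool/pbs/server_priv")


def backup_server_priv(src=SERVER_PRIV, dest_root=Path("/root")):
    """Copy src into a new timestamped directory under dest_root.

    Returns the path of the copy. Meant to be run while pbs_server is
    stopped, so files are not rewritten halfway through the copy.
    """
    src = Path(src)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"{src.name}.{stamp}"
    # copytree raises if dest already exists, so an earlier backup
    # cannot be silently overwritten.
    shutil.copytree(src, dest, symlinks=True)
    return dest
```

Calling backup_server_priv() as root returns the directory holding the copy; the serverdb and job files can then be removed and recreated as Tom suggests, with the old state still available for inspection.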

ce01 (pbs_server):

It crashed, but I had forgotten to launch it under gdb. When I did, it 
segfaulted immediately. On the second try it worked.

Program received signal SIGSEGV, Segmentation fault.
0x0806016c in tdelete ()
(gdb) bt
#0 0x0806016c in tdelete ()
#1 0x0805befb in tdelete ()
#2 0x007efc7b in wait_request () from /usr/lib/libtorque.so.2
#3 0x0805aafe in tdelete ()
#4 0x00594de3 in __libc_start_main () from /lib/tls/libc.so.6
#5 0x0804c7f1 in ?? ()

And from a wn (worker node) where pbs_mom crashed:

scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
Detaching after fork from child process 11484.
pbs_mom: Success (0) in obit_reply, DIS_reply_read failed, rc=11 sock=11
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
pbs_mom: Connection refused (111) in scan_for_exiting, cannot bind to 
port 1023 in client_to_svr - connection refused
scan_for_exiting: sending signal 9, "KILL" to job 
97607.ce01.afroditi.hellasgrid.gr, reason: local task termination detected
Detaching after fork from child process 11485.
Detaching after fork from child process 11557.
pbs_mom: wait_request, connection 11 to host 3288020855 has timed out 
out after 900 seconds - closing stale connection


Program received signal SIGSEGV, Segmentation fault.
0x0000000000417585 in ?? ()
(gdb) bt
#0  0x0000000000417585 in ?? ()
#1  0x0000000000418b78 in tfind ()
#2  0x00000000004193f0 in tfind ()
#3  0x000000000041409f in ?? ()
#4  0x0000000000414102 in ?? ()
#5  0x0000003b8601fe54 in wait_request () from /usr/lib64/libtorque.so.2
#6  0x000000000041704f in ?? ()
#7  0x00000000004173dc in ?? ()
#8  0x0000003fb621c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#9  0x00000000004063ca in ?? ()
#10 0x0000007fbfffdca8 in ?? ()
#11 0x000000000000001c in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x0000007fbfffe225 in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb)
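
The "host 3288020855" in the wait_request timeout message above looks like an IPv4 address printed as a plain integer. Here is a small sketch for turning it back into dotted-quad form, to see which machine the stale connection belonged to. That the most significant byte is the first octet is an assumption about how pbs_mom prints the value, and host_int_to_ip is a hypothetical helper name.

```python
import ipaddress


def host_int_to_ip(value):
    """Interpret an unsigned 32-bit integer as a dotted-quad IPv4 address.

    Assumes the most significant byte is the first octet.
    """
    return str(ipaddress.IPv4Address(value))


print(host_int_to_ip(3288020855))  # 195.251.55.119
```

If the resulting address does not belong to any host in the cluster, the opposite byte order may be the right reading.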


05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_all_update_stat;composing status update for server
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[0]: pid 8349 
sid 8348
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 25492 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25493 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25495 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25497 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25499 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25501 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25503 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25505 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25507 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25509 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25511 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25513 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25515 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25517 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25519 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25521 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[0]: pid 8349 
sid 8348
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 25492 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25493 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25495 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25497 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25499 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25501 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25503 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25505 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25507 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25509 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25511 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25513 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25515 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25517 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25519 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 25521 
sid 25491
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[0]: pid 8349 uid 27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25492 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25493 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25495 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25497 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25499 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25501 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25503 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25505 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25507 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25509 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25511 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25513 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25515 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25517 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25519 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 25521 uid 
27003
05/27/2009 13:05:31;0002;   pbs_mom;n/a;totmem;totmem: total mem=6207422464
05/27/2009 13:05:31;0002;   pbs_mom;n/a;availmem;availmem: free 
mem=5907398656
05/27/2009 13:05:31;0008;   pbs_mom;Job;momgetattr;found fs = 
/var/spool/pbs/tmpdir
05/27/2009 13:05:31;0008;   pbs_mom;Job;momgetattr;passing back fs = 
/var/spool/pbs/tmpdir
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "opsys=linux"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "uname=Linux wn017 2.6.9-78.0.22.ELlargesmp #1 SMP Thu Apr 30 19:55:41 CDT 2009 x86_64"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "sessions=8348 25491"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "nsessions=2"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "nusers=1"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "idletime=413677"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "totmem=6061936kb"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "availmem=5768944kb"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "physmem=2053728kb"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "ncpus=4"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "loadave=0.00"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "netload=1219647491162"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "size=68226432kb:72873600kb"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "state=free"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "jobs=96977.ce01.afroditi.hellasgrid.gr"
05/27/2009 13:05:31;0002; 
pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to 
server "varattr= "
05/27/2009 13:05:31;0002;   pbs_mom;n/a;mom_server_update_stat;status 
update successfully sent to ce01.afroditi.hellasgrid.gr
05/27/2009 13:05:51;0080;   pbs_mom;Svr;mom_get_sample;proc_array load 
started
05/27/2009 13:05:51;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded 
- nproc=105
05/27/2009 13:06:14;0008;   pbs_mom;Job;do_rpp;got an inter-server request
05/27/2009 13:06:14;0001;   pbs_mom;Job;is_request;stream 0 version 1
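
To narrow down what pbs_mom was doing just before the crash, the mom log records above can be split into fields and tallied. The following is a rough sketch, assuming each record is on a single line (the mail client wrapped them here, so they would need re-joining first) and has six ';'-separated fields. The field names in LogRecord are my own labels, and parse_line and count_by_name are hypothetical helpers.

```python
from collections import Counter, namedtuple
from datetime import datetime

# Field names are descriptive labels chosen for this sketch.
LogRecord = namedtuple(
    "LogRecord", "when event daemon obj_type obj_name message"
)


def parse_line(line):
    """Split one unwrapped log line into six ';'-separated fields.

    Returns None for lines that do not look like a log record.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    fields = [p.strip() for p in parts[1:]]
    return LogRecord(when, *fields)


def count_by_name(lines):
    """Count records per object name (e.g. 'sessions', 'nusers')."""
    records = (parse_line(line) for line in lines)
    return Counter(r.obj_name for r in records if r is not None)
```

Feeding the log file to count_by_name gives a quick picture of which routines were logging in the last seconds before the segfault, without reading every line by hand.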


Cheers,
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pbserver20090527.txt
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20090527/6dba8fe7/attachment-0001.txt 

