[torqueusers] pbs_server crash

christophe bonnaud takyon77 at gmail.com
Wed Dec 8 08:47:02 MST 2010


Hi again,

To be sure of what I am doing, I reinstalled the machine without any
middleware, just a plain PBS system. Server and client are now both running
Scientific Linux 5.5 x86_64 (kernel 2.6.18-194.26.1.el5).

I still get a crash when a user submits a job. If I restart pbs_server
immediately, the job finishes normally.

In the system log I can see messages like:

Dec  8 23:42:07 ce02 PBS_Server: LOG_ERROR::sync_node_jobs, stray job
48.ce02.sdfarm.kr found on wn1038.sdfarm.kr
Dec  8 23:42:07 ce02 PBS_Server: LOG_ERROR::sync_node_jobs, stray job
48.ce02.sdfarm.kr found on wn1038.sdfarm.kr
Dec  8 23:45:51 ce02 kernel: pbs_server[15693]: segfault at 00000003792977a0
rip 00002ab0b635dd14 rsp 00007fffc0a38ca0 error 4
Dec  8 23:45:51 ce02 PBS_Server: LOG_ERROR::Bad file descriptor (9) in
DIS_tcp_setup, invalid file descriptor (1798204786) for socket
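
One observation on the last message, offered as a hint rather than a diagnosis: the "file descriptor" 1798204786 is 0x6B2E6D72, and read as four little-endian bytes (x86_64 is little-endian) it spells the ASCII text "rm.k", which is a fragment of the host names above (sdfarm.kr). That would be consistent with a string having been written over the memory that held the descriptor, although the log alone cannot prove it. A minimal Python sketch of the decoding follows; the helper name int_as_ascii is made up for illustration.

```python
def int_as_ascii(value: int, width: int = 4) -> str:
    """Reinterpret an integer as `width` little-endian bytes and show them as text.

    Bytes outside the printable ASCII range are shown as '.'.
    """
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# The bogus descriptor reported by DIS_tcp_setup in the syslog excerpt.
bad_fd = 1798204786
print(hex(bad_fd))                                  # 0x6b2e6d72
print(int_as_ascii(bad_fd))                         # rm.k
print(int_as_ascii(bad_fd) in "wn1038.sdfarm.kr")   # True
```

The same check can be applied to any other implausible integer that shows up in the logs; a value that decodes to readable text is a common sign of a buffer being overrun or reused.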

If I run pbs_server in gdb, I obtain this error:

(gdb) run
Starting program: /usr/sbin/pbs_server
pbs_server is up
entered spec=wn1038.sdfarm.kr
job allocation debug: 1 requested, 8 svr_clnodes, 1 svr_totnodes
node_spec: wn1038.sdfarm.kr nsn 8, nsnfree 8, nsnshared 0
node_spec: wn1038.sdfarm.kr/0 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/1 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/2 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/3 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/4 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/5 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/6 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/7 inuse 0x0 nprops 3
job allocation debug(2): 1 requested, 1 svr_numnodes
job allocation debug(3): returning 1 requested
allocated node wn1038.sdfarm.kr/0 to job 55.ce02.sdfarm.kr (nsnfree=8)
Detaching after fork from child process 7981.
catch_child caught pid 7981
catch_child found work task found for pid 7981
*** glibc detected *** /usr/sbin/pbs_server: double free or corruption
(!prev): 0x00000000011b42a0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x2aaaab05530f]
/lib64/libc.so.6(cfree+0x4b)[0x2aaaab05576b]
/usr/sbin/pbs_server[0x41f384]
/usr/sbin/pbs_server[0x4200ab]
/usr/sbin/pbs_server[0x4200f5]
/usr/sbin/pbs_server[0x429f26]
/usr/sbin/pbs_server[0x4412d8]
/usr/sbin/pbs_server[0x40b8a0]
/usr/lib64/libtorque.so.2(wait_request+0x264)[0x2aaaaacf1b50]
/usr/sbin/pbs_server[0x41c067]
/usr/sbin/pbs_server[0x41cd50]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x2aaaab000994]
/usr/sbin/pbs_server[0x406639]
======= Memory map: ========
00400000-0045b000 r-xp 00000000 fd:01 491770
/usr/sbin/pbs_server
0065b000-00662000 rw-p 0005b000 fd:01 491770
/usr/sbin/pbs_server
00662000-011ca000 rw-p 00662000 00:00 0
 [heap]
3d83600000-3d8360d000 r-xp 00000000 08:02 459069
/lib64/libgcc_s-4.1.2-20080825.so.1
3d8360d000-3d8380d000 ---p 0000d000 08:02 459069
/lib64/libgcc_s-4.1.2-20080825.so.1
3d8380d000-3d8380e000 rw-p 0000d000 08:02 459069
/lib64/libgcc_s-4.1.2-20080825.so.1
2aaaaaaab000-2aaaaaac7000 r-xp 00000000 08:02 459032
/lib64/ld-2.5.so
2aaaaaac7000-2aaaaaac9000 rw-p 2aaaaaac7000 00:00 0
2aaaaaad1000-2aaaaac8c000 rw-p 2aaaaaad1000 00:00 0
2aaaaacc6000-2aaaaacc7000 r--p 0001b000 08:02 459032
/lib64/ld-2.5.so
2aaaaacc7000-2aaaaacc8000 rw-p 0001c000 08:02 459032
/lib64/ld-2.5.so
2aaaaacc8000-2aaaaacfd000 r-xp 00000000 fd:01 557972
/usr/lib64/libtorque.so.2.0.0
2aaaaacfd000-2aaaaaefd000 ---p 00035000 fd:01 557972
/usr/lib64/libtorque.so.2.0.0
2aaaaaefd000-2aaaaaeff000 rw-p 00035000 fd:01 557972
/usr/lib64/libtorque.so.2.0.0
2aaaaaeff000-2aaaaafe3000 rw-p 2aaaaaeff000 00:00 0
2aaaaafe3000-2aaaab131000 r-xp 00000000 08:02 458762
/lib64/libc-2.5.so
2aaaab131000-2aaaab330000 ---p 0014e000 08:02 458762
/lib64/libc-2.5.so
2aaaab330000-2aaaab334000 r--p 0014d000 08:02 458762
/lib64/libc-2.5.so
2aaaab334000-2aaaab335000 rw-p 00151000 08:02 458762
/lib64/libc-2.5.so
2aaaab335000-2aaaab33c000 rw-p 2aaaab335000 00:00 0
2aaaab33c000-2aaaab346000 r-xp 00000000 08:02 458778
/lib64/libnss_files-2.5.so
2aaaab346000-2aaaab545000 ---p 0000a000 08:02 458778
/lib64/libnss_files-2.5.so
2aaaab545000-2aaaab546000 r--p 00009000 08:02 458778
/lib64/libnss_files-2.5.so
2aaaab546000-2aaaab547000 rw-p 0000a000 08:02 458778
/lib64/libnss_files-2.5.so
7ffffff37000-7ffffffff000 rw-p 7ffffff37000 00:00 0
 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0
 [vdso]

Program received signal SIGABRT, Aborted.
0x00002aaaab013265 in raise () from /lib64/libc.so.6
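
The pbs_server frames in the glibc backtrace above are raw addresses only. One way to make them readable, sketched here in Python, is to collect those addresses and pass them to addr2line from binutils (`-f` prints function names, `-e` names the executable). The function names frame_addresses and addr2line_command are made up for illustration, and the output is only useful if /usr/sbin/pbs_server was built with debug information and has not been stripped.

```python
import re
import shlex
from typing import List

# Matches glibc backtrace frames such as "/usr/sbin/pbs_server[0x41f384]"
# and "/lib64/libc.so.6(cfree+0x4b)[0x2aaaab05576b]".
FRAME_RE = re.compile(
    r"^(?P<obj>/[^\s(\[]+)(?:\([^)]*\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def frame_addresses(backtrace: str, binary: str) -> List[str]:
    """Return the addresses of all frames that belong to `binary`, in order."""
    addrs = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group("obj") == binary:
            addrs.append(m.group("addr"))
    return addrs


def addr2line_command(binary: str, addrs: List[str]) -> str:
    """Build (but do not run) an addr2line command line for the addresses."""
    parts = ["addr2line", "-f", "-e", binary, *addrs]
    return " ".join(shlex.quote(p) for p in parts)


# First five frames copied from the backtrace above.
sample = """\
/lib64/libc.so.6[0x2aaaab05530f]
/lib64/libc.so.6(cfree+0x4b)[0x2aaaab05576b]
/usr/sbin/pbs_server[0x41f384]
/usr/sbin/pbs_server[0x4200ab]
/usr/sbin/pbs_server[0x4200f5]
"""
addrs = frame_addresses(sample, "/usr/sbin/pbs_server")
print(addr2line_command("/usr/sbin/pbs_server", addrs))
```

The sketch only handles frames of the main executable, which the memory map shows loaded at a fixed address (00400000). Frames inside shared libraries such as libtorque.so.2 would first need the library's load address subtracted.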


Can anyone help?

Cheers,

Chris.




On Tue, Dec 7, 2010 at 10:25 PM, christophe bonnaud <takyon77 at gmail.com> wrote:

> Hello,
>
> I am not an expert in torque/pbs so I hope my message will be clear enough.
>
> I have just installed a new torque server/mom compiled from source using
> the commands:
>     ./configure --with-server-home=/var/spool/pbs --prefix=/usr
>     make rpm
>
> The server is running on Scientific Linux 4.6 32-bit (kernel
> 2.6.9-89.31.1.EL.cernsmp) and the client is running on Scientific Linux
> 5.5 64-bit (kernel 2.6.18-194.26.1.el5).
> Of course, server and client were each compiled on the machine they run on.
>
> I use this pbs server with the lcg middleware for a Computing Element.
>
> For the moment only one worker node is used to try to find the problem.
>
> A simple manual job submission works fine, but when a job arrives
> through the grid, pbs_server crashes.
>
> The configuration for pbs is generated automatically by the installation of
> the middleware, but I also tried a basic configuration as follows:
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue alice
> #
> create queue alice
> set queue alice queue_type = Execution
> set queue alice acl_group_enable = True
> set queue alice acl_groups = alice
> set queue alice acl_groups += alicesgm
> set queue alice enabled = True
> set queue alice started = True
> #
> # Create and define queue ops
> #
> create queue ops
> set queue ops queue_type = Execution
> set queue ops acl_group_enable = True
> set queue ops acl_groups = ops
> set queue ops acl_groups += opssgm
> set queue ops enabled = True
> set queue ops started = True
> #
> # Create and define queue dteam
> #
> create queue dteam
> set queue dteam queue_type = Execution
> set queue dteam acl_group_enable = True
> set queue dteam acl_groups = dteam
> set queue dteam acl_groups += dteamsgm
> set queue dteam enabled = True
> set queue dteam started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_host_enable = False
> set server acl_hosts = ce02.sdfarm.kr
> set server managers = root@ce02.sdfarm.kr
> set server operators = root@ce02.sdfarm.kr
> set server default_queue = dteam
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server default_node = lcgpro
> set server node_pack = False
> set server log_level = 7
> set server kill_delay = 10
> set server next_job_number = 204
>
> The nodes file contains only one line:
>
> wn1038.sdfarm.kr np=8 lcgpro ops dteam alice
>
>
> pbs_server logs just before the crash triggered by job 240:
>
>
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request ReadyToCommit on sd=11
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;ready to commit
> job
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type ReadyToCommit on socket 11
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;ready to commit
> job completed
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> Commit from dteam018
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type Commit request received from
> dteam018 at ce02.sdfarm.kr, sock=11
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request Commit on sd=11
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;committing job
> 12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
> job 240.ce02.sdfarm.kr state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN
> (1-11)
> 12/08/2010 15:06:04;0100;PBS_Server;Job;240.ce02.sdfarm.kr;enqueuing into
> dteam, state 1 hop 1
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type Commit on socket 11
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Reply sent for
> request type Commit on socket 11
> 12/08/2010 15:06:04;0040;PBS_Server;Svr;ce02.sdfarm.kr;Scheduler was sent
> the command new
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> Disconnect from dteam018
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> StatusNode from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusNode request received
> from root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request StatusNode on sd=10
> 12/08/2010 15:06:04;0040;PBS_Server;Req;req_stat_node;entered
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type StatusNode on socket 10
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> StatusQueue from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusQueue request received
> from root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request StatusQueue on sd=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type StatusQueue on socket 10
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> StatusJob from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusJob request received
> from root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request StatusJob on sd=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type StatusJob on socket 10
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> ModifyJob from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type ModifyJob request received
> from root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request ModifyJob on sd=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;attr
> Resource_List modified
> 12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
> job 240.ce02.sdfarm.kr state from QUEUED-PRESTAGEIN to QUEUED-PRESTAGEIN
> (1-11)
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Modified at
> request of root at ce02.sdfarm.kr
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type ModifyJob on socket 10
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> RunJob from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type RunJob request received from
> root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request RunJob on sd=10
> 12/08/2010 15:06:04;0040;PBS_Server;Req;set_nodes;allocating nodes for job
> 240.ce02.sdfarm.kr with node expression 'wn1038.sdfarm.kr'
> 12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;entered spec=
> wn1038.sdfarm.kr
> 12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug: 1
> requested, 8 svr_clnodes, 1 svr_totnodes
> 12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug(2):
> 1 requested, 1 svr_numnodes
> 12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug(3):
> returning 1 requested
> 12/08/2010 15:06:04;0040;PBS_Server;Req;add_job_to_node;allocated node
> wn1038.sdfarm.kr/0 to job 240.ce02.sdfarm.kr (nsnfree=8)
> 12/08/2010 15:06:04;0040;PBS_Server;Req;set_nodes;job 240.ce02.sdfarm.kr
> allocated 1 nodes (nodelist=wn1038.sdfarm.kr/0)
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Run at
> request of root at ce02.sdfarm.kr
> 12/08/2010 15:06:04;0040;PBS_Server;Req;relay_to_mom;momaddr=134.75.123.138
> 12/08/2010 15:06:04;0004;PBS_Server;Svr;svr_connect;attempting connect to
> host 134.75.123.138 port 15002
> 12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
> job 240.ce02.sdfarm.kr state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO
> (4-15)
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type RunJob on socket 10
> 12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
> ModifyJob from root
> 12/08/2010 15:06:04;0100;PBS_Server;Req;;Type ModifyJob request received
> from root at ce02.sdfarm.kr, sock=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching
> request ModifyJob on sd=10
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;attr
> Resource_List modified
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Modified at
> request of root at ce02.sdfarm.kr
> 12/08/2010 15:06:04;0040;PBS_Server;Req;relay_to_mom;momaddr=134.75.123.138
> 12/08/2010 15:06:04;0004;PBS_Server;Svr;svr_connect;attempting connect to
> host 134.75.123.138 port 15002
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type ModifyJob on socket 12
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;post_modify_req:
> PBSE_UNKJOBID for job 240.ce02.sdfarm.kr in state RUNNING-STAGEGO, dest =
> wn1038.sdfarm.kr
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type NONE on socket 10
>
>
> On pbs_mom, I have the following:
>
> 12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
> CopyFiles from PBS_Server
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
> CopyFiles from host ce02.sdfarm.kr received
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
> CopyFiles from host ce02.sdfarm.kr allowed
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;dispatch_request;dispatching
> request CopyFiles on sd=10
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;240.ce02.sdfarm.kr;attempting to
> copy file 'ce02.sdfarm.kr:
> /home/dteam018/.lcgjm/globus-cache-export.io8870/globus-cache-export.io8870.gpg'
> 12/08/2010 15:06:04;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
> pre-sigprocmask
> 12/08/2010 15:06:04;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
> post-initgroups
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;N/A;forking to user, uid: 11218
>  gid: 11200  homedir: '/home/dteam018'
> 12/08/2010 15:06:04;0002;   pbs_mom;n/a;mom_close_poll;entered
> 12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
> ModifyJob from PBS_Server
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host ce02.sdfarm.kr received
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host ce02.sdfarm.kr allowed
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;dispatch_request;dispatching
> request ModifyJob on sd=12
> 12/08/2010 15:06:04;0080;   pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=wn1038.sdfarm.kr MSG=modify job failed,
> unknown job 240.ce02.sdfarm.kr), aux=0, type=ModifyJob, from
> PBS_Server at ce02.sdfarm.kr
> 12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
> Disconnect from PBS_Server
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;scan_for_terminated;entered
> 12/08/2010 15:06:04;0080;   pbs_mom;Svr;mom_get_sample;proc_array load
> started
> 12/08/2010 15:06:04;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded -
> nproc=210
> 12/08/2010 15:06:04;0008;   pbs_mom;Job;scan_for_terminated;pid 12666 not
> tracked, statloc=0, exitval=0
>
>
>
> To try to find the problem, I compiled the sources in debug mode, and in
> gdb I get the following information:
>
> (gdb) run
> Starting program: /usr/sbin/pbs_server
> pbs_server is up
> entered spec=wn1038.sdfarm.kr
> job allocation debug: 1 requested, 8 svr_clnodes, 1 svr_totnodes
> node_spec: wn1038.sdfarm.kr nsn 8, nsnfree 8, nsnshared 0
> node_spec: wn1038.sdfarm.kr/0 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/1 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/2 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/3 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/4 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/5 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/6 inuse 0x0 nprops 3
> node_spec: wn1038.sdfarm.kr/7 inuse 0x0 nprops 3
> job allocation debug(2): 1 requested, 1 svr_numnodes
> job allocation debug(3): returning 1 requested
> allocated node wn1038.sdfarm.kr/0 to job 202.ce02.sdfarm.kr (nsnfree=8)
> *** glibc detected *** double free or corruption (!prev): 0x0a6f8e90 ***
>
> Program received signal SIGABRT, Aborted.
> 0x007047a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> (gdb) where
> #0  0x007047a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x00745915 in raise () from /lib/tls/libc.so.6
> #2  0x00747379 in abort () from /lib/tls/libc.so.6
> #3  0x00779e1a in __libc_message () from /lib/tls/libc.so.6
> #4  0x0078081f in _int_free () from /lib/tls/libc.so.6
> #5  0x00780c9a in free () from /lib/tls/libc.so.6
> #6  0x080648b4 in free_br ()
> #7  0x080655f3 in reply_send ()
> #8  0x08065636 in reply_ack ()
> #9  0x0806f23e in post_modify_req ()
> #10 0x0808548f in dispatch_task ()
> #11 0x08051957 in process_Dreply ()
> #12 0x00f1dace in wait_request (waittime=17, SState=0x8122b3c) at
> ../Libnet/net_server.c:507
> #13 0x080614d3 in main_loop ()
> #14 0x0806225f in main ()
>
>
>
> Does anyone have any suggestions as to what I should do from here?
>
> Is it possible that the OS is too old for the server? Are there any
> compatibility issues?
>
> Cheers,
>
> Chris.
>
>
> --
> ------------------------------------------------------
> Bonnaud Christophe
> GSDC
> Korea Institute of Science and Technology Information
> Fax. +82-42-869-0789
> Tel. +82-42-869-0660
> Mobile +82-10-4664-3193
>
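
A closing note on the quoted server log, as a hint only: the last reply before the crash is logged with request type NONE, right after post_modify_req reports PBSE_UNKJOBID, and the 32-bit gdb trace shows the abort inside free_br() called from reply_send(). One possible reading is that the request had already been released once by that point; this cannot be confirmed from the logs alone. The Python sketch below (function names are made up for illustration) unwraps the mail-quoted log and lists every "Reply sent" record so that such entries stand out.

```python
import re
from typing import List, Tuple

# A log record starts with a timestamp such as "12/08/2010 15:06:04;".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
REPLY_RE = re.compile(r"Reply sent for request type (\w+) on socket (\d+)")


def unwrap(quoted_log: str) -> List[str]:
    """Strip '> ' quote markers and rejoin records that the mailer wrapped."""
    records: List[str] = []
    for line in quoted_log.splitlines():
        text = re.sub(r"^>\s?", "", line).rstrip()
        if not text:
            continue
        if RECORD_START.match(text) or not records:
            records.append(text)
        else:
            records[-1] += " " + text
    return records


def replies(records: List[str]) -> List[Tuple[str, int]]:
    """Return (request type, socket) for every 'Reply sent' record."""
    out = []
    for rec in records:
        m = REPLY_RE.search(rec)
        if m:
            out.append((m.group(1), int(m.group(2))))
    return out


# Last three records of the quoted pbs_server log.
sample = """\
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type ModifyJob on socket 12
> 12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;post_modify_req:
> PBSE_UNKJOBID for job 240.ce02.sdfarm.kr in state RUNNING-STAGEGO, dest =
> wn1038.sdfarm.kr
> 12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
> type NONE on socket 10
"""
found = replies(unwrap(sample))
suspicious = [r for r in found if r[0] == "NONE"]
print(found)       # [('ModifyJob', 12), ('NONE', 10)]
print(suspicious)  # [('NONE', 10)]
```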



-- 
------------------------------------------------------
Bonnaud Christophe
GSDC
Korea Institute of Science and Technology Information
Fax. +82-42-869-0789
Tel. +82-42-869-0660
Mobile +82-10-4664-3193