[torqueusers] pbs_server crash

christophe bonnaud takyon77 at gmail.com
Tue Dec 7 23:25:56 MST 2010


Hello,

I am not an expert in torque/pbs so I hope my message will be clear enough.

I have just installed a new torque server/mom, compiled from source using
the commands:
    ./configure --with-server-home=/var/spool/pbs --prefix=/usr
    make rpm

The server is running on Scientific Linux 4.6 32-bit (kernel
2.6.9-89.31.1.EL.cernsmp) and the client is running on Scientific Linux
5.5 64-bit (kernel 2.6.18-194.26.1.el5).
The server and the client were each compiled on the machine they run on.

I use this pbs server with the LCG middleware for a Computing Element.

For the moment only one worker node is used, to help isolate the problem.

A simple manual job submission works fine, but when a job arrives through
the grid, pbs_server crashes.

The pbs configuration is normally generated automatically by the middleware
installation, but I tried a basic configuration as follows:
#
# Create queues and set their attributes.
#
#
# Create and define queue alice
#
create queue alice
set queue alice queue_type = Execution
set queue alice acl_group_enable = True
set queue alice acl_groups = alice
set queue alice acl_groups += alicesgm
set queue alice enabled = True
set queue alice started = True
#
# Create and define queue ops
#
create queue ops
set queue ops queue_type = Execution
set queue ops acl_group_enable = True
set queue ops acl_groups = ops
set queue ops acl_groups += opssgm
set queue ops enabled = True
set queue ops started = True
#
# Create and define queue dteam
#
create queue dteam
set queue dteam queue_type = Execution
set queue dteam acl_group_enable = True
set queue dteam acl_groups = dteam
set queue dteam acl_groups += dteamsgm
set queue dteam enabled = True
set queue dteam started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = ce02.sdfarm.kr
set server managers = root@ce02.sdfarm.kr
set server operators = root@ce02.sdfarm.kr
set server default_queue = dteam
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server default_node = lcgpro
set server node_pack = False
set server log_level = 7
set server kill_delay = 10
set server next_job_number = 204
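
A dump like the one above can be checked mechanically before it is loaded. The following is only a minimal sketch in Python, not part of torque: parse_qmgr and QMGR_DUMP are hypothetical names, only a few directives from the dump are copied in, and the parser assumes every directive has the simple "create queue NAME" or "set TARGET attr = value" shape seen here.

```python
import re
from collections import defaultdict

# A few directives copied from the configuration above (not the full dump).
QMGR_DUMP = """\
create queue dteam
set queue dteam queue_type = Execution
set queue dteam acl_group_enable = True
set queue dteam acl_groups = dteam
set queue dteam acl_groups += dteamsgm
set queue dteam enabled = True
set queue dteam started = True
set server default_queue = dteam
set server default_node = lcgpro
"""

# Matches "set server attr = value" and "set queue NAME attr = value" (or "+=").
SET_RE = re.compile(r"^set\s+(server|queue\s+(\S+))\s+(\S+)\s*(\+?=)\s*(.*)$")


def parse_qmgr(text):
    """Return (queues, server); every attribute maps to a list of values."""
    queues = defaultdict(lambda: defaultdict(list))
    server = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("create queue "):
            queues[line.split()[2]]  # register the queue even without attributes
            continue
        m = SET_RE.match(line)
        if not m:
            continue  # ignore anything this sketch does not understand
        target, queue_name, attr, op, value = m.groups()
        attrs = server if target == "server" else queues[queue_name]
        if op == "=":
            attrs[attr] = [value]      # plain assignment replaces the value
        else:
            attrs[attr].append(value)  # "+=" appends to the existing list
    return queues, server


queues, server = parse_qmgr(QMGR_DUMP)
default_queue = server["default_queue"][0]
print("default queue defined:", default_queue in queues)
print("dteam acl_groups:", queues["dteam"]["acl_groups"])
```

The point of the check is limited: it confirms that the default queue named in the server attributes is actually created and shows the ACL groups accumulated by the "+=" lines.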

The nodes file contains only one line:

wn1038.sdfarm.kr np=8 lcgpro ops dteam alice
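
As a quick consistency check, the nodes line can be split into host name, processor count and properties, and compared with the default_node value from the server attributes. This is a sketch only: parse_nodes_line is a hypothetical helper, it assumes that every field other than np= is a property, and it assumes np is 1 when the field is omitted.

```python
def parse_nodes_line(line):
    """Split one nodes-file line into host, np count and property list."""
    fields = line.split()
    host, np, props = fields[0], 1, []  # np=1 when omitted is an assumption
    for field in fields[1:]:
        if field.startswith("np="):
            np = int(field[3:])
        else:
            props.append(field)  # assumption: everything else is a property
    return {"host": host, "np": np, "properties": props}


node = parse_nodes_line("wn1038.sdfarm.kr np=8 lcgpro ops dteam alice")
default_node = "lcgpro"  # value of "set server default_node" in the config

print(node)
print("default_node matches a property:", default_node in node["properties"])
```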


pbs_server logs leading up to the crash triggered by job 240:


12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
ReadyToCommit on sd=11
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;ready to commit
job
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type ReadyToCommit on socket 11
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;ready to commit
job completed
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
Commit from dteam018
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type Commit request received from
dteam018 at ce02.sdfarm.kr, sock=11
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
Commit on sd=11
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;committing job
12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 240.ce02.sdfarm.kr state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN
(1-11)
12/08/2010 15:06:04;0100;PBS_Server;Job;240.ce02.sdfarm.kr;enqueuing into
dteam, state 1 hop 1
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type Commit on socket 11
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Reply sent for
request type Commit on socket 11
12/08/2010 15:06:04;0040;PBS_Server;Svr;ce02.sdfarm.kr;Scheduler was sent
the command new
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
Disconnect from dteam018
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
StatusNode from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusNode request received
from root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
StatusNode on sd=10
12/08/2010 15:06:04;0040;PBS_Server;Req;req_stat_node;entered
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusNode on socket 10
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
StatusQueue from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusQueue request received
from root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
StatusQueue on sd=10
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusQueue on socket 10
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
StatusJob from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type StatusJob request received
from root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
StatusJob on sd=10
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusJob on socket 10
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
ModifyJob from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type ModifyJob request received
from root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
ModifyJob on sd=10
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;attr
Resource_List modified
12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 240.ce02.sdfarm.kr state from QUEUED-PRESTAGEIN to QUEUED-PRESTAGEIN
(1-11)
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Modified at
request of root at ce02.sdfarm.kr
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type ModifyJob on socket 10
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
RunJob from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type RunJob request received from
root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
RunJob on sd=10
12/08/2010 15:06:04;0040;PBS_Server;Req;set_nodes;allocating nodes for job
240.ce02.sdfarm.kr with node expression 'wn1038.sdfarm.kr'
12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;entered spec=
wn1038.sdfarm.kr
12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug: 1
requested, 8 svr_clnodes, 1 svr_totnodes
12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug(2): 1
requested, 1 svr_numnodes
12/08/2010 15:06:04;0040;PBS_Server;Req;node_spec;job allocation debug(3):
returning 1 requested
12/08/2010 15:06:04;0040;PBS_Server;Req;add_job_to_node;allocated node
wn1038.sdfarm.kr/0 to job 240.ce02.sdfarm.kr (nsnfree=8)
12/08/2010 15:06:04;0040;PBS_Server;Req;set_nodes;job
240.ce02.sdfarm.krallocated 1 nodes (nodelist=
wn1038.sdfarm.kr/0)
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Run at
request of root at ce02.sdfarm.kr
12/08/2010 15:06:04;0040;PBS_Server;Req;relay_to_mom;momaddr=134.75.123.138
12/08/2010 15:06:04;0004;PBS_Server;Svr;svr_connect;attempting connect to
host 134.75.123.138 port 15002
12/08/2010 15:06:04;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 240.ce02.sdfarm.kr state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO
(4-15)
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type RunJob on socket 10
12/08/2010 15:06:04;0080;PBS_Server;Req;dis_request_read;decoding command
ModifyJob from root
12/08/2010 15:06:04;0100;PBS_Server;Req;;Type ModifyJob request received
from root at ce02.sdfarm.kr, sock=10
12/08/2010 15:06:04;0008;PBS_Server;Job;dispatch_request;dispatching request
ModifyJob on sd=10
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;attr
Resource_List modified
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;Job Modified at
request of root at ce02.sdfarm.kr
12/08/2010 15:06:04;0040;PBS_Server;Req;relay_to_mom;momaddr=134.75.123.138
12/08/2010 15:06:04;0004;PBS_Server;Svr;svr_connect;attempting connect to
host 134.75.123.138 port 15002
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type ModifyJob on socket 12
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;post_modify_req:
PBSE_UNKJOBID for job 240.ce02.sdfarm.kr in state RUNNING-STAGEGO, dest =
wn1038.sdfarm.kr
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type NONE on socket 10
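
The mail client wrapped the log lines above, so one record is often spread over two or three lines. A small Python sketch such as the following can rejoin them and split each record into its semicolon-separated fields. The names parse_log and SAMPLE are hypothetical, the field names are my own labels for the six fields, and joining continuation lines with a single space is an assumption about where the wrapping happened.

```python
import re

# A record starts with "MM/DD/YYYY HH:MM:SS;"; anything else is a wrapped tail.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# The last three records of the server log above, wrapped as in the mail.
SAMPLE = """\
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type ModifyJob on socket 12
12/08/2010 15:06:04;0008;PBS_Server;Job;240.ce02.sdfarm.kr;post_modify_req:
PBSE_UNKJOBID for job 240.ce02.sdfarm.kr in state RUNNING-STAGEGO, dest =
wn1038.sdfarm.kr
12/08/2010 15:06:04;0008;PBS_Server;Job;reply_send;Reply sent for request
type NONE on socket 10
"""


def parse_log(text):
    """Rejoin wrapped lines and return one dict per log record."""
    records = []
    for line in text.splitlines():
        if RECORD_START.match(line):
            records.append(line.rstrip())
        elif records and line.strip():
            records[-1] += " " + line.strip()  # continuation of previous record
    parsed = []
    for rec in records:
        parts = rec.split(";", 5)
        if len(parts) != 6:
            continue  # not in the expected six-field shape
        stamp, code, daemon, obj_class, obj_name, message = parts
        parsed.append({
            "time": stamp,
            "event": code,
            "daemon": daemon.strip(),
            "class": obj_class,
            "name": obj_name,
            "message": message,
        })
    return parsed


entries = parse_log(SAMPLE)
for entry in entries:
    if "240.ce02.sdfarm.kr" in entry["name"] or "240.ce02.sdfarm.kr" in entry["message"]:
        print(entry["time"], entry["message"])
```

Filtering on the job id this way makes it easier to follow a single job through a log written at log_level 7.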


On pbs_mom, I have the following:

12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
CopyFiles from PBS_Server
12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
CopyFiles from host ce02.sdfarm.kr received
12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
CopyFiles from host ce02.sdfarm.kr allowed
12/08/2010 15:06:04;0008;   pbs_mom;Job;dispatch_request;dispatching request
CopyFiles on sd=10
12/08/2010 15:06:04;0008;   pbs_mom;Job;240.ce02.sdfarm.kr;attempting to
copy file 'ce02.sdfarm.kr:
/home/dteam018/.lcgjm/globus-cache-export.io8870/globus-cache-export.io8870.gpg'
12/08/2010 15:06:04;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
pre-sigprocmask
12/08/2010 15:06:04;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
post-initgroups
12/08/2010 15:06:04;0008;   pbs_mom;Job;N/A;forking to user, uid: 11218
 gid: 11200  homedir: '/home/dteam018'
12/08/2010 15:06:04;0002;   pbs_mom;n/a;mom_close_poll;entered
12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
ModifyJob from PBS_Server
12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host ce02.sdfarm.kr received
12/08/2010 15:06:04;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host ce02.sdfarm.kr allowed
12/08/2010 15:06:04;0008;   pbs_mom;Job;dispatch_request;dispatching request
ModifyJob on sd=12
12/08/2010 15:06:04;0080;   pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=wn1038.sdfarm.kr MSG=modify job failed,
unknown job 240.ce02.sdfarm.kr), aux=0, type=ModifyJob, from
PBS_Server at ce02.sdfarm.kr
12/08/2010 15:06:04;0080;   pbs_mom;Req;dis_request_read;decoding command
Disconnect from PBS_Server
12/08/2010 15:06:04;0008;   pbs_mom;Job;scan_for_terminated;entered
12/08/2010 15:06:04;0080;   pbs_mom;Svr;mom_get_sample;proc_array load
started
12/08/2010 15:06:04;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded -
nproc=210
12/08/2010 15:06:04;0008;   pbs_mom;Job;scan_for_terminated;pid 12666 not
tracked, statloc=0, exitval=0



To try to find the problem, I compiled the sources in debug mode, and in
gdb I obtained the following information:

(gdb) run
Starting program: /usr/sbin/pbs_server
pbs_server is up
entered spec=wn1038.sdfarm.kr
job allocation debug: 1 requested, 8 svr_clnodes, 1 svr_totnodes
node_spec: wn1038.sdfarm.kr nsn 8, nsnfree 8, nsnshared 0
node_spec: wn1038.sdfarm.kr/0 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/1 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/2 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/3 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/4 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/5 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/6 inuse 0x0 nprops 3
node_spec: wn1038.sdfarm.kr/7 inuse 0x0 nprops 3
job allocation debug(2): 1 requested, 1 svr_numnodes
job allocation debug(3): returning 1 requested
allocated node wn1038.sdfarm.kr/0 to job 202.ce02.sdfarm.kr (nsnfree=8)
*** glibc detected *** double free or corruption (!prev): 0x0a6f8e90 ***

Program received signal SIGABRT, Aborted.
0x007047a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0  0x007047a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00745915 in raise () from /lib/tls/libc.so.6
#2  0x00747379 in abort () from /lib/tls/libc.so.6
#3  0x00779e1a in __libc_message () from /lib/tls/libc.so.6
#4  0x0078081f in _int_free () from /lib/tls/libc.so.6
#5  0x00780c9a in free () from /lib/tls/libc.so.6
#6  0x080648b4 in free_br ()
#7  0x080655f3 in reply_send ()
#8  0x08065636 in reply_ack ()
#9  0x0806f23e in post_modify_req ()
#10 0x0808548f in dispatch_task ()
#11 0x08051957 in process_Dreply ()
#12 0x00f1dace in wait_request (waittime=17, SState=0x8122b3c) at ../Libnet/net_server.c:507
#13 0x080614d3 in main_loop ()
#14 0x0806225f in main ()
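
gdb prints the innermost frame first, so reversing the backtrace gives the order of calls that ended in the abort. The Python sketch below does only that; the names frames and BACKTRACE are hypothetical and the regular expression covers just the frame format shown in this trace.

```python
import re

# Frames #4 to #10 copied from the backtrace above.
BACKTRACE = """\
#4  0x0078081f in _int_free () from /lib/tls/libc.so.6
#5  0x00780c9a in free () from /lib/tls/libc.so.6
#6  0x080648b4 in free_br ()
#7  0x080655f3 in reply_send ()
#8  0x08065636 in reply_ack ()
#9  0x0806f23e in post_modify_req ()
#10 0x0808548f in dispatch_task ()
"""

# "#N  0xADDR in function (" ; arguments and source location are ignored.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\w+) \(")


def frames(text):
    """Return (frame number, function name) pairs in the order gdb printed them."""
    found = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            found.append((int(m.group(1)), m.group(2)))
    return found


# Outermost caller first, innermost (the failing free) last.
chain = [name for _, name in reversed(frames(BACKTRACE))]
print(" -> ".join(chain))
```

Read this way, the abort happens while post_modify_req acknowledges the ModifyJob reply and free_br releases the request. The server log shows a reply for ModifyJob already sent on socket 12 just before post_modify_req reports PBSE_UNKJOBID, which would be consistent with the same request being released twice, although the trace alone does not prove it.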



Does anyone have any suggestions as to what I should do from here?

Is it possible that the OS is too old for the server? Are there
compatibility issues?

Cheers,

Chris.


-- 
------------------------------------------------------
Bonnaud Christophe
GSDC
Korea Institute of Science and Technology Information
Fax. +82-42-869-0789
Tel. +82-42-869-0660
Mobile +82-10-4664-3193