[torqueusers] Major Problem with pbs_server database being corrupted

Mike Dacre mike.dacre at gmail.com
Fri Nov 2 17:22:56 MDT 2012


Hi Everyone,

I am having a major issue I can't figure out.  When I start pbs_server I get the following errors:

PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag

PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute resources_assigned
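
Since the first error complains about a missing closing tag, one rough check is to count opening and closing tags for the attribute named in the second error. The sketch below assumes the server database is a plain-text, XML-like file in which each attribute is wrapped in a tag of the same name; count_tag is a hypothetical helper, not part of TORQUE.

```shell
# Hypothetical helper: compare the number of opening and closing tags
# for one tag name in an XML-like text file.
# Assumes tags look like <name> or <name attr=...> and </name>.
count_tag() {
    file="$1"
    tag="$2"
    # gsub() returns the number of matches on each line; sum them up.
    opens=$(awk -v re="<$tag[ >]" '{ n += gsub(re, "&") } END { print n + 0 }' "$file")
    closes=$(awk -v re="</$tag>" '{ n += gsub(re, "&") } END { print n + 0 }' "$file")
    echo "$tag open=$opens close=$closes"
    # Succeed only when the counts match.
    [ "$opens" -eq "$closes" ]
}
```

With the default layout this might be run as count_tag /var/spool/torque/server_priv/serverdb resources_assigned (the serverdb path is an assumption based on the torque home shown in this message). A mismatch would suggest the file was truncated or only partly written.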

I also find that any changes I make with qmgr are undone when I restart pbs_server, and pbs_server crashes while my users are using it.  There is nothing in the log, even at log level 7; it just dies.  It seems like the server can't write to the torque home directory (/var/spool/torque).  When I start over with pbs_server -t create, the error goes away for a while.  Then, after some number of restarts, the error is back.
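
The suspicion that the server can't write to the torque home directory can be tested directly. This is a minimal sketch, assuming it is run as the same user pbs_server runs as; check_writable is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a directory exists and accepts new files.
check_writable() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Create and remove a scratch file, much as a daemon saving state would.
    probe="$dir/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}
```

For example, check_writable /var/spool/torque/server_priv, followed by df -h /var/spool/torque to rule out a full filesystem, which can also leave a state file half written.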

This is the third time this has happened.  The previous times, the queue at least restarted successfully.  This time, one of my queues just disappeared, and all of the jobs associated with it were deleted when the server was restarted.  This is a MAJOR problem, as it represents hours of lost time for my users.

Part of the qmgr config disappeared.  Not all of it, just the default queue that was being used, and some of my changes to the server config.
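
Until the cause is found, the configuration can at least be saved before every restart, since qmgr -c 'p s' prints the create/set commands needed to rebuild it. A minimal sketch follows; backup_qmgr_config and the QMGR override variable are hypothetical names introduced here so the function can be tried without a running server.

```shell
# Hypothetical helper: save the current server and queue configuration
# to a timestamped file and print the file name.
# QMGR is an assumed override for the qmgr command, used only for testing.
backup_qmgr_config() {
    outdir="$1"
    mkdir -p "$outdir" || return 1
    outfile="$outdir/qmgr-$(date +%Y%m%d-%H%M%S).conf"
    # 'p s' (print server) emits the configuration as qmgr commands.
    if "${QMGR:-qmgr}" -c 'p s' > "$outfile" && [ -s "$outfile" ]; then
        echo "$outfile"
    else
        # Do not keep an empty or partial backup.
        rm -f "$outfile"
        return 1
    fi
}
```

A saved file can be fed back with qmgr < FILE after pbs_server -t create. This restores queues and server settings only; it does not bring back jobs that were lost.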

You can look at the attached log.  It is only log level 0, but you can see close to the top where I restarted the server and then all of this mayhem happened.  I should note that I made no changes to the server config before this restart.

I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my hostname which totally throws it for a loop, and jobs just don't run) with maui 3.3.1.  It was compiled with the following options:

./configure --enable-blcr --enable-docs --enable-syslog

The permissions of /var/spool/torque:
drwxr-xr-x  13 root root 4.0K Oct 24 17:01 .
drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
drwxr-xr-x   2 root root 4.0K Oct 24 10:13 aux
drwxrwxrwt   2 root root 4.0K Oct 23 19:20 checkpoint
drwxr-xr-x   2 root root 4.0K Oct 23 19:20 job_logs
drwxr-xr-x   2 root root 4.0K Oct 30 00:01 mom_logs
drwxr-x--x   3 root root 4.0K Oct 23 19:23 mom_priv
-rw-r--r--   1 root root   66 Oct 23 21:07 pbs_environment
drwxr-xr-x   2 root root 4.0K Oct 23 19:24 sched_logs
drwxr-x---   3 root root 4.0K Oct 23 21:07 sched_priv
drwxr-xr-x   2 root root 4.0K Oct 30 00:00 server_logs
-rw-r--r--   1 root root   14 Oct 23 21:07 server_name
drwxr-x---  13 root root 4.0K Oct 30 20:05 server_priv
drwxrwxrwt   2 root root 4.0K Oct 24 10:13 spool
drwxrwxrwt   2 root root 4.0K Oct 23 19:20 undelivered

Output of qmgr -c 'p s':

#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default Priority = 0
set queue default resources_max.neednodes = slave
set queue default resources_default.neednodes = slave
set queue default resources_default.nice = 0
set queue default resources_available.ncpus = 160
set queue default resources_available.neednodes = slave
set queue default resources_available.nodes = 20
set queue default max_user_run = 100
set queue default enabled = True
set queue default started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = -10
set queue long max_running = 140
set queue long resources_max.mem = 32gb
set queue long resources_max.ncpus = 128
set queue long resources_max.neednodes = slave
set queue long resources_max.nodes = 16
set queue long resources_min.cput = 02:00:01
set queue long resources_default.mem = 2gb
set queue long resources_default.neednodes = slave
set queue long resources_default.nice = 15
set queue long resources_available.mem = 600gb
set queue long resources_available.ncpus = 128
set queue long resources_available.neednodes = slave
set queue long resources_available.nodes = 16
set queue long enabled = True
set queue long started = True
#
# Create and define queue high_priority
#
create queue high_priority
set queue high_priority queue_type = Execution
set queue high_priority Priority = 10000
set queue high_priority resources_max.walltime = 56:00:00
set queue high_priority resources_default.nice = -10
set queue high_priority resources_default.walltime = 48:00:00
set queue high_priority enabled = True
set queue high_priority started = True
#
# Set server attributes.
#
set server acl_hosts = fraser-server
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.mem = 625gb
set server resources_default.mem = 4gb
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server allow_node_submit = True
set server next_job_number = 3301
set server moab_array_compatible = True

-Mike 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque_log.txt
Type: application/octet-stream
Size: 337537 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121102/3136a4c7/attachment-0001.obj 

