[torqueusers] Major Problem with pbs_server database being corrupted

Ken Nielson knielson at adaptivecomputing.com
Fri Nov 2 18:00:40 MDT 2012


On Fri, Nov 2, 2012 at 4:22 PM, Mike Dacre <mike.dacre at gmail.com> wrote:

>  Hi Everyone,
>
> I am having a major issue I can't figure out.  When I start pbs_server I
> get the following error:
>
> PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag
>
> PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute
> resources_assigned
>
> I also find that any changes I make with qmgr are undone when I restart
> pbs_server, and pbs_server crashes while my users are using it.  There
> is nothing in the log, even at log level 7; it just dies.  It seems like
> the server can't write to the torque home directory (/var/spool/torque).
> When I start over with pbs_server -t create, the error goes away for a
> while.  Then after some number of restarts, the error is back.
>
> This is the third time this has happened; before this the queue at least
> restarted successfully.  This time, one of my queues just disappeared, and
> all of the jobs associated with it were deleted when the server was
> restarted.  This is a MAJOR problem, as it represents hours of lost time
> for my users.
>
> Part of the qmgr config disappeared.  Not all of it, just the default
> queue that was being used, and some of my changes to the server config.
>
> You can look at the attached log.  It is only log level 0, but you can see
> close to the top where I restarted the server and then all of this mayhem
> happened.  I should note that I made no changes to the server config before
> this restart.
>
> I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my
> hostname which totally throws it for a loop, and jobs just don't run) with
> maui 3.3.1.  It was compiled with the following options:
>
> ./configure --enable-blcr --enable-docs --enable-syslog
>
> The permissions of /var/spool/torque:
> drwxr-xr-x  13 root root 4.0K Oct 24 17:01 .
> drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
> drwxr-xr-x   2 root root 4.0K Oct 24 10:13 aux
> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 checkpoint
> drwxr-xr-x   2 root root 4.0K Oct 23 19:20 job_logs
> drwxr-xr-x   2 root root 4.0K Oct 30 00:01 mom_logs
> drwxr-x--x   3 root root 4.0K Oct 23 19:23 mom_priv
> -rw-r--r--   1 root root   66 Oct 23 21:07 pbs_environment
> drwxr-xr-x   2 root root 4.0K Oct 23 19:24 sched_logs
> drwxr-x---   3 root root 4.0K Oct 23 21:07 sched_priv
> drwxr-xr-x   2 root root 4.0K Oct 30 00:00 server_logs
> -rw-r--r--   1 root root   14 Oct 23 21:07 server_name
> drwxr-x---  13 root root 4.0K Oct 30 20:05 server_priv
> drwxrwxrwt   2 root root 4.0K Oct 24 10:13 spool
> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 undelivered
>
> output of qmgr -c 'p s':
>
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue default
> #
> create queue default
> set queue default queue_type = Execution
> set queue default Priority = 0
> set queue default resources_max.neednodes = slave
> set queue default resources_default.neednodes = slave
> set queue default resources_default.nice = 0
> set queue default resources_available.ncpus = 160
> set queue default resources_available.neednodes = slave
> set queue default resources_available.nodes = 20
> set queue default max_user_run = 100
> set queue default enabled = True
> set queue default started = True
> #
> # Create and define queue long
> #
> create queue long
> set queue long queue_type = Execution
> set queue long Priority = -10
> set queue long max_running = 140
> set queue long resources_max.mem = 32gb
> set queue long resources_max.ncpus = 128
> set queue long resources_max.neednodes = slave
> set queue long resources_max.nodes = 16
> set queue long resources_min.cput = 02:00:01
> set queue long resources_default.mem = 2gb
> set queue long resources_default.neednodes = slave
> set queue long resources_default.nice = 15
> set queue long resources_available.mem = 600gb
> set queue long resources_available.ncpus = 128
> set queue long resources_available.neednodes = slave
> set queue long resources_available.nodes = 16
> set queue long enabled = True
> set queue long started = True
> #
> # Create and define queue high_priority
> #
> create queue high_priority
> set queue high_priority queue_type = Execution
> set queue high_priority Priority = 10000
> set queue high_priority resources_max.walltime = 56:00:00
> set queue high_priority resources_default.nice = -10
> set queue high_priority resources_default.walltime = 48:00:00
> set queue high_priority enabled = True
> set queue high_priority started = True
> #
> # Set server attributes.
> #
> set server acl_hosts = fraser-server
> set server default_queue = default
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_available.mem = 625gb
> set server resources_default.mem = 4gb
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server allow_node_submit = True
> set server next_job_number = 3301
> set server moab_array_compatible = True
>
> -Mike
>

Mike,

It looks like you have already figured out that you can repair the serverdb
file by hand.
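
Before editing the file by hand, it may help to locate the tag that
svr_recov_xml is complaining about. The following is only a rough sketch,
not part of TORQUE: it assumes serverdb is a plain-text, XML-like file with
simple <name> ... </name> tags (as the "Cannot find closing tag" error
suggests), and the helper name find_tag_problems is made up for this
example. It is probably wise to stop pbs_server and work on a copy of the
file.

```python
import re

# Matches <name>, </name> and <name/>. Assumption: tag names are simple
# identifiers and values do not themselves contain '<' or '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def find_tag_problems(text):
    """Return a list of (line_number, message) for unbalanced tags."""
    problems = []
    stack = []  # (tag_name, line_number) for tags that are still open
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG_RE.finditer(line):
            closing, name, self_closing = match.groups()
            if self_closing:
                continue  # <name/> opens and closes itself
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            elif name in [open_name for open_name, _ in stack]:
                # Every tag opened after the matching one was left unclosed.
                while stack[-1][0] != name:
                    open_name, open_line = stack.pop()
                    problems.append((open_line, f"<{open_name}> is never closed"))
                stack.pop()
            else:
                problems.append((lineno, f"unexpected </{name}>"))
    # Anything still open at end of file never got a closing tag.
    for open_name, open_line in stack:
        problems.append((open_line, f"<{open_name}> is never closed"))
    return problems


# Example use, assuming the default location under the torque home
# mentioned above (/var/spool/torque/server_priv/serverdb):
#
#     with open("/var/spool/torque/server_priv/serverdb") as f:
#         for lineno, message in find_tag_problems(f.read()):
#             print(f"line {lineno}: {message}")
```

The reported line numbers only say where an unclosed tag was opened; what
the missing closing tag should look like still has to be judged from the
neighbouring entries in the file.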

TORQUE 4.1.3 is available, but it also has the problem with a hyphen in the
host name.

Sorry I am not more help at the moment.
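
Until the cause is found, keeping a copy of the qmgr configuration would at
least make it quicker to rebuild after a pbs_server -t create. A minimal
sketch, assuming qmgr is on the PATH of the user running it; the function
name dump_qmgr_config and the destination path are made up for this
example:

```python
import subprocess
from pathlib import Path


def dump_qmgr_config(dest, run=subprocess.run):
    """Save the output of `qmgr -c 'p s'` to dest and return the path.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without a real qmgr binary.
    """
    result = run(
        ["qmgr", "-c", "p s"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if qmgr exits non-zero
    )
    path = Path(dest)
    path.write_text(result.stdout)
    return path


# Example use (hypothetical destination path):
#
#     dump_qmgr_config("/root/qmgr-backup.conf")
```

The saved file is a list of qmgr commands, so it can normally be fed back
in with qmgr < /root/qmgr-backup.conf on a freshly created server.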

Regards

Ken
