[torqueusers] Major Problem with pbs_server database being corrupted

Mike Dacre mike.dacre at gmail.com
Fri Nov 2 18:08:17 MDT 2012


Hi Ken,

Thanks for the info, and sorry for the multiple submissions; I got confused
because it looked like my emails were being bounced.

I haven't figured out how to repair the serverdb by hand.  What did you
mean by that?
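
In the meantime, in case it helps with narrowing this down: since serverdb
lives under server_priv and is stored as XML in 4.x, I guess one way to at
least see whether the file is still intact after a crash would be something
like the following (this assumes the default /var/spool/torque layout and
that xmllint from libxml2 is installed, and it keeps a copy first so a hand
edit can't make things worse):

# with pbs_server stopped
cd /var/spool/torque/server_priv
cp -a serverdb serverdb.bak
xmllint --noout serverdb && echo "serverdb is well-formed"

If xmllint reports a missing closing tag it also prints the line number,
which would at least show where the file got truncated.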

Thanks,

Mike


On Fri, Nov 2, 2012 at 5:00 PM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:

>
>
> On Fri, Nov 2, 2012 at 4:22 PM, Mike Dacre <mike.dacre at gmail.com> wrote:
>
>>  Hi Everyone,
>>
>> I am having a major issue I can't figure out.  When I start pbs_server I
>> get the following errors:
>>
>> PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag
>>
>> PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute
>> resources_assigned
>>
>> I also find that any changes I make with qmgr are undone when I restart
>> pbs_server, and that pbs_server crashes while my users are using it.  There
>> is nothing in the log, even at log level 7; it just dies.  It seems like
>> the server can't write to the torque home directory (/var/spool/torque).
>> When I start over with pbs_server -t create, the error goes away for a
>> while.  Then after some number of restarts, the error is back.
>>
>> This is the third time this has happened; before this, the queue at least
>> restarted successfully.  This time, one of my queues just disappeared, and
>> all of the jobs associated with it were deleted when the server was
>> restarted.  This is a MAJOR problem, as it represents hours of lost time
>> for my users.
>>
>> Part of the qmgr config disappeared.  Not all of it, just the default
>> queue that was being used, and some of my changes to the server config.
>>
>> You can look at the attached log.  It is only log level 0, but you can
>> see close to the top where I restarted the server and then all of this
>> mayhem happened.  I should note that I made no changes to the server config
>> before this restart.
>>
>> I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my
>> hostname, which totally throws it for a loop, and jobs just don't run) with
>> maui 3.3.1.  It was compiled with the following options:
>>
>> ./configure --enable-blcr --enable-docs --enable-syslog
>>
>> The permissions of /var/spool/torque:
>> drwxr-xr-x   13 root root 4.0K Oct 24 17:01 .
>> drwxr-xr-x.  17 root root 4.0K Oct 23 19:20 ..
>> drwxr-xr-x    2 root root 4.0K Oct 24 10:13 aux
>> drwxrwxrwt    2 root root 4.0K Oct 23 19:20 checkpoint
>> drwxr-xr-x    2 root root 4.0K Oct 23 19:20 job_logs
>> drwxr-xr-x    2 root root 4.0K Oct 30 00:01 mom_logs
>> drwxr-x--x    3 root root 4.0K Oct 23 19:23 mom_priv
>> -rw-r--r--    1 root root   66 Oct 23 21:07 pbs_environment
>> drwxr-xr-x    2 root root 4.0K Oct 23 19:24 sched_logs
>> drwxr-x---    3 root root 4.0K Oct 23 21:07 sched_priv
>> drwxr-xr-x    2 root root 4.0K Oct 30 00:00 server_logs
>> -rw-r--r--    1 root root   14 Oct 23 21:07 server_name
>> drwxr-x---   13 root root 4.0K Oct 30 20:05 server_priv
>> drwxrwxrwt    2 root root 4.0K Oct 24 10:13 spool
>> drwxrwxrwt    2 root root 4.0K Oct 23 19:20 undelivered
>>
>> output of qmgr -c 'p s':
>>
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue default
>> #
>> create queue default
>> set queue default queue_type = Execution
>> set queue default Priority = 0
>> set queue default resources_max.neednodes = slave
>> set queue default resources_default.neednodes = slave
>> set queue default resources_default.nice = 0
>> set queue default resources_available.ncpus = 160
>> set queue default resources_available.neednodes = slave
>> set queue default resources_available.nodes = 20
>> set queue default max_user_run = 100
>> set queue default enabled = True
>> set queue default started = True
>> #
>> # Create and define queue long
>> #
>> create queue long
>> set queue long queue_type = Execution
>> set queue long Priority = -10
>> set queue long max_running = 140
>> set queue long resources_max.mem = 32gb
>> set queue long resources_max.ncpus = 128
>> set queue long resources_max.neednodes = slave
>> set queue long resources_max.nodes = 16
>> set queue long resources_min.cput = 02:00:01
>> set queue long resources_default.mem = 2gb
>> set queue long resources_default.neednodes = slave
>> set queue long resources_default.nice = 15
>> set queue long resources_available.mem = 600gb
>> set queue long resources_available.ncpus = 128
>> set queue long resources_available.neednodes = slave
>> set queue long resources_available.nodes = 16
>> set queue long enabled = True
>> set queue long started = True
>> #
>> # Create and define queue high_priority
>> #
>> create queue high_priority
>> set queue high_priority queue_type = Execution
>> set queue high_priority Priority = 10000
>> set queue high_priority resources_max.walltime = 56:00:00
>> set queue high_priority resources_default.nice = -10
>> set queue high_priority resources_default.walltime = 48:00:00
>> set queue high_priority enabled = True
>> set queue high_priority started = True
>> #
>> # Set server attributes.
>> #
>> set server acl_hosts = fraser-server
>> set server default_queue = default
>> set server log_events = 511
>> set server mail_from = adm
>> set server query_other_jobs = True
>> set server resources_available.mem = 625gb
>> set server resources_default.mem = 4gb
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 300
>> set server job_stat_rate = 45
>> set server poll_jobs = True
>> set server mom_job_sync = True
>> set server allow_node_submit = True
>> set server next_job_number = 3301
>> set server moab_array_compatible = True
>>
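>> (Side note: this dump is valid qmgr input, so after a pbs_server -t create
>> it should be possible to feed a saved copy straight back in to recreate the
>> queues and server settings; e.g., with the backup path just an example:
>>
>> qmgr -c 'p s' > /root/torque-qmgr-backup.txt
>> # ...and after the next pbs_server -t create:
>> qmgr < /root/torque-qmgr-backup.txt
>> )
>>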
>> -Mike
>>
>
> Mike,
>
> It looks like you have already figured out that you can repair the
> serverdb file by hand.
>
> TORQUE 4.1.3 is available, but it also has a problem with a hyphen in the
> host name.
>
> Sorry I am not more help at the moment.
>
> Regards
>
> Ken
>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>