[torqueusers] Major Problem with pbs_server database being corrupted

Ken Nielson knielson at adaptivecomputing.com
Fri Nov 2 18:14:34 MDT 2012


Mike,

If you open the serverdb file you will see it uses XML.

To get an example of what it should look like you can create a new serverdb
in a different directory or save your current file to another name and then
run torque.setup or pbs_server -t create.

The serverdb that will be created can be used to fill in any of the holes
left in your original serverdb. Usually there are some entries in the
beginning that are not displayed in qmgr that need to be in place. Anything
that displays in qmgr can be fixed by putting the parameter name in an
opening and closing tag and putting the appropriate value in between.
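
A rough sketch of how that comparison could be scripted, in case it helps. It
scans a damaged serverdb for tags that are never closed (the kind of thing
behind a "Cannot find closing tag" error) and lists tags that appear in a
freshly created serverdb but not in the damaged one. It assumes tags are plain
`<name>` / `</name>` pairs with no XML attributes. The function names and the
tag names in the sample strings are made up for illustration, and a real
serverdb may use different ones.

```python
import re

# Matches opening and closing tags such as <name> and </name>.
# Assumption: serverdb tags carry no XML attributes and are never self-closing.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)>")


def tag_names(text):
    """Return the set of tag names that appear as opening tags."""
    return {name for slash, name in TAG_RE.findall(text) if not slash}


def unbalanced_tags(text):
    """Return (unclosed, stray): openers never closed, closers never opened."""
    unclosed, stray, stack = [], [], []
    for slash, name in TAG_RE.findall(text):
        if not slash:
            stack.append(name)
        elif name in stack:
            # Pop back to the matching opener; anything skipped was never closed.
            while stack[-1] != name:
                unclosed.append(stack.pop())
            stack.pop()
        else:
            stray.append(name)
    unclosed.extend(stack)  # openers still pending at end of file
    return unclosed, stray


def missing_tags(reference_text, damaged_text):
    """Tags present in the freshly created file but absent from the damaged one."""
    return sorted(tag_names(reference_text) - tag_names(damaged_text))


if __name__ == "__main__":
    # Hypothetical contents; real serverdb tag names may differ.
    reference = "<server_db><numjobs>0</numjobs><savetm>1</savetm></server_db>"
    damaged = "<server_db><numjobs>0</numjobs><resources_assigned></server_db>"
    print("missing:", missing_tags(reference, damaged))
    print("unbalanced:", unbalanced_tags(damaged))
```

To use it on real files, read both serverdb copies into strings with
`open(path).read()` and pass them to the same functions. Work on a copy of the
damaged file, with pbs_server stopped, before editing anything by hand.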

Let me know if you have any more questions.

Ken

On Fri, Nov 2, 2012 at 5:08 PM, Mike Dacre <mike.dacre at gmail.com> wrote:

> Hi Ken,
>
> Thanks for the info, and sorry for the multiple submissions, I got
> confused because it looked like my emails were being bounced.
>
> I haven't figured out how to repair the serverdb by hand.  What did you
> mean by that?
>
> Thanks,
>
> Mike
>
>
> On Fri, Nov 2, 2012 at 5:00 PM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>>
>>
>> On Fri, Nov 2, 2012 at 4:22 PM, Mike Dacre <mike.dacre at gmail.com> wrote:
>>
>>>  Hi Everyone,
>>>
>>> I am having a major issue I can't figure out.  When I start pbs_server I
>>> get the following error:
>>>
>>> PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag
>>>
>>> PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute
>>> resources_assigned
>>>
>>> I also find that any changes I make with qmgr are undone when I restart
>>> pbs_server and also pbs_server crashes when my users are using it.  There
>>> is nothing in the log, even at log level 7, it just dies.  It seems like
>>> the server can't write to the torque home directory (/var/spool/torque).
>>> When I start over with pbs_server -t create, the error goes away for a
>>> while.  Then after some number of restarts, the error is back.
>>>
>>> This is the third time this has happened, before this the queue at least
>>> restarted successfully.  This time, one of my queues just disappeared, and
>>> all of the jobs associated with it were deleted when the server was
>>> restarted.  This is a MAJOR problem, as it represents hours of lost time
>>> for my users.
>>>
>>> Part of the qmgr config disappeared.  Not all of it, just the default
>>> queue that was being used, and some of my changes to the server config.
>>>
>>> You can look at the attached log.  It is only log level 0, but you can
>>> see close to the top where I restarted the server and then all of this
>>> mayhem happened.  I should note that I made no changes to the server config
>>> before this restart.
>>>
>>> I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my
>>> hostname which totally throws it for a loop, and jobs just don't run) with
>>> maui 3.3.1.  It was compiled with the following options:
>>>
>>> ./configure --enable-blcr --enable-docs --enable-syslog
>>>
>>> The permissions of /var/spool/torque:
>>> drwxr-xr-x  13 root root 4.0K Oct 24 17:01 .
>>> drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
>>> drwxr-xr-x   2 root root 4.0K Oct 24 10:13 aux
>>> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 checkpoint
>>> drwxr-xr-x   2 root root 4.0K Oct 23 19:20 job_logs
>>> drwxr-xr-x   2 root root 4.0K Oct 30 00:01 mom_logs
>>> drwxr-x--x   3 root root 4.0K Oct 23 19:23 mom_priv
>>> -rw-r--r--   1 root root   66 Oct 23 21:07 pbs_environment
>>> drwxr-xr-x   2 root root 4.0K Oct 23 19:24 sched_logs
>>> drwxr-x---   3 root root 4.0K Oct 23 21:07 sched_priv
>>> drwxr-xr-x   2 root root 4.0K Oct 30 00:00 server_logs
>>> -rw-r--r--   1 root root   14 Oct 23 21:07 server_name
>>> drwxr-x---  13 root root 4.0K Oct 30 20:05 server_priv
>>> drwxrwxrwt   2 root root 4.0K Oct 24 10:13 spool
>>> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 undelivered
>>>
>>> output of qmgr -c 'p s':
>>>
>>> #
>>> # Create queues and set their attributes.
>>> #
>>> #
>>> # Create and define queue default
>>> #
>>> create queue default
>>> set queue default queue_type = Execution
>>> set queue default Priority = 0
>>> set queue default resources_max.neednodes = slave
>>> set queue default resources_default.neednodes = slave
>>> set queue default resources_default.nice = 0
>>> set queue default resources_available.ncpus = 160
>>> set queue default resources_available.neednodes = slave
>>> set queue default resources_available.nodes = 20
>>> set queue default max_user_run = 100
>>> set queue default enabled = True
>>> set queue default started = True
>>> #
>>> # Create and define queue long
>>> #
>>> create queue long
>>> set queue long queue_type = Execution
>>> set queue long Priority = -10
>>> set queue long max_running = 140
>>> set queue long resources_max.mem = 32gb
>>> set queue long resources_max.ncpus = 128
>>> set queue long resources_max.neednodes = slave
>>> set queue long resources_max.nodes = 16
>>> set queue long resources_min.cput = 02:00:01
>>> set queue long resources_default.mem = 2gb
>>> set queue long resources_default.neednodes = slave
>>> set queue long resources_default.nice = 15
>>> set queue long resources_available.mem = 600gb
>>> set queue long resources_available.ncpus = 128
>>> set queue long resources_available.neednodes = slave
>>> set queue long resources_available.nodes = 16
>>> set queue long enabled = True
>>> set queue long started = True
>>> #
>>> # Create and define queue high_priority
>>> #
>>> create queue high_priority
>>> set queue high_priority queue_type = Execution
>>> set queue high_priority Priority = 10000
>>> set queue high_priority resources_max.walltime = 56:00:00
>>> set queue high_priority resources_default.nice = -10
>>> set queue high_priority resources_default.walltime = 48:00:00
>>> set queue high_priority enabled = True
>>> set queue high_priority started = True
>>> #
>>> # Set server attributes.
>>> #
>>> set server acl_hosts = fraser-server
>>> set server default_queue = default
>>> set server log_events = 511
>>> set server mail_from = adm
>>> set server query_other_jobs = True
>>> set server resources_available.mem = 625gb
>>> set server resources_default.mem = 4gb
>>> set server scheduler_iteration = 600
>>> set server node_check_rate = 150
>>> set server tcp_timeout = 300
>>> set server job_stat_rate = 45
>>> set server poll_jobs = True
>>> set server mom_job_sync = True
>>> set server allow_node_submit = True
>>> set server next_job_number = 3301
>>> set server moab_array_compatible = True
>>>
>>> -Mike
>>>
>>
>> Mike,
>>
>> It looks like you have already figured out that you can repair the
>> serverdb file by hand.
>>
>> TORQUE 4.1.3 is available but it also has a problem with a hyphen in the
>> host name.
>>
>> Sorry I am not more help at the moment.
>>
>> Regards
>>
>> Ken
>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>
>