[torqueusers] PBS Server Becomes Corrupted after Several Restarts
Mike Dacre
mike.dacre at stanford.edu
Fri Nov 2 18:09:07 MDT 2012
Please ignore this, everyone; it is a duplicate of another email I sent.
Sorry about that,
Mike
------------------------------------------------------------------------------
Michael D. Dacre
385 Serra Mall
Herrin Labs, Room 317
Stanford, California 94305
Cell: (650) 308-4173
Phone: (650) 723-1849
Email: mike.dacre at stanford.edu
------------------------------------------------------------------------------
On Fri, Nov 2, 2012 at 2:34 PM, Michael Dacre <mike.dacre at stanford.edu> wrote:
> Hi Everyone,
>
> I am having a major issue I can't figure out. When I start pbs_server I
> get the following error:
>
> PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag
>
> PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute
> resources_assigned
>
> I also find that any changes I make with qmgr are undone when I restart
> pbs_server, and pbs_server crashes when my users are using it. There
> is nothing in the log, even at log level 7; it just dies. It seems like
> the server can't write to the torque home directory (/var/spool/torque).
> When I start over with pbs_server -t create, the error goes away for a
> while. Then, after some number of restarts, the error is back.
>
> At least once after restarting the server, the queue just disappeared.
> All running jobs were deleted from it. No idea why. Also, part of the
> qmgr config disappeared. Not all of it, just the default queue that was
> being used, and some of my changes to the server config.
>
> I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my
> hostname which totally throws it for a loop, and jobs just don't run) with
> maui 3.3.1. It was compiled with the following options:
>
> ./configure --enable-blcr --enable-docs --enable-syslog
>
> The permissions of /var/spool/torque:
> drwxr-xr-x 13 root root 4.0K Oct 24 17:01 .
> drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
> drwxr-xr-x 2 root root 4.0K Oct 24 10:13 aux
> drwxrwxrwt 2 root root 4.0K Oct 23 19:20 checkpoint
> drwxr-xr-x 2 root root 4.0K Oct 23 19:20 job_logs
> drwxr-xr-x 2 root root 4.0K Oct 30 00:01 mom_logs
> drwxr-x--x 3 root root 4.0K Oct 23 19:23 mom_priv
> -rw-r--r-- 1 root root 66 Oct 23 21:07 pbs_environment
> drwxr-xr-x 2 root root 4.0K Oct 23 19:24 sched_logs
> drwxr-x--- 3 root root 4.0K Oct 23 21:07 sched_priv
> drwxr-xr-x 2 root root 4.0K Oct 30 00:00 server_logs
> -rw-r--r-- 1 root root 14 Oct 23 21:07 server_name
> drwxr-x--- 13 root root 4.0K Oct 30 20:05 server_priv
> drwxrwxrwt 2 root root 4.0K Oct 24 10:13 spool
> drwxrwxrwt 2 root root 4.0K Oct 23 19:20 undelivered
>
> output of qmgr -c 'p s':
>
>
>
>
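The quoted message suspects that the server cannot write to the torque home directory (/var/spool/torque). One way to test that suspicion directly, rather than reading it off the permission listing, is to try creating a temporary file in each directory. The following is a minimal sketch, not part of Torque: the helper names can_write and report are made up for illustration, and the subdirectory names are taken from the listing in the quoted message. It only helps if run as the same user that pbs_server runs as.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a temporary file can be created and written in directory."""
    try:
        # The file is removed automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, permission denied, read-only or full filesystem.
        return False


def report(home, subdirs=("server_priv", "server_logs", "spool")):
    """Map the home directory and each listed subdirectory to a writable flag."""
    results = {}
    for name in [""] + list(subdirs):
        path = os.path.join(home, name) if name else home
        results[path] = os.path.isdir(path) and can_write(path)
    return results


if __name__ == "__main__":
    # Path taken from the quoted message; adjust if your torque home differs.
    for path, ok in report("/var/spool/torque").items():
        print(f"{path}: {'writable' if ok else 'NOT writable'}")
```

A directory reported as writable only rules out the simplest explanation. It says nothing about whether the server's saved state in server_priv is itself intact.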