[torqueusers] PBS Server Becomes Corrupted after Several Restarts

Mike Dacre mike.dacre at stanford.edu
Fri Nov 2 18:09:07 MDT 2012


Please ignore this, everyone; it is a duplicate of another email I sent.

Sorry about that,

Mike

------------------------------------------------------------------------------
Michael D. Dacre

385 Serra Mall
Herrin Labs, Room 317
Stanford, California 94305

Cell:      (650) 308-4173
Phone:  (650) 723-1849
Email:   mike.dacre at stanford.edu
------------------------------------------------------------------------------



On Fri, Nov 2, 2012 at 2:34 PM, Michael Dacre <mike.dacre at stanford.edu> wrote:

> Hi Everyone,
>
> I am having a major issue I can't figure out.  When I start pbs_server I
> get the following error:
>
> PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag
>
> PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute
> resources_assigned
>
> I also find that any changes I make with qmgr are undone when I restart
> pbs_server, and pbs_server crashes while my users are using it.  There
> is nothing in the log, even at log level 7; it just dies.  It seems like
> the server can't write to the torque home directory (/var/spool/torque).
> When I start over with pbs_server -t create, the error goes away for a
> while.  Then, after some number of restarts, the error is back.
>
> At least once after restarting the server, the queue just disappeared.
> All running jobs were deleted from it.  No idea why.  Also, part of the
> qmgr config disappeared.  Not all of it, just the default queue that was
> being used, and some of my changes to the server config.
>
> I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my
> hostname, which totally throws it for a loop, and jobs just don't run) with
> maui 3.3.1.  It was compiled with the following options:
>
> ./configure --enable-blcr --enable-docs --enable-syslog
>
> The permissions of /var/spool/torque:
> drwxr-xr-x  13 root root 4.0K Oct 24 17:01 .
> drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
> drwxr-xr-x   2 root root 4.0K Oct 24 10:13 aux
> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 checkpoint
> drwxr-xr-x   2 root root 4.0K Oct 23 19:20 job_logs
> drwxr-xr-x   2 root root 4.0K Oct 30 00:01 mom_logs
> drwxr-x--x   3 root root 4.0K Oct 23 19:23 mom_priv
> -rw-r--r--   1 root root   66 Oct 23 21:07 pbs_environment
> drwxr-xr-x   2 root root 4.0K Oct 23 19:24 sched_logs
> drwxr-x---   3 root root 4.0K Oct 23 21:07 sched_priv
> drwxr-xr-x   2 root root 4.0K Oct 30 00:00 server_logs
> -rw-r--r--   1 root root   14 Oct 23 21:07 server_name
> drwxr-x---  13 root root 4.0K Oct 30 20:05 server_priv
> drwxrwxrwt   2 root root 4.0K Oct 24 10:13 spool
> drwxrwxrwt   2 root root 4.0K Oct 23 19:20 undelivered
>
> output of qmgr -c 'p s':
>
>
>