[torqueusers] PBS Server Becomes Corrupted after Several Restarts

Michael Dacre mike.dacre at stanford.edu
Fri Nov 2 15:34:14 MDT 2012


Hi Everyone,

I am having a major issue I can't figure out.  When I start pbs_server I get the following error:

PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag

PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute resources_assigned

I also find that any changes I make with qmgr are undone when I restart pbs_server, and pbs_server crashes while my users are using it.  Nothing appears in the log, even at log level 7; it just dies.  It seems as though the server can't write to the torque home directory (/var/spool/torque).  When I start over with pbs_server -t create, the error goes away for a while, but after some number of restarts it comes back.
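One way to test the write-access suspicion directly is to try creating a file in the directory as the same user pbs_server runs as.  The following is only a minimal sketch: can_write_dir is a hypothetical helper name, not part of torque, and it assumes a mktemp that accepts a template argument (GNU and BSD versions do).

```shell
# Hypothetical helper (not a torque tool): succeed only if the current
# user can create a file inside the directory given as $1.
can_write_dir() {
    dir="$1"
    # mktemp creates a uniquely named file; it fails if the directory
    # is missing or not writable by the current user.
    probe=$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null) || return 1
    # Clean up the probe file so nothing is left behind.
    rm -f "$probe"
    return 0
}
```

Run as root, something like "can_write_dir /var/spool/torque/server_priv && echo writable" would show whether a file can be created there at all.  It would not rule out other causes, such as a full disk at the moment the server saves its state.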

At least once after restarting the server, the queue just disappeared.  All running jobs were deleted from it.  No idea why.  Also, part of the qmgr config disappeared.  Not all of it, just the default queue that was being used, and some of my changes to the server config.

I am using torque 4.0.2 (I can't use 4.1.2 because I have a hyphen in my hostname which totally throws it for a loop, and jobs just don't run) with maui 3.3.1.  It was compiled with the following options:

./configure --enable-blcr --enable-docs --enable-syslog

The permissions of /var/spool/torque:
drwxr-xr-x  13 root root 4.0K Oct 24 17:01 .
drwxr-xr-x. 17 root root 4.0K Oct 23 19:20 ..
drwxr-xr-x   2 root root 4.0K Oct 24 10:13 aux
drwxrwxrwt   2 root root 4.0K Oct 23 19:20 checkpoint
drwxr-xr-x   2 root root 4.0K Oct 23 19:20 job_logs
drwxr-xr-x   2 root root 4.0K Oct 30 00:01 mom_logs
drwxr-x--x   3 root root 4.0K Oct 23 19:23 mom_priv
-rw-r--r--   1 root root   66 Oct 23 21:07 pbs_environment
drwxr-xr-x   2 root root 4.0K Oct 23 19:24 sched_logs
drwxr-x---   3 root root 4.0K Oct 23 21:07 sched_priv
drwxr-xr-x   2 root root 4.0K Oct 30 00:00 server_logs
-rw-r--r--   1 root root   14 Oct 23 21:07 server_name
drwxr-x---  13 root root 4.0K Oct 30 20:05 server_priv
drwxrwxrwt   2 root root 4.0K Oct 24 10:13 spool
drwxrwxrwt   2 root root 4.0K Oct 23 19:20 undelivered


output of qmgr -c 'p s':


------------------------------------------------------------------------------
Michael D. Dacre

385 Serra Mall
Herrin Labs, Room 317 
Stanford, California 94305 

Cell:   (650) 308-4173
Phone:  (650) 723-1849
Email:  mike.dacre at stanford.edu
------------------------------------------------------------------------------
