[torqueusers] Unknown reason: the pbs_server daemon was killed and cannot start
garrick at usc.edu
Wed Nov 12 22:26:56 MST 2008
On Thu, Nov 13, 2008 at 12:50:09PM +0800, Weiguang Chen alleged:
> Our Torque version is 2.3.13.
> Today the "qstat" command stopped working normally, and I found:
> [root@node1 init.d]# qstat
> Cannot connect to default server host 'node1' - check pbs_server daemon.
> qstat: cannot connect to server node1 (errno=111)
> Then I checked for the pbs_server daemon and found:
> [root@node1 init.d]# ps -ef|grep pbs
> root 3079 1 0 Sep16 ? 00:01:07 /usr/local/sbin/pbs_sched
> root 16571 5229 0 12:38 pts/21 00:00:00 grep pbs
> The pbs_server daemon had been killed for an unknown reason,
> and when I tried to start it again, this happened:
> [root@node1 init.d]# /usr/local/sbin/pbs_server
> pbs_server: svr_func.c:222: set_resc_assigned: Assertion `pjob->ji_qhdr->qu_qs.qu_type == 1' failed.
> What is the problem?
> What can I do?
I've never seen this happen before, but the code makes it pretty clear what
is happening: a routing queue has a resources_assigned attribute, which is
illegal.
Unless someone has a better idea, you will need to delete the queue definition
for the corrupt queue. cd into $PBS_SERVER_HOME/server_priv/queues/. These
are binary files, but you can grep them for the strings "Route" and
"resources_assigned": 'grep -l resources_assigned * | xargs grep -l Route'
should not print anything. If it does, move the file it names to another
directory. Once pbs_server starts, you can run 'strings' on that bad queue file
to see its settings and recreate the queue. Please save that file, gzip it, and
send it to me or the list; I'd like to see it.
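A minimal POSIX shell sketch of the check-and-move step described above, wrapped in a function so it can be reviewed before it is run. It only automates the grep pipeline from this message; the function name quarantine_route_queues and the queues.bad directory are hypothetical, it assumes GNU xargs (for -r), and it assumes queue file names contain no spaces (they are named after the queues).

```shell
#!/bin/sh
# Sketch only, not an official Torque tool: move aside queue files that
# mention both "resources_assigned" and "Route" so pbs_server can start.
# Hypothetical: the function name and the queues.bad quarantine directory.

# Usage: quarantine_route_queues SERVER_HOME   (run with pbs_server stopped)
quarantine_route_queues() {
    queue_dir="$1/server_priv/queues"
    quarantine="$1/server_priv/queues.bad"   # hypothetical location

    # Same test as in the mail: list files containing resources_assigned,
    # then keep those that also contain Route. grep -l prints only file
    # names, so it works on the binary queue files. xargs -r (GNU) skips
    # the second grep entirely when the first one matched nothing.
    bad=$(cd "$queue_dir" && grep -l resources_assigned -- * 2>/dev/null \
          | xargs -r grep -l Route --)

    if [ -z "$bad" ]; then
        echo "no suspect queue files"
        return 0
    fi

    mkdir -p "$quarantine"
    # Word splitting is intended here: one file name per word.
    for f in $bad; do
        echo "moving $f to $quarantine"
        mv -- "$queue_dir/$f" "$quarantine/"
    done
}
```

For example, with pbs_server stopped, run quarantine_route_queues "$PBS_SERVER_HOME" and then start /usr/local/sbin/pbs_server again. Once it is up, running strings on the moved file shows the queue's settings so the queue can be recreated with qmgr, and gzip -c on the same file produces a compressed copy to send to the list.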
Alternatively, if you have a backup copy of your server config, create a new
serverdb and restore your config from the backup. Don't forget to set your
next job number after recreating the serverdb.
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California