TORQUE Resource Manager

TORQUE Administrator's Manual - 10.1 Troubleshooting

10.1 Troubleshooting

   There are a few general strategies that can be followed to diagnose unexpected behavior.  The following sections describe some of the tools available to help determine where problems occur.

10.1.1 Host Resolution

   The TORQUE server host must be able to perform both forward and reverse name lookup on itself and on all compute nodes.  Likewise, each compute node must be able to perform both forward and reverse name lookup on itself, the TORQUE server host, and all other compute nodes.  In many cases, name resolution is handled by configuring the node's /etc/hosts file, although DNS and NIS services may also be used.  Commands such as nslookup or dig can be used to verify proper host resolution.

NOTE: Invalid host resolution may show up as compute nodes reported as down in the output of pbsnodes -a and as failure of the momctl -d 3 command.
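
As a quick check of the requirement above, the following sketch performs a forward and then a reverse lookup for one host name using getent, which consults /etc/hosts, DNS, or NIS according to the node's name service configuration.  The function name check_host and the node names in the usage note are illustrative only; they are not part of TORQUE.

```shell
# check_host NAME: verify that NAME resolves to an address and that the
# address resolves back to NAME (as the canonical name or an alias).
check_host() {
    name=$1
    # Forward lookup: the first field of getent output is the address.
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "FAIL $name: no forward lookup"
        return 1
    fi
    # Reverse lookup: fields 2..NF are the canonical name and its aliases.
    if getent hosts "$addr" | awk -v want="$name" '
            { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
            END { exit !found }'; then
        echo "OK $name <-> $addr"
    else
        echo "FAIL $name: $addr does not resolve back to $name"
        return 1
    fi
}
```

The same loop would be run on the server host and on each compute node, for example: for h in node01 node02; do check_host "$h"; done.  A failure on the reverse step can also mean that the address maps to a different form of the name (for example, fully qualified rather than short).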

10.1.2 Firewall Configuration

   If firewalls are running on the server or node machines, be sure to allow connections on the appropriate ports for each machine.  TORQUE pbs_mom daemons use UDP port 1023 and the pbs_server/pbs_mom daemons use ports 15001-15004 by default.

   Firewall-based issues are often associated with server to mom communication failures and messages such as 'premature end of message' in the log files.

   Also, the tcpdump program can be used to verify that the correct network packets are being sent.
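
A minimal sketch of a reachability test for the default port range follows.  It assumes a netcat (nc) that supports the -z (connect without sending data) and -w (timeout in seconds) options, tests TCP only, and does not cover UDP port 1023.  The function name check_ports and the host name in the usage note are hypothetical.

```shell
# check_ports HOST: report which of the default TORQUE TCP ports
# (15001-15004) accept a connection on HOST.
check_ports() {
    host=$1
    for port in 15001 15002 15003 15004; do
        if nc -z -w 2 "$host" "$port" 2>/dev/null; then
            echo "$host:$port open"
        else
            echo "$host:$port closed or filtered"
        fi
    done
}
```

For example, check_ports headnode run from a compute node.  A port reported as closed is only a problem if the daemon that owns it is expected to be listening on that host.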


10.1.3 TORQUE Log Files

   The pbs_server keeps a daily log of all activity in the "<TORQUE_HOME_DIR>/server_logs/" directory.  The pbs_mom also keeps a daily log of all activity in the "<TORQUE_HOME_DIR>/mom_logs/" directory.  These logs contain information on communication between server and mom as well as information on jobs as they enter the queue and as they are dispatched, run, and terminated.  These logs can be very helpful in diagnosing general job failures.  For mom logs, the verbosity of the logging can be adjusted by setting the loglevel parameter in the mom_priv/config file.  For server logs, the verbosity of the logging can be adjusted by setting the server log_level attribute in qmgr.

   For both pbs_mom and pbs_server daemons, the log verbosity level can also be adjusted by setting the environment variable PBSLOGLEVEL to a value between 0 and 7.  Further, to dynamically change the log level of a running daemon, use the SIGUSR1 and SIGUSR2 signals to increase and decrease the active loglevel by one.  Signals are sent to a process using the kill command.  For example, kill -USR1 `pgrep pbs_mom` would raise the log level by one.  The current loglevel for pbs_mom can be displayed with the command momctl -d 3.
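
Because each signal changes the level by only one, raising it several steps means sending the signal several times.  The sketch below wraps that in a small helper; the name raise_loglevel is hypothetical, and pgrep -x is used here to match the daemon name exactly.

```shell
# raise_loglevel DAEMON [STEPS]: send SIGUSR1 to every running DAEMON
# process STEPS times (default 1); each signal raises the level by one.
raise_loglevel() {
    daemon=$1
    steps=${2:-1}
    pids=$(pgrep -x "$daemon") || {
        echo "$daemon is not running" >&2
        return 1
    }
    i=0
    while [ "$i" -lt "$steps" ]; do
        # $pids is left unquoted so that each PID is a separate argument.
        kill -USR1 $pids
        i=$((i + 1))
    done
}
```

For example, raise_loglevel pbs_mom 3 followed by momctl -d 3 to confirm the new level.  Substituting -USR2 for -USR1 lowers the level in the same way.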


10.1.4 Using tracejob to Locate Job Failures

Overview

   The tracejob utility extracts job status and job events from accounting records, mom log files, server log files, and scheduler log files.  Using it can help identify where, how, and why a job failed.  This tool takes a job id as a parameter as well as arguments to specify which logs to search, how far into the past to search, and other conditions.

Syntax

tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [ -n <DAYS>] [-f filter_type] <JOBID>

  -p : path to PBS_SERVER_HOME
  -w : number of columns of your terminal
  -n : number of days in the past to look for job(s) [default 1]
  -f : filter out types of log entries, multiple -f's can be specified
       error, system, admin, job, job_usage, security, sched, debug, 
       debug2, or absolute numeric hex equivalent
  -z : toggle filtering excessive messages
  -c : what message count is considered excessive
  -a : don't use accounting log files
  -s : don't use server log files
  -l : don't use scheduler log files
  -m : don't use mom log files
  -q : quiet mode - hide all error messages
  -v : verbose mode - show more error messages

Example

> tracejob -n 10 1131

Job: 1131.icluster.org

03/02/2005 17:58:28  S    enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28  S    Job Queued at request of dev@icluster.org, owner =
                          dev@icluster.org, job name = STDIN, queue = batch
03/02/2005 17:58:28  A    queue=batch
03/02/2005 17:58:41  S    Job Run at request of dev@icluster.org
03/02/2005 17:58:41  M    evaluating limits for job
03/02/2005 17:58:41  M    phase 2 of job launch successfully completed
03/02/2005 17:58:41  M    saving task (TMomFinalizeJob3)
03/02/2005 17:58:41  M    job successfully started
03/02/2005 17:58:41  M    job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                          qtime=1109811508 etime=1109811508 start=1109811521
                          exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11  M    walltime 210 exceeded limit 100
03/02/2005 18:02:11  M    kill_job
03/02/2005 18:02:11  M    kill_job found a task to kill
03/02/2005 18:02:11  M    sending signal 15 to task
03/02/2005 18:02:11  M    kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11  M    kill_job done
03/02/2005 18:04:11  M    kill_job
03/02/2005 18:04:11  M    kill_job found a task to kill
03/02/2005 18:04:11  M    sending signal 15 to task
03/02/2005 18:06:27  M    kill_job
03/02/2005 18:06:27  M    kill_job done
03/02/2005 18:06:27  M    performing job clean-up
03/02/2005 18:06:27  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                          qtime=1109811508 etime=1109811508 start=1109811521
                          exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060
                          end=1109811987 Exit_status=265 resources_used.cput=00:00:00
                          resources_used.mem=3544kb resources_used.vmem=10632kb
                          resources_used.walltime=00:07:46

...

NOTE: The tracejob command operates by searching the pbs_server accounting records and the pbs_server, mom, and scheduler logs.  To function properly, it must be run on a node and as a user which can access these files.  By default, these files are all accessible by the user root and only available on the cluster management node.  In particular, the files required by tracejob are located in the following directories:

  • $TORQUEHOME/server_priv/accounting
  • $TORQUEHOME/server_logs
  • $TORQUEHOME/mom_logs
  • $TORQUEHOME/sched_logs

   tracejob may only be used on systems where these files are made available.  Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.
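
When a job was killed for exceeding a limit, the relevant numbers are spread across the accounting (A) records.  The following sketch, which assumes output shaped like the example above, pulls out the exit status and the used and requested walltime.  The function name job_summary is illustrative and not part of TORQUE.

```shell
# job_summary: read tracejob output on stdin and print the exit status and
# the used and requested walltime taken from the key=value fields.
job_summary() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Only consider tokens of the form key=value.
            if (split($i, kv, "=") != 2) continue
            if (kv[1] == "Exit_status")             status = kv[2]
            if (kv[1] == "Resource_List.walltime")  limit  = kv[2]
            if (kv[1] == "resources_used.walltime") used   = kv[2]
        }
    }
    END { printf "exit=%s used=%s limit=%s\n", status, used, limit }'
}
```

Typical use is tracejob -n 10 1131 | job_summary.  Fields that do not appear in the input are printed as empty values.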


10.1.5 Using GDB to Locate Failures

   If either pbs_mom or pbs_server fails unexpectedly (and the log files contain no information on the failure), gdb can be used to determine whether or not the program is crashing.  To start pbs_mom or pbs_server under GDB, export the environment variable PBSDEBUG=yes and start the program (e.g., gdb pbs_mom and then issue the run subcommand at the gdb prompt).  GDB may run for some time until a failure occurs, at which point a message will be printed to the screen and a gdb prompt again made available.  If this occurs, use the gdb where subcommand to determine the exact location in the code.  The information provided may be adequate to allow local diagnosis and correction.  If not, this output may be sent to the mailing list or to help for further assistance.  (For more information on submitting bugs or requests for help, please see the Mailing List Instructions.)

NOTE: See the PBSCOREDUMP parameter for enabling creation of core files.
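
For unattended runs, the interactive steps above (run, then where) can be given on the gdb command line.  This is a sketch that assumes a GDB supporting the -batch, -ex, and --args options; the wrapper name backtrace_daemon is hypothetical.

```shell
# backtrace_daemon PROGRAM [ARGS...]: run PROGRAM under gdb with
# PBSDEBUG=yes and print a stack trace when the program stops.
backtrace_daemon() {
    PBSDEBUG=yes gdb -batch -ex run -ex where --args "$@"
}
```

For example, backtrace_daemon pbs_mom > /tmp/pbs_mom.trace 2>&1 captures the trace in a file that can be attached to a bug report.  If the program exits normally, gdb has no stack to print.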


10.1.6 Other Diagnostic Options

  • When PBSDEBUG is set, some client commands will print additional diagnostic information.

$ export PBSDEBUG=yes
$ cmd

Some hard problems in Torque deal with the amount of time spent in routines. For example, one currently open problem appears to be caused by the design of the code in linux/mom_mach.c where the statistics are gathered for the node status. It appears that the /proc filesystem, which contains information about the kernel and the processes, is being accessed so often on some machines that the responses to some other message traffic are affected. The machine where this is happening has 128 processors.

To debug these kinds of problems, it can be useful to see where in the code time is being spent. This is called profiling, and the Linux utility gprof will output a listing of routines and the amount of time spent in each. This requires that the code be compiled with special options that instrument it and produce a file, gmon.out, which is written at the end of program execution.

The following listing shows how to build Torque with profiling enabled. Notice that the output file for pbs_mom will end up in the mom_priv directory because its startup code changes the working directory to this location.

# ./configure "CFLAGS=-pg -lgcov -fPIC"
# make -j5
# make install
# pbs_mom
... do some stuff for a while ...
# momctl -s
# cd /var/spool/torque/mom_priv
# gprof -b `which pbs_mom` gmon.out |less
#

Another way to see areas where a program is spending most of its time is with the valgrind program. The advantage of using valgrind is that the programs do not have to be specially compiled.

# valgrind --tool=callgrind pbs_mom

10.1.7 Frequently Asked Questions (FAQ)


Cannot connect to server: error=15034

   This error occurs in TORQUE clients (or their APIs) because TORQUE cannot find the server_name file and/or the PBS_DEFAULT environment variable is not set. The server_name file or the PBS_DEFAULT variable indicates the hostname of the pbs_server that the client tools should communicate with. The server_name file is usually located in TORQUE's local state directory. Make sure the file exists, has proper permissions, and that the version of TORQUE you are running was built with the proper directory settings. Alternatively, you can set the PBS_DEFAULT environment variable. Restart TORQUE daemons if you make changes to these settings.
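
The two sources described above can be inspected with a short helper.  This is a sketch only: the function name server_for_clients is hypothetical, the default directory /var/spool/torque is an assumption about the local state directory and may differ on a given build, and the sketch assumes the environment variable is consulted before the file.

```shell
# server_for_clients [TORQUE_HOME]: print the pbs_server host name that
# client commands would be pointed at, or fail if neither source is usable.
server_for_clients() {
    home=${1:-/var/spool/torque}
    if [ -n "${PBS_DEFAULT:-}" ]; then
        echo "$PBS_DEFAULT (from PBS_DEFAULT)"
    elif [ -r "$home/server_name" ] && [ -s "$home/server_name" ]; then
        # The file is expected to hold the server host name on its first line.
        echo "$(head -n 1 "$home/server_name") (from $home/server_name)"
    else
        echo "no PBS_DEFAULT and no readable $home/server_name" >&2
        return 1
    fi
}
```

Running it as the same user who sees error 15034 also exercises the file permissions, since an unreadable server_name file is reported as missing.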


Deleting 'Stuck' Jobs

   To manually delete a stale job which has no process, and for which the mother superior is still alive, sending a sig 0 with qsig will often cause MOM to realize the job is stale and issue the proper JobObit notice.  Failing that, use momctl -c to forcefully cause MOM to purge the job.  The following process should never be necessary:
  • shut down the MOM on the mother superior node
  • delete all files and directories related to the job from "<TORQUEHOMEDIR>/mom_priv/jobs"
  • restart the MOM on the mother superior node.

   If the mother superior mom has been lost and cannot be recovered (e.g., hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command or can be removed manually using the following steps:

To remove job X:

  1. shut down pbs_server (qterm)
  2. remove job spool files (rm <TORQUEHOMEDIR>/server_priv/jobs/X.SC <TORQUEHOMEDIR>/server_priv/jobs/X.JB)
  3. restart pbs_server (pbs_server)

Which user must run TORQUE?

   TORQUE (pbs_server & pbs_mom) must be started by a user with root privileges.


Scheduler cannot run jobs - rc: 15003

   For a scheduler, such as Moab or Maui, to control jobs with TORQUE, the scheduler needs to be run by a user in the server operators / managers list (see qmgr (set server operators / managers)). The default for the server operators / managers list is root@localhost. For TORQUE to be used in a grid setting with Silver, the scheduler needs to be run as root.


PBS_Server: pbsd_init, Unable to read server database

   If this message is displayed upon starting pbs_server, it means that the local database cannot be read.  This can be for several reasons.  The most likely is a version mismatch.  Most versions of TORQUE can read each other's databases.  However, there are a few incompatibilities between OpenPBS and TORQUE.  Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality).  Also, a pbs_server compiled in 32 bit mode cannot read a database generated by a 64 bit pbs_server and vice versa.

   To reconstruct a database (excluding the job database), first print out the old data with this command:

%> qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_host_enable = False
set queue batch resources_max.nodect = 6
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 18
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = griduser@oahu.icluster.org
set server managers += scott@*.icluster.org
set server managers += wightman@*.icluster.org
set server operators = griduser@oahu.icluster.org
set server operators += scott@*.icluster.org
set server operators += wightman@*.icluster.org
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 80
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6

   Save this output to a file (the example below assumes a file named data). Restart pbs_server with the following command:

> pbs_server -t create

   When it prompts to overwrite the previous database, enter 'y', then enter the data exported by the qmgr command with a command similar to the following:

> cat data | qmgr

   Restart pbs_server without the -t create flag:

> qterm
> pbs_server

   This will reinitialize the database to the current version.  Note that reinitializing the server database will reset the next jobid to 1.


qsub will not allow the submission of jobs requesting many processors

   TORQUE's definition of a node is context sensitive and can appear inconsistent.  The qsub '-l nodes=<X>' expression can at times indicate a request for X processors and at other times be interpreted as a request for X nodes.  While qsub allows multiple interpretations of the keyword nodes, aspects of the TORQUE server's logic are not so flexible.  Consequently, if a job is using '-l nodes' to specify processor count and the requested number of processors exceeds the available number of physical nodes, the server daemon will reject the job.

   To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute.  To take effect, this attribute should be set on both the server and the associated queue as in the example below.  See resources_available for more information.

> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048

NOTE: The pbs_server daemon will need to be restarted before these changes will take effect.


qsub reports 'Bad UID for job execution'

[guest@login2]$ qsub test.job
qsub: Bad UID for job execution

   Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted.  In the example above, the host 'login2' is not configured to be trusted.  See Configuring Job Submission Hosts for a description of how this configuration is done.


Why does my job keep bouncing from running to queued?

   There are several reasons why a job will fail to start.  Do you see any errors in the MOM logs?  Be sure to increase the loglevel on MOM if you don't see anything.  Also be sure TORQUE is configured with --enable-syslog and look in /var/log/messages (or wherever your syslog writes).

  Also verify the following on all machines:

  • DNS resolution works correctly with matching forward and reverse lookups
  • time is synchronized across the head and compute nodes
  • user accounts exist on all compute nodes
  • user home directories can be mounted on all compute nodes
  • prologue scripts (if specified) exit with 0

   If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.
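
Two of the checks in the list above (the account exists, and its home directory is present) can be scripted and run on each compute node.  The following is a minimal sketch; the function name check_account and the user name in the usage note are illustrative only.

```shell
# check_account USER: confirm that USER exists on this node and that the
# home directory recorded for the account is present.
check_account() {
    user=$1
    entry=$(getent passwd "$user") || {
        echo "FAIL $user: no such account"
        return 1
    }
    # Field 6 of a passwd entry is the home directory.
    home=$(printf '%s\n' "$entry" | cut -d: -f6)
    if [ -d "$home" ]; then
        echo "OK $user: $home"
    else
        echo "FAIL $user: home directory $home is missing"
        return 1
    fi
}
```

For example, check_account dev on every node in the job's exec_host list.  On sites that automount home directories, the directory test itself may trigger the mount, so a failure here points at the mount configuration rather than the account.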


How do I use PVM with TORQUE?

  • Start the master pvmd on a compute node and then add the slaves
  • mpiexec can be used to launch slaves using rsh or ssh (use export PVM_RSH=/usr/bin/ssh to use ssh)
NOTE: Access can be managed by allowing rsh/ssh without passwords between the batch nodes while denying it from anywhere else, including the interactive nodes.  This can be done with xinetd and sshd configuration (root is allowed to ssh everywhere, of course).  This way, the pvm daemons can be started and killed from the job script.

   The problem is that this setup allows the users to bypass the batch system by writing a job script that uses rsh/ssh to launch processes on the batch nodes.  If there are relatively few users and they can more or less be trusted, this setup can work.


My build fails attempting to use the TCL library

   TORQUE builds can fail on TCL dependencies even if a version of TCL is available on the system.  TCL is only utilized to support the xpbsmon client.  If your site does not use this tool (most sites do not use xpbsmon), you can work around this failure by rerunning configure with the --disable-gui argument.

My job will not start, failing with the message 'cannot send job to mom, state=PRERUN'

   If a node crashes or other major system failures occur, it is possible that a job may be stuck in a corrupt state on a compute node.  TORQUE 2.2.0 and higher automatically handle this when the mom_job_sync parameter is set via qmgr (the default).  For earlier versions of TORQUE, set this parameter and restart the pbs_mom daemon.

I want to submit and run jobs as root

   While this can be a very bad idea from a security point of view, in some restricted environments it can be quite useful.  It can be enabled by setting the acl_roots parameter via the qmgr command, as in the following example:

> qmgr -c 's s acl_roots+=root@*'


How do I determine what version of Torque I am using?

  There are times when you want to find out what version of Torque you are using. An easy way to do this is to run the following command:

> qmgr -c "p s"|grep pbs_ver