10.1 Troubleshooting
A few general strategies can help determine the cause of unexpected behavior. The tools described in this section help locate where problems occur.
10.1.1 Host Resolution
The TORQUE server host must be able to perform both forward and reverse name lookup on itself and on all compute nodes. Likewise, each compute node must be able to perform forward and reverse name lookup on itself, the TORQUE server host, and all other compute nodes. In many cases, name resolution is handled by configuring the node's /etc/hosts file although DNS and NIS services may also be used. Commands such as nslookup or dig can be used to verify proper host resolution.
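The forward and reverse lookup requirement can be checked with a small script. The following is a minimal sketch, not part of TORQUE: it assumes getent is available (as on glibc-based systems), and check_resolution is a hypothetical helper name. It resolves a name to an address, resolves that address back to its names, and reports a mismatch if the original name is not among them.

```shell
# Sketch: verify forward and reverse lookup for one host name.
# Assumes getent(1) is available; check_resolution is a hypothetical helper.
check_resolution() {
    host=$1
    # Forward lookup: name -> first address returned.
    addr=$(getent hosts "$host" | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
        echo "forward lookup failed for $host" >&2
        return 1
    fi
    # Reverse lookup: address -> canonical name and aliases on one line.
    names=$(getent hosts "$addr" | awk '{for (i = 2; i <= NF; i++) printf "%s ", $i}')
    if [ -z "$names" ]; then
        echo "reverse lookup failed for $addr" >&2
        return 1
    fi
    # The name we started from should be among the names mapped back.
    for n in $names; do
        if [ "$n" = "$host" ]; then
            return 0
        fi
    done
    echo "$host resolves to $addr, but $addr maps back to: $names" >&2
    return 1
}
```

Running check_resolution for the server host and for each compute node, on every machine in the cluster, covers the combinations listed above. Whether a short name or a fully qualified name should be passed depends on how the nodes are named in the local configuration.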
10.1.2 Firewall Configuration
If firewalls are running on the server or node machines, be sure to allow connections on the appropriate ports for each machine. TORQUE pbs_mom daemons use UDP ports 1023 and below if privileged ports are configured (privileged ports are the default). The pbs_server and pbs_mom daemons use TCP and UDP ports 15001-15004 by default.
Firewall-based issues are often associated with server-to-mom communication failures and messages such as 'premature end of message' in the log files. The tcpdump program can also be used to verify that the correct network packets are being sent.

10.1.3 TORQUE Log Files
The pbs_server keeps a daily log of all activity in the $TORQUEHOME/server_logs directory. The pbs_mom also keeps a daily log of all activity in the $TORQUEHOME/mom_logs/ directory. These logs contain information on communication between server and mom as well as information on jobs as they enter the queue and as they are dispatched, run, and terminated. These logs can be very helpful in diagnosing general job failures.
For mom logs, the verbosity of the logging can be adjusted by setting the loglevel parameter in the mom_priv/config file. For server logs, the verbosity can be adjusted by setting the server log_level attribute in qmgr. For both pbs_mom and pbs_server daemons, the log verbosity level can also be adjusted by setting the environment variable PBSLOGLEVEL to a value between 0 and 7.
To change the log level of a running daemon dynamically, use the SIGUSR1 and SIGUSR2 signals to increase and decrease the active loglevel by one. Signals are sent to a process using the kill command. For example, kill -USR1 `pgrep pbs_mom` raises the log level by one. The current loglevel for pbs_mom can be displayed with the command momctl -d3.
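As an illustration of searching the daily logs, the sketch below counts how often a message appears in today's log file. It rests on assumptions not stated in this section: that daily log files are named by date in YYYYMMDD form, and that /var/spool/torque is a reasonable fallback when TORQUEHOME is unset. The helper names todays_log and count_in_log are hypothetical.

```shell
# Sketch: locate today's log file and count lines containing a message.
# Assumption: daily log files are named YYYYMMDD inside the log directory.
todays_log() {
    # usage: todays_log <log directory>
    echo "$1/$(date +%Y%m%d)"
}

count_in_log() {
    # usage: count_in_log <log file> <fixed string>
    # grep -c prints the number of matching lines; -F treats the
    # pattern as a literal string rather than a regular expression.
    grep -c -F -- "$2" "$1"
}

# Example invocation (left commented out so sourcing has no side effects):
# count_in_log "$(todays_log "${TORQUEHOME:-/var/spool/torque}/mom_logs")" \
#     'premature end of message'
```

A non-zero count for 'premature end of message' in the mom log would be one hint that the firewall settings described in 10.1.2 deserve a second look.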
10.1.4 Using tracejob to Locate Job Failures
Overview
The tracejob utility extracts job status and job events from accounting records, mom log files, server log files, and scheduler log files. Using it can help identify where, how, and why a job failed. The tool takes a job ID as a parameter, along with arguments that specify which logs to search, how far into the past to search, and other conditions.
Syntax
tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [ -n <DAYS>] [-f filter_type] <JOBID>
-p : path to PBS_SERVER_HOME
-w : number of columns of your terminal
-n : number of days in the past to look for job(s) [default 1]
-f : filter out types of log entries, multiple -f's can be specified
error, system, admin, job, job_usage, security, sched, debug,
debug2, or absolute numeric hex equivalent
-z : toggle filtering excessive messages
-c : what message count is considered excessive
-a : don't use accounting log files
-s : don't use server log files
-l : don't use scheduler log files
-m : don't use mom log files
-q : quiet mode - hide all error messages
-v : verbose mode - show more error messages
Example
> tracejob -n 10 1131
Job: 1131.icluster.org
03/02/2005 17:58:28 S enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28 S Job Queued at request of dev@icluster.org, owner =
dev@icluster.org, job name = STDIN, queue = batch
03/02/2005 17:58:28 A queue=batch
03/02/2005 17:58:41 S Job Run at request of dev@icluster.org
03/02/2005 17:58:41 M evaluating limits for job
03/02/2005 17:58:41 M phase 2 of job launch successfully completed
03/02/2005 17:58:41 M saving task (TMomFinalizeJob3)
03/02/2005 17:58:41 M job successfully started
03/02/2005 17:58:41 M job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41 A user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
qtime=1109811508 etime=1109811508 start=1109811521
exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11 M walltime 210 exceeded limit 100
03/02/2005 18:02:11 M kill_job
03/02/2005 18:02:11 M kill_job found a task to kill
03/02/2005 18:02:11 M sending signal 15 to task
03/02/2005 18:02:11 M kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11 M kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11 M kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11 M kill_job done
03/02/2005 18:04:11 M kill_job
03/02/2005 18:04:11 M kill_job found a task to kill
03/02/2005 18:04:11 M sending signal 15 to task
03/02/2005 18:06:27 M kill_job
03/02/2005 18:06:27 M kill_job done
03/02/2005 18:06:27 M performing job clean-up
03/02/2005 18:06:27 A user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
qtime=1109811508 etime=1109811508 start=1109811521
exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060
end=1109811987 Exit_status=265 resources_used.cput=00:00:00
resources_used.mem=3544kb resources_used.vmem=10632kb
resources_used.walltime=00:07:46
...
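When only the outcome of a job matters, the tracejob output can be filtered. The sketch below reads tracejob output on standard input and prints the last Exit_status value it finds. The second helper interprets that value under an assumption that is not stated in this section and should be verified for the TORQUE version in use: that a value of 256 or more means 256 plus the number of the signal that ended the job. Both helper names are hypothetical.

```shell
# Sketch: extract the final Exit_status from tracejob output on stdin.
exit_status_from_trace() {
    grep -o 'Exit_status=-*[0-9][0-9]*' | tail -n 1 | cut -d= -f2
}

# Assumption (verify locally): values >= 256 encode 256 + signal number.
describe_exit_status() {
    if [ "$1" -ge 256 ]; then
        echo "killed by signal $(( $1 - 256 ))"
    else
        echo "exit code $1"
    fi
}
```

For example, tracejob -n 10 1131 | exit_status_from_trace would print 265 for the job shown above, since that is the only Exit_status value in its accounting record.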
tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.

10.1.5 Using GDB to Locate Failures
If either pbs_mom or pbs_server fails unexpectedly (and the log files contain no information on the failure), gdb can be used to determine whether or not the program is crashing. To start pbs_mom or pbs_server under GDB, export the environment variable PBSDEBUG=yes and start the program under gdb (for example, run gdb pbs_mom and then issue the run subcommand at the gdb prompt). GDB may run for some time until a failure occurs, at which point a message is printed to the screen and the gdb prompt becomes available again. If this occurs, use the gdb where subcommand to determine the exact location in the code. The information provided may be adequate to allow local diagnosis and correction. If not, this output may be sent to the mailing list or to help for further assistance. (For more information on submitting bugs or requests for help, please see the Mailing List Instructions.)
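The interactive GDB session described above can also be scripted. The following sketch wraps it in a hypothetical helper, run_under_gdb, using gdb's -batch, -ex, and --args options so that the run and where subcommands are issued automatically. It is one possible arrangement, not a procedure prescribed by TORQUE.

```shell
# Sketch: run a daemon under gdb non-interactively and print a
# backtrace when it stops. run_under_gdb is a hypothetical helper.
run_under_gdb() {
    # PBSDEBUG=yes is set only for this gdb invocation.
    # -batch exits gdb after the listed commands have run;
    # each -ex supplies one gdb command ("run", then "where");
    # --args passes the remaining words as the program and its arguments.
    PBSDEBUG=yes gdb -batch -ex run -ex where --args "$@"
}
```

Calling run_under_gdb pbs_mom starts the daemon and, if it crashes, prints the stack trace that the where subcommand would show interactively. If the program exits normally there is no stack to report.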
10.1.6 Other Diagnostic Options
When PBSDEBUG is set, some client commands will print additional diagnostic information.
$ export PBSDEBUG=yes
$ cmd
To debug different kinds of problems, it can be useful to see where in the code time is being spent. This is called profiling, and the Linux utility gprof outputs a listing of routines and the amount of time spent in each. This requires that the code be compiled with special options that instrument it and produce a file, gmon.out, which is written at the end of program execution. The following listing shows how to build TORQUE with profiling enabled. Notice that the output file for pbs_mom will end up in the mom_priv directory because its startup code changes the working directory to this location.
# ./configure "CFLAGS=-pg -lgcov -fPIC"
# make -j5
# make install
# pbs_mom
... do some stuff for a while ...
# momctl -s
# cd /var/spool/torque/mom_priv
# gprof -b `which pbs_mom` gmon.out |less
#
Another way to see areas where a program is spending most of its time is with the valgrind program. The advantage of using valgrind is that the programs do not have to be specially compiled.
# valgrind --tool=callgrind pbs_mom

10.1.7 Stuck Jobs
If a job gets stuck in TORQUE, try these suggestions to resolve the issue.
10.1.8 Frequently Asked Questions (FAQ)
Cannot connect to server: error=15034
This error occurs in TORQUE clients (or their APIs) because TORQUE cannot find the server_name file and/or the PBS_DEFAULT environment variable is not set. The server_name file or PBS_DEFAULT variable indicates the hostname of the pbs_server that the client tools should communicate with. The server_name file is usually located in TORQUE's local state directory. Make sure the file exists, has proper permissions, and that the version of TORQUE you are running was built with the proper directory settings. Alternatively, you can set the PBS_DEFAULT environment variable. Restart the TORQUE daemons if you make changes to these settings.
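The checks described above can be sketched as a small script. The helper below, find_server_name, is hypothetical and makes two assumptions: that PBS_DEFAULT, when set, is preferred over the server_name file, and that /var/spool/torque is the local state directory unless another path is passed as the first argument.

```shell
# Sketch: report which server name a client would likely use.
# Assumptions: PBS_DEFAULT wins over server_name when both exist, and the
# state directory defaults to /var/spool/torque. Hypothetical helper.
find_server_name() {
    home=${1:-/var/spool/torque}
    if [ -n "${PBS_DEFAULT:-}" ]; then
        echo "$PBS_DEFAULT"
        return 0
    fi
    if [ -r "$home/server_name" ]; then
        cat "$home/server_name"
        return 0
    fi
    echo "no PBS_DEFAULT and no readable $home/server_name" >&2
    return 1
}
```

If the function fails, neither source of the server name is available to the current user, which matches the conditions under which error 15034 is described as occurring.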
Deleting 'Stuck' Jobs
To manually delete a stale job which has no process, and for which the mother superior is still alive, sending a sig 0 with qsig will often cause MOM to realize the job is stale and issue the proper JobObit notice. Failing that, use momctl -c to forcefully cause MOM to purge the job. The following process should never be necessary:
If the mother superior mom has been lost and cannot be recovered (e.g., because of hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command, or it can be removed manually using the following steps. To remove job X:
> qterm
> rm $TORQUEHOME/server_priv/jobs/X.SC $TORQUEHOME/server_priv/jobs/X.JB
> pbs_server

Which user must run TORQUE?
TORQUE (pbs_server & pbs_mom) must be started by a user with root privileges.
Scheduler cannot run jobs - rc: 15003
For a scheduler, such as Moab or Maui, to control jobs with TORQUE, the scheduler needs to be run by a user in the server operators / managers list (see qmgr (set server operators / managers)). The default for the server operators / managers list is root@localhost. For TORQUE to be used in a grid setting with Silver, the scheduler needs to be run as root.
PBS_Server: pbsd_init, Unable to read server database
If this message is displayed upon starting pbs_server, it means that the local database cannot be read. This can happen for several reasons. The most likely is a version mismatch. Most versions of TORQUE can read each other's databases. However, there are a few incompatibilities between OpenPBS and TORQUE. Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality). Also, a pbs_server compiled in 32-bit mode cannot read a database generated by a 64-bit pbs_server, and vice versa. To reconstruct a database (excluding the job database), first print out the old data with this command:
%> qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_host_enable = False
set queue batch resources_max.nodect = 6
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 18
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = griduser@oahu.icluster.org
set server managers += scott@*.icluster.org
set server managers += wightman@*.icluster.org
set server operators = griduser@oahu.icluster.org
set server operators += scott@*.icluster.org
set server operators += wightman@*.icluster.org
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 80
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
Copy this information somewhere. Restart pbs_server with the following command:
> pbs_server -t create
When it prompts to overwrite the previous database, enter 'y'. Then load the data exported by the qmgr command with a command similar to the following:
> cat data | qmgr
Restart pbs_server without the flags:
> qterm
> pbs_server
This will reinitialize the database to the current version. Note that reinitializing the server database will reset the next jobid to 1.

qsub will not allow the submission of jobs requesting many processors
TORQUE's definition of a node is context sensitive and can appear inconsistent. The qsub '-l nodes=<X>' expression can at times indicate a request for X processors and at other times be interpreted as a request for X nodes. While qsub allows multiple interpretations of the keyword nodes, aspects of the TORQUE server's logic are not so flexible. Consequently, if a job is using '-l nodes' to specify processor count and the requested number of processors exceeds the available number of physical nodes, the server daemon will reject the job.
To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute. To take effect, this attribute should be set on both the server and the associated queue as in the example below. See resources_available for more information.
> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048
qsub reports 'Bad UID for job execution'
[guest@login2]$ qsub test.job
qsub: Bad UID for job execution
Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted. In the example above, the host 'login2' is not configured to be trusted. This process is documented in Configuring Job Submission Hosts.

Why does my job keep bouncing from running to queued?
There are several reasons why a job will fail to start. Do you see any errors in the MOM logs? Be sure to increase the loglevel on MOM if you don't see anything. Also be sure TORQUE is configured with --enable-syslog and look in /var/log/messages (or wherever your syslog writes).
Also verify the following on all machines:
If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.

How do I use PVM with TORQUE?
The problem is that this setup allows the users to bypass the batch system by writing a job script that uses rsh/ssh to launch processes on the batch nodes. If there are relatively few users and they can more or less be trusted, this setup can work.

My build fails attempting to use the TCL library
TORQUE builds can fail on TCL dependencies even if a version of TCL is available on the system. TCL is only utilized to support the xpbsmon client. If your site does not use this tool (most sites do not use xpbsmon), you can work around this failure by rerunning configure with the --disable-gui argument.

My job will not start, failing with the message 'cannot send job to mom, state=PRERUN'
If a node crashes or other major system failures occur, it is possible that a job may be stuck in a corrupt state on a compute node. TORQUE 2.2.0 and higher automatically handle this when the mom_job_sync parameter is set via qmgr (the default). For earlier versions of TORQUE, set this parameter and restart the pbs_mom daemon. This error can also occur if not enough free space is available on the partition that holds TORQUE.

I want to submit and run jobs as root
While this can be a very bad idea from a security point of view, in some restricted environments it can be quite useful. It can be enabled by setting the acl_roots parameter via the qmgr command, as in the following example:
> qmgr -c 's s acl_roots+=root@*'

How do I determine what version of Torque I am using?
There are times when you want to find out what version of Torque you are using. An easy way to do this is to run the following command:
> qmgr -c "p s" | grep pbs_ver
© 2001-2010 Adaptive Computing Enterprises, Inc.