
TORQUE Resource Manager

1. Node state down

2. Missing Nodes

3. "Bad UID for job execution" error in job submissions

4. Assigning number of processors each node contains

5. TORQUE is running slow

 

1. Problem:

Node state down

When I run the command, pbsnodes -a, the state of all nodes is down.

Solutions:

1. Check to see if the moms are running on each compute node:

Type ps -ef|grep pbs_mom.

If no mom is running on the node, run pbs_mom.

Restart the pbs_server and check pbsnodes -a.

2. Check that the config file on each compute node points to the head node.

For each compute host, the MOM must be configured to trust the pbs_server daemon. In TORQUE 2.0.0p4 and earlier, this is done by creating the "$(TORQUECFG)/mom_priv/config" file and setting the $pbsserver parameter:

$pbsserver      headnode     # note: hostname running pbs_server
$logevent 255 # bitmap of which events to log

In TORQUE 2.0.0p5 and later, this can also be done by creating the "$(TORQUECFG)/server_name" file and placing the server hostname inside.

headnode

3. Check that the server can ping each mom by hostname and that each mom can ping the server by hostname.

4. If you have a firewall in place, make sure ports 15000 through 15004 are open.
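
The first checks above can be scripted. The following is a minimal sketch, not part of TORQUE: down_nodes and mom_status are hypothetical helper names. It assumes that pbsnodes -a prints each node name on a line of its own followed by indented "state = ..." attributes, and that ps -e -o comm= prints bare command names, as it does on Linux.

```shell
# List the nodes that pbsnodes reports as down.
# Reads `pbsnodes -a` output on stdin.
down_nodes() {
    awk '
        NF == 1 { node = $1 }    # a line holding a single word is a node name
        $1 == "state" && $2 == "=" && $3 ~ /down/ { print node }
    '
}

# Report whether pbs_mom appears in a list of command names read on stdin.
# grep -x matches whole lines only, so a line such as "grep pbs_mom"
# is never mistaken for the daemon.
mom_status() {
    if grep -qx 'pbs_mom'; then
        echo "running"
    else
        echo "not running"
    fi
}
```

On the head node, run pbsnodes -a | down_nodes to list the nodes still marked down. On a compute node, run ps -e -o comm= | mom_status; if passwordless ssh is set up, ssh node001 'ps -e -o comm=' | mom_status performs the same check remotely.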



2. Problem:

Missing Nodes

When I run pbsnodes -a, not all of my nodes are displayed.

Solutions:

1. Add a node by typing qmgr -c "create node (name)".

2. Check the nodes file on the head node, located at "$(TORQUECFG)/server_priv/nodes", to make sure all the nodes are listed.

If they are not, add the missing nodes, restart the pbs_server, and run pbsnodes -a again to check whether they are now displayed.

node001
node002
node003
node004

3. If a node is listed but its state is down, see problem 1 (Node state down).
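
To find out which entries of the nodes file pbs_server is not reporting, the two lists can be compared. This is a rough sketch under stated assumptions: missing_nodes is a hypothetical helper, the node name is taken to be the first word on each line of the nodes file, lines starting with # are skipped as a precaution, and pbsnodes -a is assumed to print each node name on a line of its own.

```shell
# Print the nodes named in the nodes file that are absent from a saved
# `pbsnodes -a` listing.
#   $1 = path to the nodes file
#   $2 = file holding the output of `pbsnodes -a`
missing_nodes() {
    nodes_file=$1
    listing=$2
    tmp=$(mktemp -d) || return 1

    # First word of every non-blank, non-comment line is a node name.
    awk 'NF && $1 !~ /^#/ { print $1 }' "$nodes_file" | sort -u > "$tmp/expected"

    # In the pbsnodes listing, node names are the single-word lines.
    awk 'NF == 1 { print $1 }' "$listing" | sort -u > "$tmp/reported"

    # comm -23 keeps the lines that appear only in the first (expected) file.
    comm -23 "$tmp/expected" "$tmp/reported"
    rm -rf "$tmp"
}
```

For example, save the listing with pbsnodes -a > pbsnodes.out, then run missing_nodes /path/to/server_priv/nodes pbsnodes.out, replacing the path with the nodes file location on your head node. No output means every listed node is being reported.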



3. Problem:

"Bad UID for job execution" error in job submissions

As root, I submit a job using the qsub command. I receive an error when doing so: "Bad UID for job execution."

Solution:

1. TORQUE does not allow job submissions from root. Switch to another user and submit the job again.

2. You must submit the job from a node that is allowed to submit jobs. By default, the head node can do this. To submit a job from another node, see the acl_hosts parameter.



4. Problem:

Assigning number of processors each node contains

My nodes have dual processors, but TORQUE only displays one processor.

Solution:

1. You must configure TORQUE to recognize the node as having more than one processor. You can do this in one of two ways.

Run the command qmgr -c "set node [name] np=[number of procs]".

or

In the "$(TORQUECFG)/server_priv/nodes" file, add np=[number of procs] on the same line as the node name, for example:

node001 np=2
node002 np=4
...

Restart the pbs_server and run pbsnodes -a again.
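
To confirm the processor counts that pbs_server now reports, the np attribute can be pulled out of the pbsnodes listing. This is a small sketch: node_procs is a hypothetical helper, and it assumes pbsnodes -a prints each node name on a line of its own followed by an indented "np = N" attribute.

```shell
# Print "<node> <np>" for every node in `pbsnodes -a` output read on stdin.
node_procs() {
    awk '
        NF == 1 { node = $1 }                       # single-word line: node name
        $1 == "np" && $2 == "=" { print node, $3 }  # processor count for that node
    '
}
```

Run it as pbsnodes -a | node_procs and compare the result with the values set in the nodes file.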


5. Problem:

TORQUE is running slow

Every once in a while, TORQUE suddenly slows down and requests take a long time to return.

Solution:

1. Check to see if a user is running qstat over and over. In most cases of reported TORQUE slowness, a script is being executed that runs qstat and greps for some kind of output. The script repeats the process until the expected output is found.

2. The same result can almost always be achieved another way. Try using job dependencies (the qsub -W depend option) or triggers, or look for a success message in the job's output file. Every qstat request takes time to process, and incessant qstat requests can slow down even very fast systems.
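
As an illustration of the dependency approach, the sketch below submits two jobs so that the second starts only after the first finishes successfully, with no qstat polling. submit_chain and the QSUB variable are hypothetical names introduced here, and the script names in the usage line are placeholders. It relies on qsub printing the new job ID on standard output.

```shell
# Command used to submit jobs; can be overridden, e.g. for a dry run.
QSUB="${QSUB:-qsub}"

# Submit two job scripts so that the second runs only if the first
# exits successfully.
#   $1 = first job script
#   $2 = job script that depends on the first
submit_chain() {
    # qsub prints the ID of the job it just queued.
    first_id=$("$QSUB" "$1") || return 1

    # afterok: start this job only after first_id has ended without error.
    "$QSUB" -W "depend=afterok:$first_id" "$2"
}
```

For example, submit_chain prepare.sh analyze.sh queues both jobs at once and leaves the ordering to TORQUE.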