[torqueusers] Problems after upgrading Torque
raphael.leplae at ulb.ac.be
Thu Dec 13 02:22:54 MST 2012
We have upgraded Torque from 2.5.4 to 4.2.0. The scheduler is Moab 6.0.
Of the various problems we have encountered with the upgrade, some
persist: we cannot find their cause, and therefore have no solution.
1) Bursts of error messages in the logs
In the Torque log files, we get regular bursts of the following message:
12/13/2012 09:30:07;0080;PBS_Server.18604;Req;req_reject;Reject reply
code=15021(Invalid credential), aux=0, type=AuthenticateUser, from
<server name>, where <server name> is the host on which pbs_server and
moab are running.
The message is repeated ~140 times and the burst is repeated every 30 min.
It is also systematically present in the new log file started after log
rotation.
There are occasional similar error messages but the 'from' refers to
users on compute/login nodes instead of root on the host running pbs_server.
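To check the periodicity rather than eyeball it, the rejects can be grouped
into bursts. Below is a minimal Python sketch, assuming each log record sits
on a single line in the file (the message above is only wrapped by the mail
client) and starts with a timestamp followed by a semicolon. The function
name find_bursts and the 5-minute gap threshold are illustrative choices,
not anything provided by Torque.

```python
from datetime import datetime, timedelta


def find_bursts(lines, marker="code=15021", max_gap=timedelta(minutes=5)):
    """Group matching log records into bursts.

    Returns a list of (start_time, count) tuples. A new burst starts when
    the gap since the previous matching record exceeds max_gap.
    """
    bursts = []
    last = None
    for line in lines:
        if marker not in line:
            continue
        # Assumed layout: "MM/DD/YYYY HH:MM:SS;<code>;<daemon>;..."
        stamp = line.split(";", 1)[0].strip()
        try:
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # record does not start with a timestamp, skip it
        if last is None or when - last > max_gap:
            bursts.append([when, 0])
        bursts[-1][1] += 1
        last = when
    return [(start, count) for start, count in bursts]
```

Passing the lines of a server log to find_bursts should give start times
30 minutes apart with roughly 140 records each, if the pattern is as regular
as it looks.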
Note: with the upgrade, we found in the documentation that it was
necessary to run trqauthd. It is running on all the nodes: the control
node with pbs_server, the compute nodes (jobs can submit jobs) and the
login nodes. However, it regularly crashes on the compute nodes
(seemingly at random so far). Is there a way to get a log out of
trqauthd?
2) Odd job reporting from the compute nodes
Since the upgrade, we observe the following odd 'qstat -n -1 -r' output
(a sampling of the output):
845605.xxx.yyy  <user1>  smp1  my_method_with_c  --     1   1  2gb  3000:00:  R  --        nic95/0
853677.xxx.yyy  <user2>  mpi   I0z028a074        --     5  20  4gb  48:00:00  R  --
853809.xxx.yyy  <user2>  mpi   I0z036a094        20579  5  20  4gb  48:00:00  R  00:27:41
Note: hostname attached to the job ID and username are masked.
Essentially, we get the following combinations of information for the
running jobs (checked and properly running on the nodes):
- Expected output: SessID, resources asked/used, Requested walltime and
  Elapsed time are all reported.
- Only SessID is missing, a dash is given.
- SessID and Elapsed time are missing, a dash is given.
- SessID, Elapsed time and requested/used resources are missing, a dash
  is given.
There is no correlation with the user or the node.
We can even see two jobs from the same user, on the same compute node,
showing two different variants from the list above.
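To quantify how often each variant occurs, the rows can be classified by
which fields come back as dashes. This is a rough Python sketch, assuming
each job's row is on a single line, that job names contain no whitespace,
and that the columns are in the order job ID, user, queue, job name, SessID,
NDS, TSK, memory, requested time, state, elapsed time, optional node list.
The names classify_job and tally are made up for illustration.

```python
from collections import Counter


def classify_job(row):
    """Return the names of the fields that a job row reports as '--'."""
    fields = row.split()
    if len(fields) < 11:
        raise ValueError("expected at least 11 columns, got %d" % len(fields))
    # Assumed column positions, see the note above the code.
    named = {
        "SessID": fields[4],
        "Memory": fields[7],
        "ReqTime": fields[8],
        "Elapsed": fields[10],
    }
    return [name for name, value in named.items() if value == "--"]


def tally(rows):
    """Count how many job rows show each combination of missing fields."""
    return Counter(tuple(classify_job(row)) for row in rows)
```

Running tally over the job rows of a full 'qstat -n -1 -r' listing, taken
at a few different times, would show whether the proportions of each variant
are stable or drift over time.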
Solving these first two issues (which may be related) would be a start.
Any suggestions or help are welcome.