[torqueusers] Problems after upgrading Torque

Raphael Leplae raphael.leplae at ulb.ac.be
Thu Dec 13 02:22:54 MST 2012


Dear all,

We have upgraded Torque from 2.5.4 to 4.2.0. The scheduler is Moab 6.0.

Among the various problems we encountered with the upgrade, some persist
and we cannot find their cause, and therefore a possible solution.

Among them:

1) Bursts of error messages in the logs
In the Torque log files, we get regular bursts of the following message:
12/13/2012 09:30:07;0080;PBS_Server.18604;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=AuthenticateUser, from root@<server name>

where <server name> is the host on which pbs_server and Moab run.
Each burst contains ~140 copies of the message, and a new burst appears 
every 30 minutes. A burst is also systematically present in the new log 
file started after the midnight log rotation.
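
For reference, a minimal sketch of how we can measure the bursts from 
the server log (the log path is an example to adjust, and the line 
format is assumed to match the message shown above):

    #!/usr/bin/env python
    # Measure req_reject bursts in a PBS server log: group consecutive
    # "Invalid credential" rejects and report when each burst starts
    # and how many messages it contains.
    from datetime import datetime

    LOGFILE = "/var/spool/torque/server_logs/20121213"  # example path

    stamps = []
    with open(LOGFILE) as f:
        for line in f:
            if "req_reject" in line and "15021" in line:
                # The first semicolon-delimited field is the timestamp.
                stamps.append(datetime.strptime(line.split(";", 1)[0],
                                                "%m/%d/%Y %H:%M:%S"))

    bursts = []
    for ts in stamps:
        # Messages less than 60 s apart belong to the same burst.
        if bursts and (ts - bursts[-1][-1]).total_seconds() < 60:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])

    for burst in bursts:
        print("%s  %3d messages" % (burst[0], len(burst)))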

Occasionally we see similar messages where the 'from' field refers to 
users on compute/login nodes instead of root on the host running pbs_server.

Note: with the upgrade, we found in the documentation that it is 
necessary to run trqauthd. It is running on all the nodes: the control 
node with pbs_server, the compute nodes (jobs are allowed to submit 
jobs) and the login nodes. However, it regularly crashes on the compute 
nodes, apparently at random so far. Is there a way to get a log out of 
trqauthd?
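
In the meantime, a minimal supervisor sketch to at least record when 
trqauthd dies (the -D foreground flag and the log path are assumptions; 
check trqauthd --help on your build):

    #!/usr/bin/env python
    # Respawn trqauthd and record every exit, to see how often and when
    # it crashes. ASSUMPTION: trqauthd stays in the foreground when
    # given -D (verify with trqauthd --help on your build); if it
    # daemonizes, the call returns immediately and this loop is useless.
    import subprocess
    import time
    from datetime import datetime

    LOGFILE = "/var/log/trqauthd_supervisor.log"  # example path

    while True:
        with open(LOGFILE, "a") as log:
            log.write("%s starting trqauthd\n" % datetime.now())
            log.flush()
            # Send stderr to the same file to capture any crash output.
            rc = subprocess.call(["trqauthd", "-D"], stderr=log)
            log.write("%s trqauthd exited with code %d\n"
                      % (datetime.now(), rc))
        time.sleep(5)  # avoid a tight respawn loop if it dies at once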

2) Odd job reporting from the compute nodes
Since the upgrade, we observe the following odd 'qstat -n -1 -r' output 
(a sample, one line per job):
845605.xxx.yyy     <user1>    smp1     my_method_with_c    --      1     1    2gb 3000:00: R      --    nic95/0
853677.xxx.yyy     <user2>    mpi      I0z028a074          --      5    20    4gb 48:00:00 R      --    nic62/7+nic62/6+nic62/5+nic62/4+nic53/3+nic53/2+nic53/1+nic53/0+nic55/15+nic55/14+nic55/13+nic55/4+nic56/15+nic56/14+nic56/13+nic56/4+nic50/3+nic50/2+nic50/1+nic50/0
853809.xxx.yyy     <user2>    mpi      I0z036a094        20579     5    20    4gb 48:00:00 R 00:27:41   nic62/3+nic62/2+nic62/1+nic62/0+nic55/3+nic55/2+nic55/1+nic55/0+nic56/3+nic56/2+nic56/1+nic56/0+nic59/15+nic59/14+nic59/13+nic59/12+nic54/15+nic54/14+nic54/13+nic54/4

Note: the hostname attached to the job IDs and the usernames are masked.

Essentially, we get the following combinations of information for the 
running jobs (which we checked are properly running on the nodes):
- Expected output: SessID, requested/used resources, requested walltime 
and elapsed time are all present.
- Only the SessID is missing (a dash is shown).
- The SessID and the elapsed time are missing.
- The SessID, the elapsed time and the requested/used resources are 
missing.

There is no correlation with the user or the node: we can even have two 
jobs of the same user, on the same compute node, showing two different 
variants of the odd output described in the list above.
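
A rough sketch of how to tabulate which fields come back as dashes, per 
user and first allocated node, in case a pattern emerges (the column 
positions are assumed from the sample above):

    #!/usr/bin/env python
    # Tabulate which 'qstat -n -1 -r' fields come back as '--', per
    # user and per first allocated node. Column indices are assumed
    # from the sample above: 4 = SessID, 7 = req'd memory,
    # 8 = req'd walltime, 10 = elapsed time.
    import subprocess
    from collections import Counter

    out = subprocess.check_output(["qstat", "-n", "-1", "-r"])
    patterns = Counter()

    for line in out.decode().splitlines():
        fields = line.split()
        if len(fields) < 12 or fields[9] != "R":
            continue  # skip headers, blank lines, non-running jobs
        missing = tuple(name for name, idx in
                        (("SessID", 4), ("mem", 7),
                         ("walltime", 8), ("elapsed", 10))
                        if fields[idx] == "--")
        user = fields[1]
        first_node = fields[11].split("+")[0].split("/")[0]
        patterns[(user, first_node, missing)] += 1

    for (user, node, missing), count in sorted(patterns.items()):
        print("%s %s missing=%s count=%d"
              % (user, node, ",".join(missing) or "none", count))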

Solving these first two (possibly related) issues would be a start.

Any suggestion or help is welcome.

Cheers

