[torqueusers] Three more possible 2.5 (beta) bugs
stuartb at 4gh.net
Thu Jul 22 09:03:07 MDT 2010
Sorry to do this after the release, but our system is very new and I'm
still getting configurations under control. These possible issues are
with torque-2.5-beta_0.20100702 (and Moab 5.4). Looking at diffs to
2.5.0 I don't see anything that looks like fixes.
Problem 1: pbs_server crash:
For a while I was seeing pbs_server crash each time moab was
restarted. I was playing with moab REMAPCLASS and REMAPCLASSLIST
configurations. With REMAPCLASS disabled pbs_server did not crash.
My guess: I had something queued that was being remapped when moab
restarted, and this crashed pbs_server (the jobs do not get remapped).
After restarting pbs_server things were okay and new jobs were
Additional note: There was a large array job with both running (~1500)
and queued (~1000) tasks. There may have been some confusion when
remapping of the queued tasks was attempted.
Some more notes: I've just seen another instance of this problem. If
I submit several jobs quickly which need to be remapped pbs_server
will die. If there is only a single job needing to be remapped when
moab restarts pbs_server does not die and the remapping happens.
It looks like pbs_server dies if multiple remaps happen either too
quickly or simultaneously.
Queuing a single array job does not crash pbs_server. I see the
individual tasks get remapped over time.
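For anyone who wants to try reproducing Problem 1, the rapid-submission
case above could be sketched roughly as follows. This is only a sketch:
the queue name, job count, job names and the SUBMIT override are
hypothetical placeholders, not values from my setup, and it assumes the
chosen queue is one that moab is configured to remap.

```shell
#!/bin/sh
# Hypothetical reproduction sketch. QUEUE, COUNT and the job names are
# placeholders, not values from the original report.
QUEUE="${QUEUE:-batch}"   # assumed: a queue that moab remaps
COUNT="${COUNT:-5}"       # how many jobs to submit back to back
SUBMIT="${SUBMIT:-qsub}"  # set SUBMIT=echo for a dry run

submit_burst() {
    i=1
    while [ "$i" -le "$COUNT" ]; do
        # qsub reads the job script from stdin when no file is given.
        echo "sleep 60" | $SUBMIT -q "$QUEUE" -N "remap_test_$i"
        i=$((i + 1))
    done
}
```

Calling submit_burst and then checking whether pbs_server is still
running (for example with "qstat -a") would show whether the burst
triggers the crash. With SUBMIT=echo the function only prints the
arguments it would have passed to qsub.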
Problem 2: pbs_server lost array job subtasks.
Somewhere while doing various server process restarts and
configuration changes (as above) pbs_server seems to have lost details
about all of the array job tasks. "qstat -a" still shows the main job
but "qstat -a -t" doesn't show anything about the array job.
The jobs still appear to be running but nothing seems to know about
them.
Problem 3: I'm still learning and this may not be a real bug.
(I think) 2.4.7 was successfully copying output files from the compute
nodes back to the user login node. After upgrading to 2.5 beta this
stopped working correctly. The problem seemed to be that the hostname
in Output_Path is now the name of the public network interface instead
of the name of the private network. The compute nodes do not have any
routing to the public network so they were not able to scp the files back.
It is possible I had some other configuration in 2.4.7 which made this
work and I lost the configuration when I upgraded to 2.5-beta.
My current workaround has been to NAT the compute nodes to the public
side, but I don't want to do this long term.
I've just recently seen the pbs_mom man page update which describes
spool_as_final_name. This looks like something useful at our site as
the compute nodes do have shared user home directories. I'm still
looking for a good BCP on how a current day HPC cluster should be
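As a rough illustration of the spool_as_final_name idea, the setting
could be added to the pbs_mom config file on each compute node with a
small helper like the one below. This is a sketch under assumptions:
the helper name is made up, the config file path is supplied by the
caller (mom_priv/config under the TORQUE home directory is the usual
location, but check your install), and I have not tested this on our
cluster.

```shell
#!/bin/sh
# Hypothetical helper: append the setting to a pbs_mom config file.
# The caller passes the path; it is not hard-coded because the TORQUE
# home directory differs between installs.
enable_spool_as_final_name() {
    config_file="$1"
    # Only append the line if the parameter is not already set.
    if ! grep -q '^\$spool_as_final_name' "$config_file" 2>/dev/null; then
        printf '%s\n' '$spool_as_final_name true' >> "$config_file"
    fi
}
```

The helper is safe to run more than once against the same file.
pbs_mom would presumably need to reread its configuration (for example
by being restarted) before the change takes effect.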
I am going to be out of the office all next week so I need to stop
playing with this today and let things settle down for the users. I
won't be able to do any further testing. When I return I hope to
upgrade to the latest 2.5.X and adjust compute node configurations.