[torqueusers] Three more possible 2.5 (beta) bugs
glen.beane at gmail.com
Thu Jul 22 09:19:56 MDT 2010
On Thu, Jul 22, 2010 at 11:03 AM, Stuart Barkley <stuartb at 4gh.net> wrote:
> Sorry to do this after the release, but our system is very new and I'm
> still getting configurations under control. These possible issues are
> with torque-2.5-beta_0.20100702 (and Moab 5.4). Looking at diffs to
> 2.5.0 I don't see anything that looks like fixes.
> Problem 1: pbs_server crash:
> For a while I was seeing pbs_server crash each time moab was
> restarted. I was playing with moab REMAPCLASS and REMAPCLASSLIST
> configurations. With REMAPCLASS disabled pbs_server did not crash.
> guess: I had something queued which was being remapped upon moab
> restart which would crash pbs_server (the jobs do not get remapped).
> After restarting pbs_server things were okay and new jobs were
> remapped correctly.
> Additional note: There was a large array job with both running (~1500)
> and queued (~1000) tasks. There may have been some confusion when the
> remapping of the queued tasks was attempted.
> Some more notes: I've just seen another instance of this problem. If
> I submit several jobs quickly which need to be remapped pbs_server
> will die. If there is only a single job needing to be remapped when
> moab restarts pbs_server does not die and the remapping happens.
> It looks like pbs_server dies if multiple remaps happen either too
> quickly or simultaneously.
> Queuing a single array job does not crash pbs_server. I see the
> individual tasks get remapped over time.
> Problem 2: pbs_server lost array job subtasks.
> Somewhere while doing various server process restarts and
> configuration changes (as above) pbs_server seems to have lost details
> about all of the array job tasks. "qstat -a" still shows the main job
> but "qstat -a -t" doesn't show anything about the array job.
> The jobs still appear to be running but nothing seems to know about
> Problem 3: I'm still learning and this may not be a real bug.
> (I think) 2.4.7 was successfully copying output files from the compute
> nodes back to the user login node. After upgrading to 2.5 beta this
> stopped working correctly. The problem seemed to be that the hostname
> in Output_Path is now the name of the public network interface instead
> of the name of the private network. The compute nodes do not have any
> routing to the public network so were not able to scp the files back.
> It is possible I had some other configuration in 2.4.7 which made this
> work and I lost the configuration when I upgraded to 2.5-beta.
> My current work around has been to NAT the compute nodes to the public
> side, but I don't want to do this long term.
> I've just recently seen the pbs_mom man page update which describes
> spool_as_final_name. This looks like something useful in our site as
> the compute nodes do have shared user home directories. I'm still
> looking for a good BCP on how a current day HPC cluster should be
> Final notes:
> I am going to be out of the office all next week so I need to stop
> playing with this today and let things settle down for the users. I
> won't be able to do any further testing. When I return I hope to
> upgrade to the latest 2.5.X and adjust compute node configurations.
I'll look at issue #2. Someone at Adaptive will have to look at #1
since I don't have access to Moab at home where I do my TORQUE
For issue #3 you should be able to use the server_name server
attribute. Set it to the hostname assigned to your private network
interface in qmgr: qmgr -c "s s server_name = foo"
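
Spelled out, "s s" is qmgr's abbreviation for "set server". A minimal sketch of applying and then checking the setting follows. It assumes "foo" stands in for your private-network hostname, and the APPLY variable is a hypothetical guard added only for this example so that nothing changes on the server unless you ask for it:

```shell
#!/bin/sh
# Hypothetical private-interface hostname; replace "foo" with your own.
PRIVATE_NAME="foo"

# Long form of the abbreviated 's s server_name = foo' command.
SET_CMD="set server server_name = ${PRIVATE_NAME}"

if command -v qmgr >/dev/null 2>&1 && [ "${APPLY:-0}" = "1" ]; then
    # Apply the attribute, then show what the server now reports.
    qmgr -c "${SET_CMD}"
    qmgr -c "list server" | grep server_name
else
    # Dry run (default): only print the command that would be executed.
    echo "qmgr -c \"${SET_CMD}\""
fi
```

Run it as is to see the command, or with APPLY=1 on the pbs_server host to make the change. Whether newly submitted jobs then get the private name in Output_Path is something to confirm with "qstat -f" on a test job.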