[torqueusers] Three more possible 2.5 (beta) bugs

Glen Beane glen.beane at gmail.com
Thu Jul 22 09:19:56 MDT 2010


On Thu, Jul 22, 2010 at 11:03 AM, Stuart Barkley <stuartb at 4gh.net> wrote:
> Sorry to do this after the release, but our system is very new and I'm
> still getting configurations under control.  These possible issues are
> with torque-2.5-beta_0.20100702 (and Moab 5.4).  Looking at the diffs
> against 2.5.0, I don't see anything that looks like a fix for these.
>
> Problem 1: pbs_server crash:
>
> For a while I was seeing pbs_server crash each time moab was
> restarted.  I was playing with moab REMAPCLASS and REMAPCLASSLIST
> configurations.  With REMAPCLASS disabled pbs_server did not crash.
>
> Guess: I had something queued which was being remapped upon moab
> restart, and that would crash pbs_server (the jobs do not get remapped).
> After restarting pbs_server things were okay and new jobs were
> remapped correctly.
>
> Additional note: There was a large array job with both running (~1500)
> and queued (~1000) tasks.  There may have been some confusion when an
> attempt was made to remap the queued tasks.
>
> Some more notes: I've just seen another instance of this problem.  If
> I quickly submit several jobs which need to be remapped, pbs_server
> will die.  If there is only a single job needing to be remapped when
> moab restarts, pbs_server does not die and the remapping happens.
>
> It looks like pbs_server dies if multiple remaps happen either too
> quickly or simultaneously.
>
> Queuing a single array job does not crash pbs_server.  I see the
> individual tasks get remapped over time.
>
> Problem 2:  pbs_server lost array job subtasks.
>
> At some point during the various server process restarts and
> configuration changes (as above), pbs_server seems to have lost details
> about all of the array job tasks.  "qstat -a" still shows the main job,
> but "qstat -a -t" doesn't show anything about the array job.
>
> The jobs still appear to be running but nothing seems to know about
> them.
>
> Problem 3: I'm still learning and this may not be a real bug.
>
> (I think) 2.4.7 was successfully copying output files from the compute
> nodes back to the user login node.  After upgrading to 2.5 beta this
> stopped working correctly.  The problem seemed to be that the hostname
> in Output_Path is now the name of the public network interface instead
> of the name of the private network.  The compute nodes do not have any
> routing to the public network, so they were not able to scp the files
> back.
>
> It is possible I had some other configuration in 2.4.7 which made this
> work and that I lost it when I upgraded to 2.5-beta.
>
> My current workaround has been to NAT the compute nodes to the public
> side, but I don't want to do this long term.
>
> I've just recently seen the pbs_mom man page update which describes
> spool_as_final_name.  This looks like something useful at our site,
> since the compute nodes do have shared user home directories.  I'm
> still looking for a good BCP on how a present-day HPC cluster should
> be set up.
>
> Final notes:
>
> I am going to be out of the office all next week so I need to stop
> playing with this today and let things settle down for the users.  I
> won't be able to do any further testing.  When I return I hope to
> upgrade to the latest 2.5.X and adjust compute node configurations.
>
> Stuart



I'll look at issue #2.  Someone at Adaptive will have to look at #1,
since I don't have access to Moab at home where I do my TORQUE
development.

For issue #3 you should be able to use the server_name server
attribute.  Set it to the hostname assigned to your private network
interface in qmgr:   qmgr -c "s s server_name = foo"
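
Something along these lines should do it; the hostname below is just a
placeholder for whatever name the compute nodes can resolve on your
private network, and the last two commands are only a quick way to
check the result on a newly submitted job:

  qmgr -c "set server server_name = cluster-private.example.com"
  qmgr -c "print server" | grep server_name

  # submit a trivial job and confirm Output_Path now carries the
  # private hostname
  echo "hostname" | qsub
  qstat -f <jobid> | grep Output_Path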

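For the spool_as_final_name option you mention, a mom_priv/config on
the compute nodes along these lines is roughly what I'd expect to work
(untested on my end; the $usecp mapping is a separate suggestion of
mine, not something from this thread, and the domain and paths are
placeholders for your site):

  # write output directly to its final destination over the shared
  # home directories instead of spooling it locally
  $spool_as_final_name true

  # alternatively, if output is still copied after the job, map the
  # server-side path back to the locally mounted home directories so
  # no scp over the public network is needed
  $usecp *.example.com:/home /home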
