[torqueusers] Three more possible 2.5 (beta) bugs
dbeer at adaptivecomputing.com
Thu Jul 22 09:21:57 MDT 2010
----- Original Message -----
> Sorry to do this after the release, but our system is very new and I'm
> still getting configurations under control. These possible issues are
> with torque-2.5-beta_0.20100702 (and Moab 5.4). Looking at diffs to
> 2.5.0 I don't see anything that looks like fixes.
> Problem 1: pbs_server crash:
> For a while I was seeing pbs_server crash each time moab was
> restarted. I was playing with moab REMAPCLASS and REMAPCLASSLIST
> configurations. With REMAPCLASS disabled pbs_server did not crash.
> Guess: I had something queued which was being remapped upon moab
> restart, which would crash pbs_server (the jobs do not get remapped).
> After restarting pbs_server things were okay and new jobs were
> remapped correctly.
> Additional note: There was a large array job with both running (~1500)
> and queued (~1000) tasks. There may have been some confusion when
> remapping of the queued tasks was attempted.
> Some more notes: I've just seen another instance of this problem. If
> I quickly submit several jobs which need to be remapped, pbs_server
> will die. If there is only a single job needing to be remapped when
> moab restarts, pbs_server does not die and the remapping happens.
> It looks like pbs_server dies if multiple remaps happen either too
> quickly or simultaneously.
> Queuing a single array job does not crash pbs_server. I see the
> individual tasks get remapped over time.
I'll try to reproduce these errors, but do you happen to have a core file or a backtrace from any of these crashes?
David Beer | Senior Software Engineer