[torqueusers] TORQUE 2.5 beta is here!
stuartb at 4gh.net
Wed Jul 7 16:50:24 MDT 2010
On Fri, 2 Jul 2010 at 12:51 -0000, David Beer wrote:
> We're pleased to announce that TORQUE 2.5 beta is ready for testing.
I've done a little testing of this today on our system. I note a
couple of issues:
- There were about 30K instances of a 75K-member job array still
queued when I did the upgrade. The pbs_server startup took 40 minutes
attempting to convert the old .JB files and then Moab did not seem to
see the array job at all.
Extract from /var/log/messages:
Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::No such file or directory (2) in pbsd_init, unable to read 3995.meadow-xcat-cn.fda.gov.AR
Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::No such file or directory (2) in pbsd_init, could not recover array-struct from file 3995.meadow-xcat-cn.fda.gov.AR--skipping. job array can not be recovered.
Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::job_recov, /var/spool/torque/server_priv/jobs/3995-296.meadow-xcat-cn.fda.gov.JB appears to be from an old version. Attempting to convert.
Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::job_qs_upgrade, backed up to /var/spool/torque/server_priv/jobs/3995-296.meadow-xcat-cn.fda.gov.BK
Followed by ~60K lines similar to the last two above.
Each restart of pbs_server took ~40 minutes with similar results.
I ended up reinitializing the job queue with "pbs_server -t cold"
(losing the remaining portion of the array job).
- Once I had a segfault:
Jul 7 18:19:31 meadow-xcat kernel: pbs_server: segfault at 0000000000000031 rip 00002aff3a7f45b8 rsp 00007fff313e5f80 error 4
This may have been when a down node came back online. I'll watch for
further instances of this.
- I would really like to see better packaging of Torque and Moab.
RPM build support would be an important addition (similar to the
recently announced Debian support). This would allow tracking version
deployment with the package management software.
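For the ~60K conversion lines mentioned in the first item, a per-category count of the pbs_server messages is easier to read than the raw log. What follows is only a rough sketch in Python, not part of TORQUE: the function names and category labels are made up for illustration, and the patterns are matched against the message text exactly as it appears in the extract above, which may differ in other versions.

```python
import re
from collections import Counter

# Matches syslog lines written by pbs_server in the format of the extract above.
LINE_RE = re.compile(r"PBS_Server: LOG_ERROR::(?P<msg>.*)$")

# A few lines copied from the extract, used as sample input.
SAMPLE = [
    "Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::No such file or directory (2) in pbsd_init, unable to read 3995.meadow-xcat-cn.fda.gov.AR",
    "Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::No such file or directory (2) in pbsd_init, could not recover array-struct from file 3995.meadow-xcat-cn.fda.gov.AR--skipping. job array can not be recovered.",
    "Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::job_recov, /var/spool/torque/server_priv/jobs/3995-296.meadow-xcat-cn.fda.gov.JB appears to be from an old version. Attempting to convert.",
    "Jul 7 11:31:38 meadow-xcat PBS_Server: LOG_ERROR::job_qs_upgrade, backed up to /var/spool/torque/server_priv/jobs/3995-296.meadow-xcat-cn.fda.gov.BK",
]


def classify(msg):
    """Map a message body to a coarse category (labels are illustrative)."""
    if "appears to be from an old version" in msg:
        return "convert_attempt"
    if "job_qs_upgrade, backed up to" in msg:
        return "backup_written"
    if "unable to read" in msg or "could not recover array-struct" in msg:
        return "array_file_missing"
    return "other"


def summarize(lines):
    """Count pbs_server LOG_ERROR lines per category; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[classify(match.group("msg"))] += 1
    return counts


# Print the categories found in the sample, most frequent first.
for kind, n in summarize(SAMPLE).most_common():
    print(f"{n:8d}  {kind}")
```

To run it against a real log, pass an open file object to summarize() instead of SAMPLE, for example the /var/log/messages file the extract came from.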