[torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs?

Lloyd Brown lloyd_brown at byu.edu
Thu Aug 2 14:07:47 MDT 2012


Just so the community knows what happened here:

I have, thus far, been unable to get the v4.1.x pbs_mom's to reacquire
the already-running jobs, and I have been unable to trace the exact
problem through the code myself.  I also asked Adaptive Computing, via
a support ticket, whether this is possible and what the likely cause of
the problem might be.  This was their answer, in part:

> It would be nice if this worked, but we're not supporting the migration of running jobs between major upgrades.

So, until I can figure out how to follow Torque code (which may take a
LONG time), I believe the answer is that this type of migration
(2.5.x->4.x) will require a full drain of the cluster.

I had hoped to avoid that, but right now, I just don't have the time to
fight this anymore.
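
For anyone who wants to repeat the test on a staging node, here is a
minimal Python sketch for snapshotting the job files before starting the
new pbs_mom, so that whatever gets deleted can be compared afterwards.
It only relies on the TORQUEHOME/mom_priv/jobs location and the
JB/SC/TK suffixes mentioned in the quoted message below.  The function
names are hypothetical, and I am assuming an entry may be either a
regular file or a directory, so both are handled.  It does not make the
4.x pbs_mom reacquire anything; it just preserves the evidence.

```python
import os
import shutil
from collections import defaultdict

# Suffixes named in the thread; any other entry in the directory is ignored.
JOB_SUFFIXES = (".JB", ".SC", ".TK")


def inventory_job_files(jobs_dir):
    """Map each job name (entry name minus suffix) to its sorted suffixes."""
    found = defaultdict(list)
    for name in os.listdir(jobs_dir):
        stem, suffix = os.path.splitext(name)
        if suffix in JOB_SUFFIXES:
            found[stem].append(suffix)
    return {stem: sorted(suffixes) for stem, suffixes in found.items()}


def snapshot_job_files(jobs_dir, backup_dir):
    """Copy every job entry into backup_dir and return what was copied."""
    os.makedirs(backup_dir, exist_ok=True)
    inventory = inventory_job_files(jobs_dir)
    for stem, suffixes in inventory.items():
        for suffix in suffixes:
            src = os.path.join(jobs_dir, stem + suffix)
            dst = os.path.join(backup_dir, stem + suffix)
            if os.path.isdir(src):
                # Assumption: some entries may be directories; copy them whole.
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)
    return inventory


def missing_after_upgrade(before, jobs_dir):
    """Return job names from `before` that no longer have any entry on disk."""
    after = inventory_job_files(jobs_dir)
    return sorted(stem for stem in before if stem not in after)
```

The idea is to call snapshot_job_files() on mom_priv/jobs while the old
pbs_mom is stopped, start the new one, and then call
missing_after_upgrade() with the returned inventory to see which jobs
lost their files.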

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 07/27/2012 09:29 AM, Lloyd Brown wrote:
> Well, so far, I have the pbs_server successfully upgrading and finding
> the jobs, and the pbs_mom's upgrading, but for some reason the pbs_mom's
> are ignoring the running processes, and deleting the
> TORQUEHOME/mom_priv/jobs/*.{JB,SC,TK} files.  They don't delete the jobs
> on the server, or kill the running PIDs, though, so it's a little confusing.
> 
> 
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 07/27/2012 09:26 AM, David Beer wrote:
>> Lloyd,
>>
>> If you don't use cpusets then the change to hwloc won't affect you at
>> all. You have to configure cpusets on in order to use them. At any rate
>> I'm interested to hear about the results of your test.
>>
>> David
>>
>> On Fri, Jul 27, 2012 at 9:22 AM, Lloyd Brown <lloyd_brown at byu.edu
>> <mailto:lloyd_brown at byu.edu>> wrote:
>>
>>     Interesting.  I didn't explicitly enable either one.  I don't know
>>     about the hwloc stuff in 4.x, but don't I have to explicitly compile
>>     in the cpusets code in 2.x?  I don't think I did, in this case.  Would
>>     that have an effect?
>>
>>     In any case, this is why I'm testing this on a staging cluster, not on
>>     our production system.  If I can pull it off, then we can upgrade
>>     without needing to drain the whole cluster.  If not, then we're no
>>     worse off than if I hadn't tried.
>>
>>     Lloyd Brown
>>     Systems Administrator
>>     Fulton Supercomputing Lab
>>     Brigham Young University
>>     http://marylou.byu.edu
>>
>>     On 07/26/2012 07:49 PM, Christopher Samuel wrote:
>>     > To be honest I wouldn't even try that, 4.x uses hwloc rather than
>>     > its own cpusets code so I don't believe it's worth the risk.
>>     _______________________________________________
>>     torqueusers mailing list
>>     torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
>>     http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>>
>>
>> -- 
>> David Beer | Software Engineer
>> Adaptive Computing
>>
>>
>>

