[torqueusers] Upgrade

Ken Nielson knielson at adaptivecomputing.com
Mon Sep 20 14:17:36 MDT 2010


On 09/20/2010 01:56 PM, Gary Bowling wrote:
>    On 9/20/2010 10:20 AM, Ken Nielson wrote:
>    
>> On 09/20/2010 07:57 AM, Gary Bowling wrote:
>>      
>>> I am currently running version 2.1.9 and am looking to upgrade. Is the
>>> 2.5 chain ready for production or should I stick with 2.4.10 for now? My
>>> primary reason for upgrading is to use the HA feature of the server. I
>>> don't have any sophisticated scheduling or queue structures, just a
>>> basic queue balancing jobs across 16 nodes.
>>>
>>> Thanks,
>>>
>>> Gary
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>        
>> Gary,
>>
>> Either of these versions would be worth upgrading to. Both have the
>> enhanced high availability option. 2.5 adds full-featured job array
>> support, and 2.5.3 will also support munge, which gives you a second
>> option for authorizing users on the cluster that does not require
>> ruserok and privileged ports.
>>
>> If you have any more questions, feel free to ask.
>>
>> Ken Nielson
>> Adaptive Computing
>>
>>      
>
> As a follow-up, I just upgraded my cluster and am getting the messages
> below in my mom logs. Any ideas? The jobs appear to be working OK, but I
> am not sure what "Unlink of job file failed" means or whether it is a
> problem. Thanks, Gary
>
> 09/20/2010 19:41:12;0080;   pbs_mom;Job;1511555.hcpwxcl02;scan_for_terminated: job 1511555.hcpwxcl02 task 1 terminated, sid=30726
> 09/20/2010 19:41:12;0008;   pbs_mom;Job;1511555.hcpwxcl02;job was terminated
> 09/20/2010 19:41:12;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 09/20/2010 19:41:12;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 09/20/2010 19:41:12;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 09/20/2010 19:41:12;0080;   pbs_mom;Job;1511555.hcpwxcl02;obit sent to server
> 09/20/2010 19:41:21;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Permission denied (13) in job_purge, Unlink of job file failed
>
Gary,

Do you have old job files still around in the mom_priv/jobs directory
after this message? Specifically, any for job 1511555.

Ken
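
One way to run the check Ken asks for is sketched below. It is only an illustration: the directory /var/spool/torque/mom_priv/jobs is an assumed default location (use your own TORQUE home), and the function name leftover_job_files is made up for this example. The script lists anything in the jobs directory whose name begins with the job number, along with its permission bits and owning uid. It most likely needs to run as root on the compute node, since mom_priv is normally not readable by ordinary users.

```python
import stat
from pathlib import Path

# Assumption: default TORQUE spool location; change this to match your install.
JOBS_DIR = Path("/var/spool/torque/mom_priv/jobs")


def leftover_job_files(jobs_dir, job_number):
    """Return (name, octal mode, owner uid) for entries named '<job_number>.*'."""
    jobs_dir = Path(jobs_dir)
    prefix = f"{job_number}."
    try:
        entries = sorted(
            (p for p in jobs_dir.iterdir() if p.name.startswith(prefix)),
            key=lambda p: p.name,
        )
    except OSError:
        # Directory is missing or cannot be read by the current user.
        return []

    found = []
    for entry in entries:
        info = entry.lstat()
        found.append((entry.name, oct(stat.S_IMODE(info.st_mode)), info.st_uid))
    return found


if __name__ == "__main__":
    leftovers = leftover_job_files(JOBS_DIR, "1511555")
    if not leftovers:
        print("no leftover files found for job 1511555")
    for name, mode, uid in leftovers:
        print(f"{name}  mode={mode}  uid={uid}")
```

If files do show up, the ownership and mode of the jobs directory itself are worth a look as well: on POSIX systems, removing a file requires write permission on the directory that contains it, so a "Permission denied" from unlink often points at the directory rather than at the file.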

