[torqueusers] Upgrade
Gary Bowling
gb at gbco.us
Mon Sep 20 13:56:17 MDT 2010
On 9/20/2010 10:20 AM, Ken Nielson wrote:
> On 09/20/2010 07:57 AM, Gary Bowling wrote:
>> I am currently running version 2.1.9 and am looking to upgrade. Is the
>> 2.5 chain ready for production or should I stick with 2.4.10 for now? My
>> primary reason for upgrading is to use the HA feature of the server. I
>> don't have any sophisticated scheduling or queue structures, just a
>> basic queue balancing jobs across 16 nodes.
>>
>> Thanks,
>>
>> Gary
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
> Gary,
>
> Either of these versions would be worth upgrading to. Both have the
> enhanced high availability option. 2.5 adds full-featured job array
> support, and 2.5.3 will also support munge, which gives you a second
> option for authorizing users on the cluster that does not require
> ruserok and privileged ports.
>
> If you have any more questions feel free to ask.
>
> Ken Nielson
> Adaptive Computing
>
As a follow-up, I just upgraded my cluster and am now seeing the messages
below in my mom logs. Any ideas? The jobs appear to be working OK, but I'm
not sure what "Unlink of job file failed" means or whether it's a problem.

Thanks,

Gary

09/20/2010 19:41:12;0080; pbs_mom;Job;1511555.hcpwxcl02;scan_for_terminated: job 1511555.hcpwxcl02 task 1 terminated, sid=30726
09/20/2010 19:41:12;0008; pbs_mom;Job;1511555.hcpwxcl02;job was terminated
09/20/2010 19:41:12;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
09/20/2010 19:41:12;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
09/20/2010 19:41:12;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
09/20/2010 19:41:12;0080; pbs_mom;Job;1511555.hcpwxcl02;obit sent to server
09/20/2010 19:41:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Permission denied (13) in job_purge, Unlink of job file failed
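
For anyone triaging similar output, here is a minimal Python sketch that
splits one of these semicolon-delimited records into fields and looks up the
errno number quoted in the message. The field labels (timestamp, event,
daemon, obj_class, obj_name, message) and both function names are my own
hypothetical choices for illustration; they are not taken from TORQUE.

```python
import errno
import re

# Hypothetical field labels; the log itself does not name its columns.
FIELDS = ("timestamp", "event", "daemon", "obj_class", "obj_name", "message")


def parse_mom_record(line):
    """Split one semicolon-delimited pbs_mom record into labelled fields."""
    # Split at most 5 times so a ';' inside the message text is preserved.
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != len(FIELDS):
        return None  # wrapped or truncated line, not a full record
    return dict(zip(FIELDS, parts))


def errno_name(message):
    """Return the symbolic name for an '(N)' errno in the message, if any."""
    match = re.search(r"\((\d+)\)", message)
    if match is None:
        return None
    return errno.errorcode.get(int(match.group(1)))


line = ("09/20/2010 19:41:21;0001; pbs_mom;Svr;pbs_mom;"
        "LOG_ERROR::Permission denied (13) in job_purge, "
        "Unlink of job file failed")
record = parse_mom_record(line)
print(record["event"], errno_name(record["message"]))
```

On Linux, errno 13 is EACCES. On POSIX systems, removing a file requires
write and search permission on the directory that contains it, so the
ownership and mode of the directory holding the job file may be worth
checking. That is only a guess at where to look, not a confirmed diagnosis.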