[torqueusers] torque 4.2.6 server Undefined attribute

David Beer dbeer at adaptivecomputing.com
Wed Nov 20 17:05:31 MST 2013


Martin,

We're sorry this wasn't caught beforehand. If you apply this patch to 4.2.6
it will work with the older moms:

4f9245b05bb0a296bbfacfca68c6807c6ddb1c39
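
To see which nodes are affected before patching, the failure can be picked out of the pbs_server log. What follows is a minimal Python sketch, not part of TORQUE. It assumes the semicolon-delimited record layout and the "send of job to <node> failed error = <code>" message seen in the log quoted below. The function names failed_sends and nodes_by_failure_count are hypothetical helpers made up for this illustration.

```python
import re
from collections import Counter

# Message text as it appears in the log quoted in this thread:
#   "send of job to gpu-1-4 failed error = 15002"
SEND_FAILED = re.compile(r"send of job to (\S+) failed error = (\d+)")


def failed_sends(log_lines, error_code=15002):
    """Yield (timestamp, job_id, node) for each failed send with error_code.

    Assumes one record per line with six semicolon-separated fields:
    timestamp;event;daemon.pid;class;name;message
    """
    for line in log_lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # skip wrapped or incomplete records
        timestamp, _event, _daemon, _cls, name, message = fields
        match = SEND_FAILED.search(message)
        if match and int(match.group(2)) == error_code:
            yield timestamp, name, match.group(1)


def nodes_by_failure_count(log_lines, error_code=15002):
    """Count failed sends per node so mismatched MOMs stand out."""
    return Counter(node for _, _, node in failed_sends(log_lines, error_code))
```

Passing an open log file (the path depends on the local installation) to nodes_by_failure_count gives a per-node tally. Nodes that show up there are likely still running the older MOM.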


On Wed, Nov 20, 2013 at 3:53 PM, Martin Siegert <siegert at sfu.ca> wrote:

> We ran into exactly the same problem: we needed to upgrade the server
> right away because of the security hole (CVE-2013-4495). We then
> planned rolling updates on all compute nodes. As soon as we had
> updated the server, the moms on the compute nodes died with segmentation
> faults :-(
> Because of the security hole, updating the server last is not really
> an option.
>
> Cheers,
> Martin
>
> On Wed, Nov 20, 2013 at 03:37:40PM -0700, Ken Nielson wrote:
> >
> >    If you are doing rolling upgrades keep your pbs_server at the earlier
> >    version and upgrade the pbs_moms to 4.2.6. After all of your MOMs have
> >    upgraded then move pbs_server to 4.2.6.
> >    This has been fixed and will be available in all upcoming TORQUE
> >    releases.
> >    Regards
> >
> >    On Wed, Nov 20, 2013 at 3:01 PM, Rick McKay
> >    <[1]rmckay at adaptivecomputing.com> wrote:
> >
> >    Eva,
> >    That's a defect. As soon as you upgrade your MOMs to 4.2.6, they'll
> >    start jobs. It's marked for correction in 4.2.7. Here's the changeset
> >    hash:
> >    345daa2..e3fb235 HEAD -> 4.2-dev
> >    Rick McKay | Technical Support Engineer
> >    Adaptive Computing
> >
> >    On Wed, Nov 20, 2013 at 2:54 PM, Eva Hocks <[2]hocks at sdsc.edu> wrote:
> >
> >      torque server 4.2.6 cannot start jobs on moms running 4.2.5 due to
> >      Undefined attribute  (15002) in send_job_work? Is this an expected
> >      behavior?
> >      11/20/2013 13:42:07;0040;PBS_Server.17972;Req;set_nodes;allocating nodes for job 206074.mskcc-fe1.local with node expression 'gpu-1-4:ppn=10'
> >      11/20/2013 13:42:07;0040;PBS_Server.17972;Req;set_nodes;job 206074.mskcc-fe1.local allocated 1 nodes (nodelist=gpu-1-4/0+gpu-1-4/1+gpu-1-4/2+gpu-1-4/3+gpu-1-4/4+gpu-1-4/5+gpu-1-4/6+gpu-1-4/7+gpu-1-4/8+gpu-1-4/9)
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;206074.mskcc-fe1.local;Job Run at request of root at mskcc-fe1.local
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;svr_setjobstate;svr_setjobstate: setting job 206074.mskcc-fe1.local state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;206074.mskcc-fe1.local;send of job to gpu-1-4 failed error = 15002
> >      11/20/2013 13:42:07;0001;PBS_Server.17972;Svr;PBS_Server;LOG_ERROR::Undefined attribute  (15002) in send_job_work, child failed in previous commit request for job 206074.mskcc-fe1.local
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;206074.mskcc-fe1.local;entering finish_sendmom
> >      11/20/2013 13:42:07;0002;PBS_Server.17972;Job;206074.mskcc-fe1.local;child reported failure for job after 0 seconds (dest=???), rc=-1
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;206074.mskcc-fe1.local;unable to run job, MOM rejected/rc=-1
> >      11/20/2013 13:42:07;0040;PBS_Server.17972;Req;free_nodes;freeing nodes for job 206074.mskcc-fe1.local
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;svr_setjobstate;svr_setjobstate: setting job 206074.mskcc-fe1.local state from RUNNING-TRNOUT to QUEUED-QUEUED (1-10)
> >      11/20/2013 13:42:07;0040;PBS_Server.17972;Req;free_nodes;freeing nodes for job 206074.mskcc-fe1.local
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;206074.mskcc-fe1.local;unable to run job, send to MOM '183508983' failed
> >      11/20/2013 13:42:07;0008;PBS_Server.17972;Job;svr_setjobstate;svr_setjobstate: setting job 206074.mskcc-fe1.local state from QUEUED-QUEUED to QUEUED-QUEUED (1-10)
> >      Thanks
> >      Eva
> >      _______________________________________________
> >      torqueusers mailing list
> >      [3]torqueusers at supercluster.org
> >      [4]http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >    --
> >    Ken Nielson
> >    +1 801.717.3700 office +1 801.717.3738 fax
> >    1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> >    [7]www.adaptivecomputing.com
> >
> > References
> >
> >    1. mailto:rmckay at adaptivecomputing.com
> >    2. mailto:hocks at sdsc.edu
> >    3. mailto:torqueusers at supercluster.org
> >    4. http://www.supercluster.org/mailman/listinfo/torqueusers
> >    5. mailto:torqueusers at supercluster.org
> >    6. http://www.supercluster.org/mailman/listinfo/torqueusers
> >    7. http://www.adaptivecomputing.com/
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing