[torqueusers] BUG: MOM segfaults
pw at osc.edu
Wed Feb 2 15:36:34 MST 2005
garrick at usc.edu wrote on Wed, 02 Feb 2005 14:26 -0800:
> After reading everything below and looking through the code some more, I still
> don't think that call to set_globid() is needed. Maybe it was needed with
> openpbs 2.3.12, but not with recent torques.
> In addition, I'm realizing that mpiexec still doesn't work after restarting a
> mom. I think the main reason is that the ji_stdout and ji_stderr port numbers
> aren't saved with the job, so the restarted mom can't contact the original
> pbs_demux when a new TM_SPAWN request comes in.
> I'm still looking into this stuff, so I may be changing my mind as I sort
> everything out.
Thanks for figuring all this out. It would be great for torque users
to be able to restart the moms during an mpiexec job.
Your understanding is way beyond mine at this point---I haven't really
been following progress on torque. If there are any changes that should
go into mpiexec (including documentation), let me know and I'll fix it up.