[torqueusers] PBS_NODEFILE: Undefined variable.

Ken Nielson knielson at adaptivecomputing.com
Thu Feb 23 10:28:21 MST 2012


Sreedhar,

I'm glad it is working again. At least we know that the MOM gets into some state that is cleared up by a restart. Let us know if it happens again.

Ken

----- Original Message -----
> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Thursday, February 23, 2012 10:02:42 AM
> Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable.
> 
> Ken,
> 
> I have restarted pbs_mom and now it works with #PBS -V in the script.
> Not sure what exactly is happening.
> 
> Sreedhar.
> 
> On Feb 22, 2012, at 10:14 PM, Sreedhar Manchu wrote:
> 
> > Hi Ken,
> > 
> > First, thank you for your response. We see the problem with MPI
> > jobs. The error appears when people use the PBS_NODEFILE variable
> > to define the hosts for mpiexec. It doesn't happen all the time. I
> > ran some simple test jobs to check whether the variable is visible
> > on some nodes, and it was. The problem is that it happens
> > randomly. For whatever reason, the PBS_NODEFILE variable is not
> > getting initialized or defined in the job environment.
> > 
> > Whenever it happened we restarted pbs_moms and it was ok. At this
> > point I'm not sure whether it's happening repeatedly on the same
> > nodes. Now I'm waiting to see whether it happens on the same nodes
> > again.
> > 
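One way to find out whether the failures recur on the same nodes is to have the job script record where it ran and whether the variable was usable. The following is only a sketch: nodefile_status is a hypothetical helper name, and it assumes the job script runs under sh (csh users would need an equivalent test) and that PBS_JOBID may also be missing when the environment is broken.

```shell
#!/bin/sh
# nodefile_status: hypothetical helper, not part of Torque.
# Prints one line saying whether PBS_NODEFILE is set and readable
# on the node where the job script is running.
nodefile_status() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        state=ok
    else
        state=missing
    fi
    # uname -n gives the node name; PBS_JOBID falls back to "unknown"
    # in case it is missing from the environment as well.
    echo "host=$(uname -n) job=${PBS_JOBID:-unknown} nodefile=$state"
}

# Write to stderr so the line ends up in the job's .err file.
nodefile_status >&2
```

Collecting these lines from the .err files of failed jobs should show whether the same hosts keep appearing.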
> > We see this error in .err files whenever people try to access
> > PBS_NODEFILE. If we restart pbs_mom after a job fails with this
> > error, another job with the same script works fine without any
> > errors.
> > 
> > For example, one user runs his job and at the end of his script he
> > submits another job with this statement.
> > 
> > ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh"
> > 
> > We don't allow users to submit jobs from compute nodes. We have a
> > submit filter in place on login nodes. This user essentially
> > resubmits the same job from the running job, which means the
> > script ran fine the first time but failed on a later run because
> > of the error mentioned above. Another user gets the exact error in
> > the subject line when she tries to access PBS_NODEFILE as
> > var=$PBS_NODEFILE.
> > 
> > The first user runs his job with this statement.
> > 
> > /share/apps/openmpi/1.4.3/intel/bin/mpiexec \
> > -n 144 -hostfile $PBS_NODEFILE \
> > env OMP_NUM_THREADS=1 \
> > /home/mitgcmuv
> > 
> > 
> > We see this error in the .err file. As you can see, because
> > $PBS_NODEFILE is undefined, mpiexec treats env as the hostfile and
> > fails.
> > --------------------------------------------------------------------------
> > Open RTE was unable to open the hostfile:
> >    env
> > Check to make sure the path and filename are correct.
> > --------------------------------------------------------------------------
> > 
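The "env" in that message comes from shell word splitting: an unset variable expanded without quotes disappears from the command line, so the next word ends up as the argument to -hostfile. A minimal sh sketch of the effect and of one possible guard follows; count_args is a hypothetical helper used only for illustration, and the guard uses the standard ${VAR:?message} expansion.

```shell
#!/bin/sh
# count_args: hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

# Simulate the failing case: the variable is missing from the job environment.
unset PBS_NODEFILE

# Unquoted, the empty expansion vanishes, so "env" lands right after -hostfile.
unquoted=$(count_args -hostfile $PBS_NODEFILE env)

# Quoted, the empty string is kept as its own argument.
quoted=$(count_args -hostfile "$PBS_NODEFILE" env)

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
# prints "unquoted: 2 arguments, quoted: 3 arguments"

# Guard: ${VAR:?message} fails when VAR is unset or empty. It runs in a
# subshell here so that this demonstration itself keeps going.
if (: "${PBS_NODEFILE:?PBS_NODEFILE is not set}") 2>/dev/null; then
    guard=ok
else
    guard=missing
fi
echo "guard: $guard"
```

In a real job script the guard line could be placed without the subshell before the mpiexec call, so the job stops with a clear message instead of the Open RTE hostfile error.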
> > Once we restarted pbs_mom, the same script worked fine. I'm not
> > sure what's causing this. I don't see anything wrong in either the
> > Torque logs or syslog.
> > 
> > Today I took all nodes offline and plan to restart pbs_mom on all
> > of them, hoping this fixes the issue for good, though I doubt it
> > will. As we're not sure whether it is happening on the same
> > nodes, restarting pbs_mom everywhere might also give us an idea
> > about that.
> > 
> > Please let us know if you have any thoughts on what might be
> > happening in our case.
> > 
> > Thank you once again for your response and time.
> > 
> > Regards,
> > Sreedhar.
> > 
> > On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote:
> > 
> >> ----- Original Message -----
> >>> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
> >>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>> Sent: Wednesday, February 22, 2012 11:22:16 AM
> >>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
> >>> 
> >>> Hi,
> >>> 
> >>> Recently, I upgraded Torque to version 2.5.10. Since then we
> >>> have been seeing this error: "PBS_NODEFILE: Undefined variable.".
> >>> If we restart pbs_mom, everything works fine. Does anyone have
> >>> any idea what's causing this behavior?
> >>> 
> >>> Please let me know if you need any information that could help in
> >>> figuring out the problem.
> >>> 
> >>> Thanks,
> >>> Sreedhar.
> >> 
> >> Sreedhar,
> >> 
> >> Is the error showing up at the console or in the log file?
> >> 
> >> Ken
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > ---
> > Sreedhar Manchu
> > HPC Support Specialist
> > New York University
> > 251 Mercer Street
> > New York, NY 10012-1110
> > 
> > 
> 
> 

