[torqueusers] PBS_NODEFILE: Undefined variable.

Sreedhar Manchu sm4082 at nyu.edu
Thu Feb 23 14:00:53 MST 2012


Hi Ken,

Unfortunately, it's happening again. It works for a while after the restart, but then the problem comes back. The file (/opt/torque/aux/PBS_JOBID) is there in <torque home>/aux/, but somehow PBS_NODEFILE is not getting initialized to this path. A lot of PBS variables are missing from the environment; PBS_NODEFILE is one of them.
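
To narrow down which variables go missing and on which node, a check along these lines could be run at the top of a job script. This is only a sketch: the function name report_missing_pbs_vars is made up for illustration, and the variable list is a handful of names Torque normally sets for a job, not an exhaustive or authoritative list.

```shell
# Hypothetical helper: report which of a few commonly set Torque job
# variables are empty or unset in the current environment.
report_missing_pbs_vars() {
    missing=""
    for name in PBS_JOBID PBS_NODEFILE PBS_O_WORKDIR PBS_O_HOST \
                PBS_QUEUE PBS_JOBNAME PBS_ENVIRONMENT; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        # uname -n prints the node name, so the output can be matched to a host.
        echo "host=$(uname -n) missing:$missing"
        return 1
    fi
    echo "host=$(uname -n) all checked PBS variables are set"
}
```

Calling report_missing_pbs_vars in each job and collecting the output would show whether the failures keep landing on the same nodes.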

For now, we wrote a script and asked users to source it. It checks for the PBS_NODEFILE variable and, if it doesn't exist, assigns the path <torque home>/aux/PBS_JOBID to it. I'm not sure whether we're the only ones facing this or whether there is a problem with version 2.5.10 itself. Please let me know if you need any more information from my side in case you want to look into this problem.
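
A minimal sketch of such a workaround is shown here, not the actual script we deployed. It assumes a POSIX shell, that PBS_JOBID is still set in the job, and that the Torque home is /opt/torque unless TORQUE_HOME says otherwise; the function name fix_pbs_nodefile and the TORQUE_HOME override are made up for illustration.

```shell
# Hypothetical workaround, meant to be sourced from a job script.
# If PBS_NODEFILE is unset, point it at <torque home>/aux/<job id>.
fix_pbs_nodefile() {
    # Assumption: Torque home defaults to /opt/torque, as in the path above.
    torque_home="${TORQUE_HOME:-/opt/torque}"

    # Nothing to do when the MOM already set the variable.
    if [ -n "${PBS_NODEFILE:-}" ]; then
        return 0
    fi

    # Without the job id the node file cannot be located.
    if [ -z "${PBS_JOBID:-}" ]; then
        echo "PBS_JOBID is not set; cannot locate the node file" >&2
        return 1
    fi

    candidate="$torque_home/aux/$PBS_JOBID"
    if [ ! -r "$candidate" ]; then
        echo "node file not readable: $candidate" >&2
        return 1
    fi

    PBS_NODEFILE="$candidate"
    export PBS_NODEFILE
}
```

After sourcing the file, a job script would call "fix_pbs_nodefile || exit 1" before the mpiexec line. The "Undefined variable." message is what csh/tcsh prints, so users with csh job scripts would need an equivalent written in csh syntax.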

Thanks,
Sreedhar.

On Feb 23, 2012, at 12:28 PM, Ken Nielson wrote:

> Sreedhar,
> 
> I'm glad it is working again. At least we know that the MOM gets into some state that is cleared up by a restart. Let us know if it happens again.
> 
> Ken
> 
> ----- Original Message -----
>> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Sent: Thursday, February 23, 2012 10:02:42 AM
>> Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable.
>> 
>> Ken,
>> 
>> I have restarted pbs_mom and now it works with #PBS -V in the script.
>> Not sure what exactly is happening.
>> 
>> Sreedhar.
>> 
>> On Feb 22, 2012, at 10:14 PM, Sreedhar Manchu wrote:
>> 
>>> Hi Ken,
>>> 
>>> First, thank you for your response. We see the problem with MPI
>>> jobs. When people use the PBS_NODEFILE variable to define the hosts
>>> for mpiexec to run on, we see this error. It doesn't happen all
>>> the time. I tested some simple jobs to check whether I could
>>> see this variable on some nodes, and it was OK. The problem is that
>>> it's happening randomly. For whatever reason, the variable
>>> PBS_NODEFILE is not getting initialized or defined in the job
>>> environment.
>>> 
>>> Whenever it happened we restarted pbs_moms and it was ok. At this
>>> point I'm not sure whether it's happening repeatedly on the same
>>> nodes. Now I'm waiting to see whether it happens on the same nodes
>>> again.
>>> 
>>> We see this error in .err files whenever people try to access
>>> PBS_NODEFILE. If we restart pbs_mom once the job fails with this
>>> error, another job with the same script works fine without any
>>> errors.
>>> 
>>> For example, one user runs his job and at the end of his script he
>>> submits another job with this statement.
>>> 
>>> ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh"
>>> 
>>> We don't allow users to submit jobs from compute nodes. We have a
>>> submit filter in place on login nodes. This user pretty much
>>> submits the same job from the running job, which means it ran well
>>> the first time but failed the next time because of the
>>> above-mentioned error. Another user gets the exact error in the
>>> subject when she tries to access PBS_NODEFILE as var=$PBS_NODEFILE.
>>> 
>>> First user tries to run his job with this statement.
>>> 
>>> /share/apps/openmpi/1.4.3/intel/bin/mpiexec \
>>> -n 144 -hostfile $PBS_NODEFILE \
>>> env OMP_NUM_THREADS=1 \
>>> /home/mitgcmuv
>>> 
>>> 
>>> We see this error in the .err file. As you can see, since
>>> $PBS_NODEFILE is not set, mpiexec treats env as the hostfile and
>>> fails.
>>> --------------------------------------------------------------------------
>>> Open RTE was unable to open the hostfile:
>>>   env
>>> Check to make sure the path and filename are correct.
>>> --------------------------------------------------------------------------
>>> 
>>> Once we restarted pbs_mom, the same script worked fine. I'm not
>>> sure what's causing this. I don't see anything wrong in either the
>>> torque logs or syslogs.
>>> 
>>> Today I took all nodes offline and plan to restart pbs_mom on all
>>> nodes, hoping this will fix the issue for good, though I doubt
>>> it will. As we're not sure whether it is happening on the
>>> same nodes, restarting all nodes might give us an idea about this
>>> as well.
>>> 
>>> Please let us know if you have any thoughts on what might be
>>> happening in our case.
>>> 
>>> Thank you once again for your response and time.
>>> 
>>> Regards,
>>> Sreedhar.
>>> 
>>> On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote:
>>> 
>>>> ----- Original Message -----
>>>>> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
>>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>>> Sent: Wednesday, February 22, 2012 11:22:16 AM
>>>>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Recently, I upgraded Torque to its 2.5.10 version. Since then
>>>>> we have been seeing this error: "PBS_NODEFILE: Undefined
>>>>> variable.". If we restart pbs_mom then everything works fine.
>>>>> Does anyone have any idea what's causing this behavior?
>>>>> 
>>>>> Please let me know if you need any information that could help in
>>>>> figuring out the problem.
>>>>> 
>>>>> Thanks,
>>>>> Sreedhar.
>>>> 
>>>> Sreedhar,
>>>> 
>>>> Is the error showing up at the console or in the log file?
>>>> 
>>>> Ken
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>> 
>>> ---
>>> Sreedhar Manchu
>>> HPC Support Specialist
>>> New York University
>>> 251 Mercer Street
>>> New York, NY 10012-1110
>>> 
>>> 
>> 
>> 


