[torqueusers] PBS_NODEFILE: Undefined variable.

Sreedhar Manchu sm4082 at nyu.edu
Wed Feb 22 20:14:00 MST 2012

Hi Ken,

First, thank you for your response. We see the problem with MPI jobs. The error appears when people use the PBS_NODEFILE variable to define the hosts for mpiexec. It doesn't happen all the time. I ran some simple test jobs to check whether the variable was visible on some nodes, and it was fine. The problem is that it happens randomly: for whatever reason, PBS_NODEFILE is not getting initialized or defined in the job environment.

Whenever it happened we restarted pbs_moms and it was ok. At this point I'm not sure whether it's happening repeatedly on the same nodes. Now I'm waiting to see whether it happens on the same nodes again.

We see this error in .err files whenever people try to access PBS_NODEFILE. If we restart pbs_mom after a job fails with this error, another job with the same script works fine without any errors.

For example, one user runs his job and at the end of his script he submits another job with this statement.

ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh"

We don't allow users to submit jobs from compute nodes; we have a submit filter in place on the login nodes. This user essentially resubmits the same job from within the running job, which means the script ran well the first time but failed on the resubmission because of the above-mentioned error. Another user gets the exact error in the subject line when she tries to access PBS_NODEFILE as var=$PBS_NODEFILE.

The first user runs his job with this statement:

/share/apps/openmpi/1.4.3/intel/bin/mpiexec \
-n 144 -hostfile $PBS_NODEFILE \

We see this error in the .err file. As you can see, since mpiexec couldn't find $PBS_NODEFILE, it takes env as the hostfile and fails.
Open RTE was unable to open the hostfile:
Check to make sure the path and filename are correct.

Once we restarted pbs_mom, the same script worked fine. I'm not sure what's causing this. I don't see anything wrong in either the Torque logs or the syslogs.
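
In the meantime, a guard near the top of the job script could at least record which node the variable was missing on, instead of letting mpiexec fail on a bad hostfile. This is only a minimal sketch, assuming the job script runs under a Bourne-style shell (sh or bash); check_nodefile is a hypothetical helper name and not something Torque provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): verify that PBS_NODEFILE
# is usable before handing it to mpiexec.
check_nodefile() {
    # ${PBS_NODEFILE:-} expands to an empty string when the variable is unset,
    # so this test catches both "unset" and "set but empty".
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set on $(uname -n)" >&2
        return 1
    fi
    # The variable is set, but the file it names may still be missing.
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE=$PBS_NODEFILE is not readable on $(uname -n)" >&2
        return 2
    fi
    return 0
}
```

Calling `check_nodefile || exit 1` just before the mpiexec line would put the node name in the .err file, which should help tell whether the failures keep landing on the same nodes.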

Today I marked all nodes offline and plan to restart pbs_mom on all of them, hoping this fixes the issue for good, though I doubt it will. As we're not sure whether it is happening on the same nodes, restarting pbs_mom everywhere might give us an idea on this as well.

Please let us know if you have any thoughts on what might be happening with our case.

Thank you once again for your response and time.


On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote:

> ----- Original Message -----
>> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Sent: Wednesday, February 22, 2012 11:22:16 AM
>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>> Hi,
>> Recently, I have upgraded Torque to its 2.5.10 version. Since then
>> we have been seeing this error "PBS_NODEFILE: Undefined variable.".
>> If we restart pbs mom then everything works fine. Does anyone have
>> any idea what's causing this behavior?
>> Please let me know if you need any information that could help in
>> figuring out the problem.
>> Thanks,
>> Sreedhar.
> Sreedhar,
> Is the error showing up at the console or in the log file?
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
