[torqueusers] PBS_NODEFILE: Undefined variable.
sm4082 at nyu.edu
Thu Feb 23 14:19:33 MST 2012
We also use environment modules. The problem is that in a parallel job, putting "module load <module name>" in the script loads the module only on the mother superior. With an mpiexec built with the Intel compilers, even "module load <intel compiler>" does not load the required modules on the other nodes in the job. So we ask users to add "module load <intel compiler>" to their .bashrc; with #PBS -V the whole environment is then exported to all the nodes, and everything works.
Overall, I'm not sure whether the problem is in version 2.5.10 or in the way I installed Torque. For now we wrote a script that checks for the PBS_NODEFILE variable and, if it doesn't exist, sets it to the path of the job's node file, /opt/torque/aux/$PBS_JOBID. I am adding this to our qsub wrapper so that it inserts a line sourcing the script into the user's PBS script, right after the PBS directives.
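A minimal sketch of that check, assuming a Bourne-style job script and a node list that lives in /opt/torque/aux/$PBS_JOBID as on our install. The function name set_pbs_nodefile and the TORQUE_AUX_DIR override are made up for illustration; they are not part of Torque.

```shell
# Hypothetical helper, meant to be sourced from a job script.
# Assumption: the node list for a job is the file $TORQUE_AUX_DIR/$PBS_JOBID
# (/opt/torque/aux on our install; adjust for yours).
set_pbs_nodefile() {
    aux_dir="${TORQUE_AUX_DIR:-/opt/torque/aux}"

    # Nothing to do if the variable is already set.
    if [ -n "${PBS_NODEFILE:-}" ]; then
        return 0
    fi

    # Without a job ID we cannot work out the file name.
    if [ -z "${PBS_JOBID:-}" ]; then
        echo "set_pbs_nodefile: PBS_JOBID is not set" >&2
        return 1
    fi

    candidate="$aux_dir/$PBS_JOBID"
    if [ -r "$candidate" ]; then
        PBS_NODEFILE="$candidate"
        export PBS_NODEFILE
        return 0
    fi

    echo "set_pbs_nodefile: $candidate is not readable" >&2
    return 1
}
```

A job script would source the file that defines this function and call "set_pbs_nodefile || exit 1" before mpiexec, so a missing node file stops the job early instead of failing later.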
For now this works ok. Recently, tons of jobs failed because of this error. Hopefully, this fixes the problem.
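The wrapper step could look roughly like the filter below. It is only a sketch: the function name inject_source_line and the default helper path are hypothetical, and it simply treats every line beginning with "#PBS" as a directive.

```shell
# Hypothetical filter for a qsub wrapper: reads a job script on stdin and
# writes it to stdout with one extra ". <helper>" line placed after the
# last line that starts with "#PBS".
# The default helper path is an assumption; pass your own as $1.
inject_source_line() {
    helper="${1:-/opt/torque/etc/set_nodefile.sh}"
    awk -v line=". $helper" '
        { text[NR] = $0 }          # buffer the whole script
        /^#PBS/ { last = NR }      # remember the last directive line
        END {
            # No directives: fall back to just after a shebang, else the top.
            if (!last && text[1] ~ /^#!/) last = 1
            if (!last) print line
            for (i = 1; i <= NR; i++) {
                print text[i]
                if (i == last) print line
            }
        }
    '
}
```

For example, "inject_source_line /path/to/helper.sh < job.pbs > job.patched" leaves the directives where qsub expects them and puts the extra line ahead of the first command.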
Thanks again for writing.
On Feb 23, 2012, at 3:51 PM, <Gareth.Williams at csiro.au> wrote:
>> -----Original Message-----
>> From: Sreedhar Manchu [mailto:sm4082 at nyu.edu]
>> Sent: Friday, 24 February 2012 1:36 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable.
>> Hi Ken,
>> This morning I did a few more test runs and found out it's #PBS -V in
>> the script that's causing problems. Whenever I include it I'm seeing this
>> error. I think it's trying to look for this variable in the user's exported
>> environment rather than looking at the variables it defines itself.
>> Everything works fine if I take -V off the script.
>> But the problem is some users need to mention this in their scripts to
>> export not only user defined variables but also system variables such
>> as LD_LIBRARY_PATH.
>> Is there a way to fix it? Now I'll do some more tests to see whether we have
>> the same problem with other PBS variables such as PBS_O_WORKDIR etc.
>> Please let me know if you have any suggestions. I will send another
>> email once I do more testing.
> Hi Sreedhar,
> I see you are not sure that -V is the problem, but nevertheless I'd recommend using -v with an explicit set of variables to pass rather than -V. In most cases I'd say it is actually better to set the variables in the script itself, so that it doesn't matter what environment you submit the job from, and to use neither -v nor -V. Many sites use 'environment modules' to set such variables.
> Good luck,
>> Sent from my phone. Please excuse my brevity and any typos.
>> On Feb 22, 2012, at 14:58, Ken Nielson <knielson at adaptivecomputing.com> wrote:
>>> ----- Original Message -----
>>>> From: "Sreedhar Manchu" <sm4082 at nyu.edu>
>>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>>> Sent: Wednesday, February 22, 2012 11:22:16 AM
>>>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>>>> Recently, I upgraded Torque to version 2.5.10. Since then
>>>> we have been seeing this error "PBS_NODEFILE: Undefined variable.".
>>>> If we restart pbs mom then everything works fine. Does anyone have
>>>> any idea what's causing this behavior?
>>>> Please let me know if you need any information that could help in
>>>> figuring out the problem.
>>> Is the error showing up at the console or in the log file?
>>> torqueusers mailing list
>>> torqueusers at supercluster.org