[torqueusers] PBS job issue
abhig at Princeton.EDU
Fri Jan 16 10:10:34 MST 2009
Thanks a lot, Steve, for all your help.
Steve Young wrote:
> Is MPI compiled to be TM-aware? If it is, it can use the pbs_moms
> to start and stop the mpd daemons. When you say you checked the
> nodes which were assigned, what exactly do you mean? Assigned by
> PBS or assigned by MPI? If MPI isn't compiled to be TM-aware, then
> torque will assign nodes to the job, but MPI won't use them and will
> assign its own list of nodes to the job. So, as I mentioned before,
> even though torque tells you that it assigned the job to certain
> nodes, it might in fact be running on different nodes that MPI
> assigned. What you need to do now is make sure your version of MPI
> is compiled to be TM-aware. Search the archives of this list and
> you'll find it's a common problem people encounter.
> On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:
>> Hi Steve,
>> You are right, it is an MPI type of job. I checked the nodes which
>> were assigned to the job and there was no job running. Even a job
>> that should run in a few seconds was totally stuck. Could you please
>> tell me what I should do to solve this problem?
>> Steve Young wrote:
>>> I'm wondering if this is an MPI type of job? Did you make sure to
>>> compile MPI to be TM-aware? How do you know the job is not actually
>>> running somewhere? I've found that if you don't make MPI aware of
>>> torque, then the jobs end up on nodes MPI assigns and don't run on
>>> the nodes torque assigns. I ended up using OSC's version of mpiexec,
>>> but using a version of MPI that can be compiled to be TM-aware would
>>> do the same thing. This is just a guess without knowing what kind of
>>> job you're running, what version of torque you have, how you have
>>> things configured and such. Hope this helps,
>>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>>> Hi all,
>>>> I am facing a problem with job submission in which my first job
>>>> gets stuck forever (showing R state), and if I submit the same job
>>>> again while the first one is still there, the second job runs
>>>> without any problem. I found that this problem arises only when I
>>>> ask for more than 1 node. Even if I say nodes=1:ppn=2, it runs
>>>> without any problem, but nodes=2 does not work the first time.
>>>> There is one more thing that I found: if some other job that
>>>> requires more than one node (started by some other user) is already
>>>> stuck, my job requiring more than one node runs smoothly, while the
>>>> job of that other user stays in that state forever.
>>>> Could someone tell me what the issue could be? Is there any
>>>> parameter that needs to be set?
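One way to check the point raised in the replies above (nodes assigned by PBS versus nodes MPI actually uses) is to compare the job's node file with the hosts where the MPI ranks really run. The sketch below is only an illustration, not a fix for the TM-awareness problem itself. It assumes it is run inside a job, where torque sets PBS_NODEFILE to a file listing one hostname per allocated slot. The function names and the "observed" host list (for example, hostnames printed by each rank) are hypothetical.

```python
import os
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many slots each host has in the text of a PBS node file."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


def compare_hosts(assigned, observed):
    """Return (assigned hosts with no ranks, hosts with ranks but not assigned)."""
    assigned_set, observed_set = set(assigned), set(observed)
    idle = sorted(assigned_set - observed_set)
    unexpected = sorted(observed_set - assigned_set)
    return idle, unexpected


if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a running torque job.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as fh:
            print(dict(slots_per_host(fh.read())))
    else:
        print("PBS_NODEFILE is not set; run this inside a job script.")
```

If compare_hosts reports assigned hosts with no ranks, or ranks on hosts torque never assigned, that matches the symptom of an MPI that is not TM-aware.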