[torqueusers] PBS job issue
Abhishek Gupta
abhig at Princeton.EDU
Fri Jan 16 13:33:40 MST 2009
Steve,
Do you have links that explain the exact configuration options for
compiling it the right way? The person who did it is not here, so I
cannot contact him. I might have to try setting up the desired
configuration on some other computer and test it. I tried finding it on
the internet but couldn't find a proper link with all the details I
require to set it up properly.
Thanks,
Abhishek.
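
If the installed MPI turns out to be Open MPI (an assumption; the thread
does not say which MPI is in use), the existing build can be inspected
without finding whoever compiled it. Open MPI's ompi_info lists the
components that were built in, and a TM-aware build shows "tm" entries;
such a build is produced by passing --with-tm to Open MPI's configure
script. A minimal sketch in Python follows. The helper name
tm_components and the line shape it matches are assumptions for
illustration, not something taken from this thread.

```python
import re
import shutil
import subprocess


def tm_components(ompi_info_text):
    """Return the MCA frameworks that list a 'tm' component.

    Assumes ompi_info prints lines shaped like
    'MCA ras: tm (MCA v2.0, ...)'; adjust the pattern if your
    version formats its output differently.
    """
    found = []
    for line in ompi_info_text.splitlines():
        match = re.match(r"\s*MCA\s+(\w+):\s+tm\b", line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    exe = shutil.which("ompi_info")
    if exe is None:
        # No ompi_info on PATH: probably not Open MPI, so this check does not apply.
        print("ompi_info not found; this check only applies to Open MPI")
    else:
        text = subprocess.run([exe], capture_output=True, text=True).stdout
        frameworks = tm_components(text)
        if frameworks:
            print("TM support found in:", ", ".join(frameworks))
        else:
            print("No tm components listed; this build is probably not TM aware")
```

If no tm components show up, rebuilding Open MPI with --with-tm pointing
at the Torque installation is the likely next step. Other MPI
implementations need a different check.
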
Steve Young wrote:
> Hi,
> No, I mean building/compiling MPI itself so that you get those
> executables: mpiexec, mpicc, etc. Are you using openmpi? mpich? Making
> it TM aware is a configure option when compiling the version of MPI
> you have. When you compile MPI and make it TM aware, it builds mpiexec
> so that it knows how to get the node information from torque, and it
> also allows torque to use the pbs_moms to start and stop the mpd
> process. Otherwise, you'd need to worry about having to start mpd on
> each host you plan on running on. Someone must have compiled MPI on
> your system in order for you to have gotten mpiexec, mpicc, and so on.
> If you didn't build it, then you'll need to find the person who did
> and ask them how it was compiled. Hope this helps =).
>
> -Steve
>
>
>
> On Jan 16, 2009, at 1:38 PM, Abhishek Gupta wrote:
>
>> Steve,
>> Could you tell me how to compile MPI to make it TM aware? If by
>> compiling you mean using mpicc to compile C/C++ programs and mpif77
>> for Fortran, and then using the mpirun command in the PBS script to
>> submit it as a job, then I did this, but my first job is still stuck
>> there, with the following ones running with no problem.
>> Am I missing something here?
>> Thanks,
>> Abhishek.
>>
>> Steve Young wrote:
>>> Hi,
>>> Is MPI compiled to be TM aware? If it is, it would be able to use
>>> the pbs_moms to start and stop the mpd daemons. When you say you
>>> checked the nodes that were assigned, what do you mean exactly?
>>> Assigned by PBS or assigned by MPI? If MPI isn't compiled to be TM
>>> aware, then torque will assign nodes to the job but MPI won't use
>>> them and will assign its own list of nodes to the job. So, like I
>>> mentioned before, even though torque tells you that it assigned the
>>> job to certain nodes, it might in fact be running on different nodes
>>> that MPI assigned. What you need to do now is make sure your version
>>> of MPI is compiled to be TM-aware. Search the archives of this list
>>> and you'll find it's a common problem people encounter.
>>>
>>> -Steve
>>>
>>>
>>> On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:
>>>
>>>> Hi Steve,
>>>> You are right, it is an MPI type of job. I checked the nodes which
>>>> were assigned to the job and there was no job running. Even a job
>>>> that should run in a few seconds was totally stuck. Could you
>>>> please tell me what I should do to solve this problem?
>>>> Thanks,
>>>> Abhishek.
>>>>
>>>> Steve Young wrote:
>>>>> Hi,
>>>>> I'm wondering if this is an MPI type of job? Did you make sure
>>>>> to compile MPI to be TM-aware? How do you know the job is not
>>>>> actually running somewhere? I've found that if you don't make MPI
>>>>> aware of torque, then the jobs end up on nodes MPI assigns and
>>>>> don't run on the nodes torque assigns. I ended up using OSC's
>>>>> version of mpiexec, but using a version of MPI that can be compiled
>>>>> to be TM aware would do the same thing. This is just a guess
>>>>> without knowing what kind of job you're running, what version of
>>>>> torque you have, how you have things configured and such. Hope
>>>>> this helps,
>>>>>
>>>>> -Steve
>>>>>
>>>>>
>>>>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I am facing a problem with job submission in which my first job
>>>>>> gets stuck forever (showing R state), and if I submit the same
>>>>>> job again while the first one is still there, the second job runs
>>>>>> without any problem. I found that this problem arises only when I
>>>>>> ask for more than one node. Even if I say nodes=1:ppn=2, it runs
>>>>>> without any problem, but nodes=2 does not work the first time.
>>>>>> One more thing I found: even if some other job (one that requires
>>>>>> more than one node, started by some other user) is stuck, my job
>>>>>> requiring more than one node runs smoothly, while the job of that
>>>>>> other user stays in that state forever.
>>>>>> Could someone tell me what the issue could be? Is there any
>>>>>> parameter that needs to be set?
>>>>>> Thanks,
>>>>>> Abhishek.
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>
>
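
The question raised in the thread, which nodes torque assigned versus
where the MPI processes actually run, can be checked from inside a job.
Torque sets PBS_NODEFILE to a file with one hostname per assigned slot.
The sketch below summarises that file so it can be compared with the
hosts where the MPI processes really appear (for example by having each
rank print its hostname). The function name slots_per_host is
hypothetical, chosen only for this illustration.

```python
import os
from collections import Counter


def slots_per_host(nodefile_lines):
    """Count assigned slots per host from the lines of a PBS node file.

    Each non-empty line is taken to be one hostname, repeated once
    per slot assigned on that host.
    """
    return Counter(line.strip() for line in nodefile_lines if line.strip())


if __name__ == "__main__":
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        # Outside a torque job the variable is not defined.
        print("PBS_NODEFILE is not set; run this inside a torque job")
    else:
        with open(path) as handle:
            for host, slots in sorted(slots_per_host(handle).items()):
                print(f"{host} slots={slots}")
```

Run from the PBS script before the mpirun or mpiexec line, this prints
what torque handed to the job. If the MPI ranks report other hostnames,
MPI is not taking its node list from torque.
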