[torqueusers] PBS job issue

Abhishek Gupta abhig at Princeton.EDU
Fri Jan 16 13:33:40 MST 2009


Steve,
Do you have links that explain the exact configuration options for 
compiling it the right way? The person who did it is not here, so I 
cannot contact him. I may have to try setting up the desired 
configuration on some other computer and test it. I searched the 
internet but couldn't find a page with all the details I need to set 
it up properly.
Thanks,
Abhishek.
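
One way to find out how an existing install was built, sketched here under the assumption that the MPI in question is Open MPI (the thread does not establish which MPI is installed): Open MPI's `ompi_info` command lists the compiled-in MCA components, and a build configured with `--with-tm` lists components named `tm`. The function name `tm_frameworks` and the sample text below are illustrative only and were not captured from this cluster.

```python
import re

# Matches ompi_info component lines such as "MCA ras: tm (MCA v2.0, ...)".
_TM_LINE = re.compile(r"^\s*MCA\s+(\w+):\s+tm\b", re.MULTILINE)


def tm_frameworks(ompi_info_text):
    """Return the sorted MCA framework names that list a 'tm' component."""
    return sorted(set(_TM_LINE.findall(ompi_info_text)))


# Illustrative sample only (an assumption), not output from a real system.
SAMPLE = """\
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.3)
"""

if __name__ == "__main__":
    found = tm_frameworks(SAMPLE)
    print("TM support found in:", found if found else "none")
```

To check a real install, the text passed in would be the captured output of `ompi_info`. An empty result suggests the build has no Torque (TM) support and would need to be reconfigured with `--with-tm` pointing at the Torque installation.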

Steve Young wrote:
> Hi,
>     No, I mean building/compiling MPI so that you get those 
> executables: mpiexec, mpicc, etc. Are you using openmpi? mpich? Making 
> it TM aware is a configure option when compiling the version of MPI 
> you have. When you compile MPI and make it TM aware, it builds mpiexec 
> so that it knows how to get the node information from torque, and it 
> also allows torque to use the pbs_moms to start and stop the mpd 
> processes. Otherwise, you'd need to worry about starting mpd on each 
> host you plan on running on. Someone must have compiled MPI on your 
> system in order for you to have gotten mpiexec, mpicc, and so on. If 
> you didn't build it, then you'll need to find the person who did and 
> ask them how it was compiled. Hope this helps =).
>
> -Steve
>
>
>
> On Jan 16, 2009, at 1:38 PM, Abhishek Gupta wrote:
>
>> Steve,
>> Could you tell me how to compile MPI to make it TM aware? If by 
>> compiling you mean using mpicc to compile C/C++ programs and mpif77 
>> for Fortran, and then using the mpirun command in the PBS script to 
>> submit it as a job, then I did that, but my first job is still stuck 
>> while the following ones run with no problem.
>> Am I missing something here?
>> Thanks,
>> Abhishek.
>>
>> Steve Young wrote:
>>> Hi,
>>>    Is MPI compiled to be TM aware? If it is, it would be able to 
>>> use the pbs_moms to start and stop the mpd daemons. When you say 
>>> you checked the nodes which were assigned, what do you mean 
>>> exactly? Assigned by PBS or assigned by MPI? If MPI isn't compiled 
>>> to be TM aware, then torque will assign nodes to the job but MPI 
>>> won't use them and will assign its own list of nodes to the job. So, 
>>> as I mentioned before, even though torque tells you that it assigned 
>>> the job to certain nodes, it might in fact be running on different 
>>> nodes that MPI assigned. What you need to do now is make sure your 
>>> version of MPI is compiled to be TM-aware. Search the archives of 
>>> this list and you'll find it's a common problem people encounter.
>>>
>>> -Steve
>>>
>>>
>>> On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:
>>>
>>>> Hi Steve,
>>>> You are right, it is an MPI type of job. I checked the nodes which 
>>>> were assigned to the job and no job was running on them. Even a job 
>>>> that should run in a few seconds was totally stuck. Could you 
>>>> please tell me what I should do to solve this problem?
>>>> Thanks,
>>>> Abhishek.
>>>>
>>>> Steve Young wrote:
>>>>> Hi,
>>>>>   I'm wondering if this is an MPI type of job? Did you make sure 
>>>>> to compile MPI to be TM-aware? How do you know the job is not 
>>>>> actually running somewhere? I've found that if you don't make MPI 
>>>>> aware of torque, then the jobs end up on nodes MPI assigns and 
>>>>> don't run on the nodes torque assigns. I ended up using OSC's 
>>>>> version of mpiexec, but using a version of MPI that can be compiled 
>>>>> to be TM aware would do the same thing. This is just a guess 
>>>>> without knowing what kind of job you're running, what version of 
>>>>> torque you have, how you have things configured and such. Hope 
>>>>> this helps,
>>>>>
>>>>> -Steve
>>>>>
>>>>>
>>>>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I am facing a problem with job submission in which my first job 
>>>>>> gets stuck forever (showing R state), and if I submit the same 
>>>>>> job again while the first one is still there, the second job runs 
>>>>>> without any problem. I found that this problem arises only when I 
>>>>>> ask for more than one node. Even nodes=1:ppn=2 runs without any 
>>>>>> problem, but nodes=2 does not work the first time. One more thing 
>>>>>> I found: even when some other user's job (one that requires more 
>>>>>> than one node) is stuck, my job requiring more than one node runs 
>>>>>> smoothly, while that other user's job stays in that state forever.
>>>>>> Could someone tell me what the issue could be? Is there any 
>>>>>> parameter that needs to be set?
>>>>>> Thanks,
>>>>>> Abhishek.
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>
>
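
The point made in the thread, that the nodes Torque assigns and the nodes MPI actually uses can differ, can be checked from inside a job. A rough sketch, assuming Torque's `$PBS_NODEFILE` (one hostname per line, one line per assigned slot) and a list of hostnames gathered from the MPI ranks, for example by running `hostname` under `mpirun` or `mpiexec`. The helper names `read_nodefile` and `compare_hosts` are hypothetical.

```python
import os
from collections import Counter


def read_nodefile(path):
    """Count slots per host in a PBS nodefile (one hostname per line)."""
    with open(path) as fh:
        return Counter(line.strip() for line in fh if line.strip())


def compare_hosts(assigned, observed):
    """Return (hosts assigned but unused, hosts used but never assigned)."""
    a, o = set(assigned), set(observed)
    return sorted(a - o), sorted(o - a)


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile is None:
        print("PBS_NODEFILE is not set; run this inside a Torque job.")
    else:
        # Show how many slots Torque granted on each host.
        print("Torque assigned:", dict(read_nodefile(nodefile)))
```

If `compare_hosts` returns a non-empty second list, the MPI ranks ran on hosts Torque never assigned, which matches the symptom described above. Hostnames must be in the same form on both sides (short name versus fully qualified name) for the comparison to be meaningful.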

