[torqueusers] PBS job issue

Abhishek Gupta abhig at Princeton.EDU
Fri Jan 16 10:10:34 MST 2009


Thanks a lot Steve for all your help.
Abhishek.

Steve Young wrote:
> Hi,
>     Is MPI compiled to be TM-aware? If it is, it can use the 
> pbs_moms to start and stop the mpd daemons. When you say you 
> checked the nodes that were assigned, what exactly do you mean: 
> assigned by PBS or assigned by MPI? If MPI isn't compiled to be 
> TM-aware, then torque will assign nodes to the job, but MPI won't use 
> them and will assign its own list of nodes to the job. So, like I 
> mentioned before, even though torque tells you that it assigned the 
> job to certain nodes, it might in fact be running on different nodes 
> that MPI assigned. What you need to do now is make sure your version 
> of MPI is compiled to be TM-aware. Search the archives of this list 
> and you'll find it's a common problem people encounter.
>
> -Steve
>
>
> On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:
>
>> Hi Steve,
>> You are right, it is an MPI type of job. I checked the nodes which were 
>> assigned to the job and there was no job running. Even a job that 
>> should run in a few seconds was totally stuck. Could you please tell 
>> me what I should do to solve this problem?
>> Thanks,
>> Abhishek.
>>
>> Steve Young wrote:
>>> Hi,
>>>    I'm wondering if this is an MPI type of job? Did you make sure to 
>>> compile MPI to be TM-aware? How do you know the job is not actually 
>>> running somewhere? I've found that if you don't make MPI aware of 
>>> torque, then the jobs end up on nodes MPI assigns and don't run on 
>>> the nodes torque assigns. I ended up using OSC's version of mpiexec, 
>>> but using a version of MPI that can be compiled to be TM-aware would 
>>> do the same thing. This is just a guess without knowing what kind of 
>>> job you're running, what version of torque you have, how you have 
>>> things configured and such. Hope this helps,
>>>
>>> -Steve
>>>
>>>
>>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>>
>>>> Hi all,
>>>> I am facing a problem with job submission in which my first job 
>>>> gets stuck forever (showing R state), and if I run the same job 
>>>> again while the first job is still there, the second job runs 
>>>> without any problem. I found that this problem arises only when I 
>>>> ask for more than 1 node. If I say nodes=1:ppn=2, it runs without 
>>>> any problem, but nodes=2 does not work the first time. One more 
>>>> thing I found: if some other user's job that requires more than 
>>>> one node is already stuck, my job requiring more than one node 
>>>> runs smoothly, while the other user's job stays in that state 
>>>> forever.
>>>> Could someone tell me what the issue could be? Is there any 
>>>> parameter that needs to be set?
>>>> Thanks,
>>>> Abhishek.
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>
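
Steve's question in the thread ("How do you know the job is not actually running somewhere?") can be checked by comparing the hosts Torque assigned with the hosts where the MPI ranks really ran. The sketch below is only an illustration of that comparison, not something posted in the thread. It assumes the assigned hosts come from the file named by $PBS_NODEFILE, which Torque fills with one hostname per assigned processor slot. The function names and the host names node01, node02 and node07 are hypothetical.

```python
from collections import Counter


def parse_nodefile(text):
    """Count processor slots per host from $PBS_NODEFILE-style text.

    Each non-empty line is one hostname for one assigned slot, so a host
    that appears twice has two slots.
    """
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def compare_hosts(assigned, observed):
    """Return (stray, idle).

    stray: hosts where ranks were seen but which Torque did not assign.
    idle:  hosts Torque assigned but where no rank was seen.
    """
    stray = sorted(set(observed) - set(assigned))
    idle = sorted(set(assigned) - set(observed))
    return stray, idle


# Hypothetical data: Torque assigned node01 and node02, but the ranks
# reported that they were running on node01 and node07.
assigned = parse_nodefile("node01\nnode02\n")
observed = ["node01", "node07"]

stray, idle = compare_hosts(assigned, observed)
print("ranks on hosts Torque did not assign:", stray)
print("assigned hosts with no ranks:", idle)
```

In a real job the assigned text would be read from the nodefile inside the job script, and the observed list could be gathered by having each rank print its hostname. A non-empty result in either list would be consistent with the situation Steve describes, where MPI picks its own node list.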

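One plausible reading of why nodes=1:ppn=2 worked while nodes=2 did not is that the first request keeps every process on a single host, whereas the second needs processes started on another host, which is where TM awareness matters. The sketch below only illustrates that distinction. It handles just the simple `nodes=N[:ppn=M]` form quoted in the thread, treats a missing ppn as 1, and assumes the scheduler honors nodes=N as N distinct hosts. Both function names are hypothetical.

```python
import re


def parse_nodes_request(spec):
    """Parse a simple 'nodes=N' or 'nodes=N:ppn=M' request into (nodes, ppn).

    Only this simple form is handled; host lists and node properties are
    out of scope for the sketch.
    """
    match = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?", spec)
    if match is None:
        raise ValueError(f"unsupported request: {spec!r}")
    nodes = int(match.group(1))
    ppn = int(match.group(2)) if match.group(2) else 1
    return nodes, ppn


def needs_remote_launch(spec):
    """True when the request spans more than one host, so the MPI launcher
    has to start processes on a host other than the one running the script."""
    nodes, _ppn = parse_nodes_request(spec)
    return nodes > 1


for spec in ("nodes=1:ppn=2", "nodes=2"):
    nodes, ppn = parse_nodes_request(spec)
    print(spec, "-> slots:", nodes * ppn,
          "remote launch needed:", needs_remote_launch(spec))
```

Both requests ask for two slots in total, but only nodes=2 spreads them across hosts, which matches the pattern Abhishek reported.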
