[torqueusers] PBS job issue

Steve Young chemadm at hamilton.edu
Fri Jan 16 10:04:13 MST 2009


Hi,
	Is MPI compiled to be TM-aware? If it is, it can use the pbs_moms  
to start and stop the mpd daemons. When you checked the nodes that  
were assigned, what do you mean exactly? Assigned by PBS or assigned  
by MPI? If MPI isn't compiled to be TM-aware, Torque will assign  
nodes to the job, but MPI won't use them and will assign its own  
list of nodes instead. So, as I mentioned before, even though Torque  
tells you it assigned the job to certain nodes, it may in fact be  
running on different nodes that MPI assigned. What you need to do  
now is make sure your version of MPI is compiled to be TM-aware.  
Search the archives of this list and you'll find this is a common  
problem people encounter.
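As a sketch of what checking and fixing this might look like with OpenMPI (the install prefix and program name below are assumptions; adjust for your site):

```shell
# Check whether this OpenMPI build includes the TM (Torque/PBS) launch
# component; if TM support is present, grep finds the "tm" MCA plugins.
ompi_info | grep -i " tm"

# If nothing shows up, rebuild OpenMPI against your Torque installation.
# /usr/local/torque is an assumed prefix -- point --with-tm at wherever
# tm.h and the Torque libraries actually live on your system.
./configure --with-tm=/usr/local/torque
make && make install

# A TM-aware mpiexec launches ranks through the pbs_moms on the nodes
# Torque assigned, so the job script needs no machinefile. For example
# (./my_mpi_program is a placeholder):
#
#   #PBS -l nodes=2:ppn=2
#   mpiexec ./my_mpi_program
```

With a TM-aware build, the node list Torque reports and the nodes MPI actually uses should match.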

-Steve


On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:

> Hi Steve,
> You are right, it is an MPI type of job. I checked the nodes that  
> were assigned to the job, and no job was running on them. Even a job  
> that should run for only a few seconds was totally stuck. Could you  
> please tell me what I should do to solve this problem?
> Thanks,
> Abhishek.
>
> Steve Young wrote:
>> Hi,
>>    I'm wondering if this is an MPI type of job? Did you make sure  
>> to compile MPI to be TM-aware? How do you know the job is not  
>> actually running somewhere? I've found that if you don't make MPI  
>> aware of Torque, jobs end up on the nodes MPI assigns rather than  
>> the nodes Torque assigns. I ended up using OSC's version of  
>> mpiexec, but a version of MPI compiled to be TM-aware would do the  
>> same thing. This is just a guess without knowing what kind of job  
>> you're running, what version of Torque you have, how things are  
>> configured, and so on. Hope this helps,
>>
>> -Steve
>>
>>
>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>
>>> Hi all,
>>> I am facing a problem with job submission: my first job gets  
>>> stuck forever (showing the R state), but if I submit the same job  
>>> again while keeping the first one, the second job runs without  
>>> any problem. I found that this problem arises only when I ask for  
>>> more than one node. Even nodes=1:ppn=2 runs without any problem,  
>>> but nodes=2 does not work the first time. One more thing I found:  
>>> even when some other user's job requiring more than one node is  
>>> stuck, my jobs requiring more than one node run smoothly, while  
>>> that other user's job stays in that state forever.
>>> Could someone tell me what the issue could be? Is there any  
>>> parameter that needs to be set?
>>> Thanks,
>>> Abhishek.
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
