[torqueusers] PBS job issue
Steve Young
chemadm at hamilton.edu
Fri Jan 16 10:04:13 MST 2009
Hi,
Is your MPI compiled to be TM-aware? If it is, it can use the
pbs_moms to start and stop the mpd daemons. When you say you checked
the nodes that were assigned, which do you mean: assigned by PBS or
assigned by MPI? If MPI isn't compiled to be TM-aware, torque will
still assign nodes to the job, but MPI won't use them and will pick
its own list of nodes instead. So, as I mentioned before, even though
torque tells you it assigned the job to certain nodes, the job might
in fact be running on different nodes that MPI chose. What you need to
do now is make sure your version of MPI is compiled to be TM-aware.
Search the archives of this list and you'll find it's a common problem
people encounter.
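
One way to check whether MPI is really using the nodes torque assigned
is to compare the hosts listed in $PBS_NODEFILE with the hosts your MPI
processes actually report. Below is a minimal Python sketch of that
comparison. The function names parse_nodefile and compare_hosts are
made up for illustration, and it assumes you collect the reported
hostnames yourself, for example by running hostname under your MPI
launcher from inside the job script.

```python
import os
from collections import Counter


def parse_nodefile(text):
    """Count how many slots torque assigned on each host.

    The node file has one hostname per assigned processor slot, so a
    host requested with ppn=2 appears twice.
    """
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def compare_hosts(assigned, reported):
    """Return (stray, idle).

    stray: hosts the MPI processes reported that torque did not assign.
    idle:  hosts torque assigned that no MPI process reported.
    """
    assigned_set, reported_set = set(assigned), set(reported)
    stray = sorted(reported_set - assigned_set)
    idle = sorted(assigned_set - reported_set)
    return stray, idle


if __name__ == "__main__":
    # PBS_NODEFILE is set by torque inside a running job.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile is None:
        print("PBS_NODEFILE is not set; run this inside a torque job.")
    else:
        with open(nodefile) as fh:
            assigned = parse_nodefile(fh.read())
        for host, slots in sorted(assigned.items()):
            print(f"{host}: {slots} slot(s)")
```

If compare_hosts returns a non-empty stray list, the processes are
running somewhere torque did not put them, which is the symptom
described above.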
-Steve
On Jan 16, 2009, at 11:37 AM, Abhishek Gupta wrote:
> Hi Steve,
> You are right, it is an MPI type of job. I checked the nodes that were
> assigned to the job and there was no job running on them. Even a job
> that should finish in a few seconds was totally stuck. Could you
> please tell me what I should do to solve this problem?
> Thanks,
> Abhishek.
>
> Steve Young wrote:
>> Hi,
>> I'm wondering if this is an MPI type of job? Did you make sure
>> to compile MPI to be TM-aware? How do you know the job is not
>> actually running somewhere? I've found that if you don't make MPI
>> aware of torque, then the jobs end up on nodes MPI assigns and
>> don't run on the nodes torque assigns. I ended up using OSC's
>> version of mpiexec, but using a version of MPI that can be compiled
>> to be TM-aware would do the same thing. This is just a guess
>> without knowing what kind of job you're running, what version of
>> torque you have, how you have things configured and such. Hope this
>> helps,
>>
>> -Steve
>>
>>
>> On Jan 16, 2009, at 11:13 AM, Abhishek Gupta wrote:
>>
>>> Hi all,
>>> I am facing a problem with job submission in which my first job
>>> gets stuck forever (showing R state), and if I submit the same job
>>> again while the first one is still there, the second job runs
>>> without any problem. I found that this problem arises only when I
>>> ask for more than 1 node. If I say nodes=1:ppn=2, it runs without
>>> any problem, but nodes=2 does not work the first time. There is
>>> one more thing I found: even when some other user's job (one that
>>> requires more than one node) is stuck, my job requiring more than
>>> one node runs smoothly, while the other user's job stays in that
>>> state forever.
>>> Could someone tell me what the issue could be? Is there any
>>> parameter that needs to be set?
>>> Thanks,
>>> Abhishek.