[torqueusers] mpiexec jobs got stuck
akohlmey at cmm.chem.upenn.edu
Wed May 13 11:53:45 MDT 2009
On Wed, 2009-05-13 at 12:22 -0400, Abhishek Gupta wrote:
> Hi Troy,
> I was able to fix the error message from my last mail, but the
> problem I explained in the beginning still exists, i.e. the job runs
> for a while and then gets stuck forever. As I said, it runs fine up
> to a node value of 20, but beyond that it shows this behavior.
> Is there anything else I can try?
If the job runs for a bit and then stops, the problem is most likely
in the MPI library or the communication hardware. Once a job is
started, Torque has very little to do with what happens until the
job finishes. If this happens only with a larger number of nodes,
there are two likely reasons: a) a specific node has a problem and
does not get allocated for smaller jobs (assuming that nobody else
is running on the machine), or b) there is an overload problem due to
excessive communication. In particular, some GigE switches fail in
uncontrolled ways under high load, and many MPI implementations
have no provisions for that kind of behavior (corrupted data).
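To narrow down case a), one sketch of an approach using Torque's standard
client tools (the node names node21..node24 are hypothetical placeholders,
as is job.sh):

```shell
# List nodes that Torque currently considers down or offline.
pbsnodes -l

# Resubmit the same MPI job pinned to an explicit set of hosts,
# rotating the set between runs, to test whether one particular
# node is the culprit.
qsub -l nodes=node21:ppn=2+node22:ppn=2+node23:ppn=2+node24:ppn=2 job.sh
```

If the job only hangs when a certain host is in the set, that node is
the prime suspect; if it hangs regardless of which hosts are used once
the count is large enough, the switch/overload explanation b) becomes
more likely.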
> Troy Baer wrote:
> > On Tue, 2009-05-12 at 17:03 -0400, Abhishek Gupta wrote:
> > > It is giving me an error:
> > > mpiexec: Error: get_hosts: pbs_statjob returned neither "ncpus" nor "nodect"
> > >
> > > Any suggestion?
> > >
> > What does your job script look like? How are you requesting nodes
> > and/or processors?
> > --Troy
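For reference, a minimal sketch of a job script that requests nodes
explicitly, so that pbs_statjob can report a node count to mpiexec
(the node/ppn counts, job name, and program name are placeholders):

```shell
#!/bin/sh
#PBS -N mpitest              # job name (placeholder)
#PBS -l nodes=4:ppn=2        # request 4 nodes with 2 processors each
#PBS -l walltime=01:00:00    # one hour wall-clock limit

cd $PBS_O_WORKDIR            # run from the directory the job was submitted in
mpiexec ./my_mpi_program     # mpiexec picks up the node list from Torque
```

Submitted with `qsub job.sh`; without a `-l nodes=...` (or `-l ncpus=...`)
request, some mpiexec implementations cannot determine how many processes
to start, which produces errors like the one quoted above.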
> torqueusers mailing list
> torqueusers at supercluster.org
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
If you make something idiot-proof, the universe creates a better idiot.