[torqueusers] qsub/mpirun problems

Glen Beane glen.beane at gmail.com
Thu Sep 18 05:22:17 MDT 2008


On Wed, Sep 17, 2008 at 10:06 PM, Zhiliang Hu <zhu at iastate.edu> wrote:

> Sorry for cross posting -- I didn't get the problem solved on other lists:
>
> We are running an 8-node Linux (CentOS) cluster. When submitting an mpiblast
> job with "qsub", I ran into this dilemma: what is the correct way to supply
> the node information: to "qsub" (-l nodes=6:ppn=2), or to "mpirun" (-np 12
> -machinefile /path/to/mpimachines)?  Or both? All of these failed in my
> trials (details below).
>
> Any advice is appreciated.
>
> Zhiliang
>
>
> PS: My trials (each command is on one line; I split them up for readability):
>
> (1)
> The following mpiblast runs fine on our CentOS cluster:
> ------------------------------------------------------
>  /path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines
>    /path/to/mpiblast
>      -p blastn
>      -d seq.db
>      -i /path/to/input.seq
>      -o /path/to/output.txt
> ------------------------------------------------------
>
> (2)
> When I try to send the job with 'qsub', it has problems:
> --------------------------------------
> qsub -l nodes=6:ppn=2
>     -e /path/to/locationA
>     -o /path/to/locationA
>     /path/to/program
>
>  where "program" is:
>
>  /path/to/bin/mpirun
>    /path/to/mpiblast
>      -p blastn
>      -d seq.db
>      -i /path/to/input.seq
>      -o /path/to/output.txt
> --------------------------------------
> Torque's "..ER" file says: "Sorry, mpiBLAST must be run on 3 or more
> nodes". (The same error also shows up in the node's /undelivered/ files.)
>
> A SIDE NOTE: This worked before on this machine but for some weird reason
> it is failing now.
>
>
> (3)
> But if I also specify the node info to mpirun, as in:
> --------------------------------------
> qsub -l nodes=6:ppn=2
>     -e /path/to/locationA
>     -o /path/to/locationA
>     /path/to/program
>
>  where "program" is:
>
>  /path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines
>    /path/to/mpiblast
>      -p blastn
>      -d seq.db
>      -i /path/to/input.seq
>      -o /path/to/output.txt
> --------------------------------------
> It fails with error: "pls:tm: failed to poll for a spawned proc, return
> status = 17002".
>
> So what is the proper way to queue mpiblast jobs?




(3) should work.  Which MPI implementation are you using?  I would check the
mom logs on all of your nodes for an error associated with that job; if you
can track the failure down to a specific node, you might be able to diagnose it.
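
For illustration only, here is a minimal sketch of what "program" in (3) could look like if it took its process count and machine file from the job's own allocation instead of a hand-maintained /path/to/mpimachines. It relies on $PBS_NODEFILE, which Torque sets inside a running job to a file with one line per allocated processor slot (12 lines for nodes=6:ppn=2). The helper names count_slots and list_hosts are made up for this sketch, and it assumes your mpirun accepts -np and -machinefile as in the original post. It also prints the allocated hosts, which tells you whose mom logs to look at.

```shell
#!/bin/sh
#PBS -l nodes=6:ppn=2

# Hypothetical helper: count the processor slots Torque granted.
# $PBS_NODEFILE holds one line per slot, so non-empty lines == slots.
count_slots() {
    grep -c . "$1"
}

# Hypothetical helper: list each allocated host once, so you know
# which nodes' mom logs to inspect if the job fails.
list_hosts() {
    sort -u "$1"
}

# Launch only when running under Torque (PBS_NODEFILE set and readable).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Job ${PBS_JOBID:-unknown}: $NP slots on:"
    list_hosts "$PBS_NODEFILE"

    # Paths are the placeholders from the original post.
    /path/to/bin/mpirun -np "$NP" -machinefile "$PBS_NODEFILE" \
        /path/to/mpiblast -p blastn -d seq.db \
        -i /path/to/input.seq -o /path/to/output.txt
fi
```

With the "#PBS -l" line in the script, the submission would reduce to something like "qsub -e /path/to/locationA -o /path/to/locationA /path/to/program"; passing -l nodes=6:ppn=2 on the qsub command line as before should work as well.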