[torqueusers] qsub/mpirun problems
Zhiliang Hu
zhu at iastate.edu
Wed Sep 17 20:06:15 MDT 2008
Sorry for cross posting -- I didn't get the problem solved on other lists:
We are running an 8-node Linux CentOS cluster. When submitting an mpiblast job with "qsub", I ran into this dilemma: what is the correct way to supply the node information? To "qsub" (-l nodes=6:ppn=2), to "mpirun" (-np 12 -machinefile /path/to/mpimachines), or to both? All of these failed in my trials (details below).
Any advice is appreciated.
Zhiliang
PS: My trials (each command is actually on one line; I have broken them up here for readability):
(1)
The following mpiblast runs fine on our CentOS cluster:
------------------------------------------------------
/path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines
/path/to/mpiblast
-p blastn
-d seq.db
-i /path/to/input.seq
-o /path/to/output.txt
------------------------------------------------------
(2)
When I try to send the job with 'qsub', it has problems:
--------------------------------------
qsub -l nodes=6:ppn=2
-e /path/to/locationA
-o /path/to/locationA
/path/to/program
where "program" is:
/path/to/bin/mpirun
/path/to/mpiblast
-p blastn
-d seq.db
-i /path/to/input.seq
-o /path/to/output.txt
--------------------------------------
Torque's "..ER" file says: "Sorry, mpiBLAST must be run on 3 or more nodes". (It also appears among the node's /undelivered/ errors.)
A SIDE NOTE: This worked before on this machine but for some weird reason it is failing now.
(3)
But if I specify the node information to mpirun as well, as in:
--------------------------------------
qsub -l nodes=6:ppn=2
-e /path/to/locationA
-o /path/to/locationA
/path/to/program
where "program" is:
/path/to/bin/mpirun -np 12 -machinefile /path/to/mpimachines
/path/to/mpiblast
-p blastn
-d seq.db
-i /path/to/input.seq
-o /path/to/output.txt
--------------------------------------
It fails with error: "pls:tm: failed to poll for a spawned proc, return status = 17002".
-- what's the proper way to queue mpiblast jobs?
Zhiliang
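For reference, one common pattern for the question above is to request the nodes with "qsub" and let the job script derive the mpirun arguments from the node file that Torque writes for the job ($PBS_NODEFILE, one line per allocated processor slot). The following is only a minimal sketch under stated assumptions: the post does not say which MPI implementation is in use, the helper name build_mpirun_cmd is hypothetical, and the script only prints the command rather than running it, so it can be checked before use.

```shell
#!/bin/sh
# Sketch only: build_mpirun_cmd is a hypothetical helper, not part of
# Torque or of any MPI distribution.
#
# Inside a running job, Torque sets PBS_NODEFILE to a file that lists one
# line per allocated processor slot (nodes=6:ppn=2 gives 12 lines).
build_mpirun_cmd() {
    nodefile=$1
    shift
    # Count the slots so that -np always matches the qsub request.
    np=$(wc -l < "$nodefile" | tr -d ' ')
    echo "mpirun -np $np -machinefile $nodefile $*"
}

# When run as a job script (e.g. qsub -l nodes=6:ppn=2 job.sh), print the
# command that would be launched. Paths are the placeholders from the post.
if [ -n "${PBS_NODEFILE:-}" ]; then
    build_mpirun_cmd "$PBS_NODEFILE" /path/to/mpiblast -p blastn \
        -d seq.db -i /path/to/input.seq -o /path/to/output.txt
fi
```

Deriving -np from the node file keeps the process count and the Torque allocation from disagreeing, which is one possible source of mismatches between a hand-written machinefile and the nodes the scheduler actually assigned. Whether an explicit -machinefile is needed at all depends on how the local mpirun was built, which is not known here.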