[torqueusers] Torque with OpenMPI
Jozef Káčer
quickparser at gmail.com
Thu Feb 21 10:09:42 MST 2008
This was great, Craig! I would never have guessed that the code might be buggy.
I copied the recompiled binary to all my nodes (I still don't have NFS).
Now, when I run the code like this:
q-parser@f135-3:~$ mpirun -np 7 --hostfile zoznam test_app
0(f135-4): We have 7 processors
0(f135-4): Hello 1! Processor 1 (f135-5) reporting for duty
0(f135-4): Hello 2! Processor 2 (f135-6) reporting for duty
0(f135-4): Hello 3! Processor 3 (f135-7) reporting for duty
0(f135-4): Hello 4! Processor 4 (f135-8) reporting for duty
0(f135-4): Hello 5! Processor 5 (f135-9) reporting for duty
0(f135-4): Hello 6! Processor 6 (f135-11) reporting for duty
It seems to me that one processor is still missing, but I get no error
message about it.
However, when I run it through Torque, the job seems to hang. 'showq' shows
that the job is running, but it never finishes.
q-parser@f135-3:~$ showq
ACTIVE JOBS--------------------
JOBNAME    USERNAME    STATE      PROC    REMAINING    STARTTIME
113        q-parser    Running       7     00:49:29    Thu Feb 21 17:56:04

1 Active Job    7 of 22 Processors Active (31.82%)
                4 of 11 Nodes Active (36.36%)
...
My script looks like this:
#!/bin/bash
#PBS -N test_job
#PBS -q batch
#PBS -l nodes=7
#PBS -l cput=00:02:00
cd
mpirun ./test_app
exit 0
All my nodes are running now. qstat -f tells me that the job was assigned to
these hosts:
exec_host = f135-15/1+f135-15/0+f135-14/1+f135-14/0+f135-13/1+f135-13/0+f135-12/0
I'm thankful for your time and effort.
On Thu, Feb 21, 2008 at 5:37 PM, Craig West <cwest at astro.umass.edu> wrote:
>
> Jozef,
>
> It is buggy code. The simple problem is that idstr is only 32 chars.
> When you sprintf the long string at line 45 of the code you are writing
> past the end of the idstr buffer, so segfaults and the like will occur. Change
> the size of idstr to be 64 and try again. Don't go too much bigger than
> 64 as you will cause problems with BUFSIZE.
>
> I should note that it crashed here when I ran it; it works fine with
> idstr[64].
>
> > If anybody might know of anything that could help me I'm listening.
>
> Craig.
>