[torqueusers] Open MPI Torque Errors

Christopher Vaughan chris at clusterresources.com
Fri Aug 17 09:14:29 MDT 2007


Some more information that may be helpful:

[nuvolari31:15351] [0,0,0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 173
[nuvolari31:15351] pls:tm: final top-level argv:
[nuvolari31:15351] pls:tm:     orted --no-daemonize --bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename  --universe asdfeer@nuvolari31:default-universe-15351 --nsreplica "0.0.0;tcp://192.168.8.31:39662" --gprreplica "0.0.0;tcp://192.168.8.31:39662"
[nuvolari31:15351] pls:tm: found /home/asdfeer/OpenFOAM/openmpi-1.2.3/platforms/linuxGcc4SPOpt/bin/orted
[nuvolari31:15351] pls:tm: launching on node nuvolari31
[nuvolari31:15351] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename nuvolari31 --universe asdfeer@nuvolari31:default-universe-15351 --nsreplica "0.0.0;tcp://192.168.8.31:39662" --gprreplica "0.0.0;tcp://192.168.8.31:39662"
[nuvolari31:15351] pls:tm:launch: finished spawning orteds
[nuvolari31:15351] pls:tm: failed to poll for a spawned proc, return status = 17002
[nuvolari31:15351] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[nuvolari31:15351] mpiexec: spawn failed with errno=-11

[aadfewr@nuvolarib fluent-test]$ qsub -I
qsub: waiting for job 165.nuvolarib to start
qsub: job 165.nuvolarib ready

Single Precision
[aadfewr@nuvolari31 ~]$ env | grep PBS
PBS_O_HOME=/home/aadfewr
PBS_O_LANG=en_GB.UTF-8
PBS_O_LOGNAME=aadfewr
PBS_O_PATH=/home/aadfewr/OpenFOAM/openmpi-1.2.3/platforms/linuxGcc4SPOpt/bin:/home/aadfewr/OpenFOAM/aadfewr-1.3.2/applications/bin/linuxGcc4SPOpt:/home/aadfewr/OpenFOAM/OpenFOAM-1.3.2/applications/bin/linuxGcc4SPOpt:/home/aadfewr/OpenFOAM/OpenFOAM-1.3.2/bin:/nfs/foamq/bigdisk/CODES/system-cfd/bin:/nfs/codehost/CODES/Sculptor/Sculptor-1.8.7//bin/RedHatEnterprise3:/nfs/codehost/CODES/Sculptor/Sculptor-1.8.5//bin/RedHatEnterprise3:/nfs/codehost/CODES/Sculptor/Sculptor-1.8//bin/RedHatEnterprise3:/nfs/codehost/CODES/Sculptor/Sculptor-1.7.5/Sculptor-1.7/bin/Suse9.1:/nfs/codehost/CODES/Sculptor/Sculptor-64bit/Sculptor-1.7/bin/Suse9.1-64:/nfs/codehost/CODES/Sculptor/Sculptor-1.7/bin/linux:/nfs/codehost/CODES/Sculptor/Sculptor-1.6.4/bin/linux:/nfs/codehost/CODES/Sculptor/Sculptor-1.6.5b/bin/linux:/nfs/codehost/CODES/Sculptor/Sculptor-1.6.5/bin/Suse9.1:/nfs/codehost/CODES/Sculptor/Sculptor-1.5.7/bin/linux:/nfs/codehost/CODES/crisp/crisp92/9.2.0a/bin.linux_glibc2.2:/nfs/codehost/CODES/crisp/crisp91/9.1.2c/bin.linux_glibc2.2:/nfs/codehost/CODES/Sculptor/Sculptor-1.0/bin/linux:/bin:/usr/bin:/usr/sbin:/usr/bin/X11:/etc:/usr/lib:/usr/local/bin:.:/nfs/codehost/CODES/public/bin:/nfs/codehost/CODES/acfd/bin:/nfs/codehost/CODES/acfd/bin/Linux:/nfs/codehost/CODES/qt-linux/bin:/nfs/codehost/CODES/ensight/bin:/nfs/codehost/CODES/fluent/tgrid3.4.2/bin:/usr/local/sbin:/nfs/codehost/CODES/gridgen/gridgen15:/nfs/codehost/CODES/mses/command.dir:/nfs/codehost/CODES/crisp/crisp80/bin.irix6:/nfs/codehost/CODES/harpoon/bin:/nfs/codehost/CODES/public/lib
PBS_O_MAIL=/var/spool/mail/aadfewr
PBS_O_SHELL=/bin/csh
PBS_O_HOST=nuvolarib
PBS_O_WORKDIR=/home/fluent-test
PBS_O_QUEUE=batch
PBS_JOBNAME=STDIN
PBS_JOBID=165.nuvolarib
PBS_QUEUE=batch
PBS_JOBCOOKIE=459A7A2FE08EE09DD2184654CEE5C4BE
PBS_NODENUM=0
PBS_TASKNUM=1
PBS_MOMPORT=15003
PBS_NODEFILE=/home/fluent-hosts
PBS_VNODENUM=0
PBS_ENVIRONMENT=PBS_INTERACTIVE
[aadfewr@nuvolari31 ~]$
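
The environment above shows PBS_NODEFILE=/home/fluent-hosts, so one thing that could be checked from inside the interactive job is whether that file is actually readable on the compute node and which hosts it lists. The following is only a sketch of such a check: check_nodefile is a hypothetical helper name, and whether the "File open failure" in ras_tm_module.c has anything to do with this file is an assumption, not something the logs confirm.

```shell
#!/bin/bash
# Hypothetical helper: report whether a Torque node file is usable.
# Takes the path as an optional argument and falls back to $PBS_NODEFILE.
check_nodefile() {
    local f="${1:-$PBS_NODEFILE}"
    if [ -z "$f" ]; then
        echo "PBS_NODEFILE is not set"
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "cannot read node file: $f"
        return 1
    fi
    # Count distinct host names; the file has one line per allocated slot.
    echo "node file $f lists $(sort -u "$f" | wc -l | tr -d ' ') distinct host(s)"
}

# Run the check automatically only when inside a Torque job.
if [ -n "$PBS_JOBID" ]; then
    check_nodefile
fi
```

Run inside `qsub -I`, this prints either the reason the file cannot be used or the number of distinct hosts it names.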





Christopher Vaughan wrote:
> Hi,
>
> I'm having some issues getting Torque to work with Open MPI.  I've
> followed the Open MPI FAQ on how to build it with tm support and I'm
> still getting errors.  I can run Open MPI jobs fine, but when I add
> Torque into the mix they fail with the following errors.
>
> [nuvolari31:13888] [0,0,0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 173
> [nuvolari31:13888] pls:tm: failed to poll for a spawned proc, return status = 17002
> [nuvolari31:13888] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
> [nuvolari31:13888] mpiexec: spawn failed with errno=-11
>
> Torque Mom Log:
>
> 08/16/2007 10:15:30;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:15:32;0080;   pbs_mom;Job;141.nuvolarib;scan_for_terminated: job 141.nuvolarib task 1 terminated, sid 25553
> 08/16/2007 10:15:32;0008;   pbs_mom;Job;141.nuvolarib;job was terminated
> 08/16/2007 10:37:40;0008;   pbs_mom;Job;142.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
> 08/16/2007 10:37:40;0001;   pbs_mom;Job;TMomFinalizeJob3;job 142.nuvolarib started, pid = 25808
> 08/16/2007 10:37:41;0008;   pbs_mom;Job;142.nuvolarib;kill_task: killing pid 25809 task 1 with sig 9
> 08/16/2007 10:37:41;0080;   pbs_mom;Job;142.nuvolarib;scan_for_terminated: job 142.nuvolarib task 1 terminated, sid 25808
> 08/16/2007 10:37:41;0008;   pbs_mom;Job;142.nuvolarib;job was terminated
> 08/16/2007 10:40:23;0008;   pbs_mom;Job;143.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
> 08/16/2007 10:40:24;0001;   pbs_mom;Job;TMomFinalizeJob3;job 143.nuvolarib started, pid = 25969
> 08/16/2007 10:42:40;0008;   pbs_mom;Job;143.nuvolarib;kill_task: killing pid 25972 task 1 with sig 9
> 08/16/2007 10:42:40;0080;   pbs_mom;Job;143.nuvolarib;scan_for_terminated: job 143.nuvolarib task 1 terminated, sid 25969
> 08/16/2007 10:42:40;0008;   pbs_mom;Job;143.nuvolarib;job was terminated
> 08/16/2007 10:52:06;0008;   pbs_mom;Job;144.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
> 08/16/2007 10:52:06;0001;   pbs_mom;Job;TMomFinalizeJob3;job 144.nuvolarib started, pid = 26198
> 08/16/2007 10:52:28;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:52:35;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:00;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:24;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:37;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:45;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:50;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:53:58;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:54:10;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:54:38;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
> 08/16/2007 10:55:02;0008;   pbs_mom;Job;144.nuvolarib;kill_task: killing pid 26201 task 1 with sig 9
> 08/16/2007 10:55:02;0080;   pbs_mom;Job;144.nuvolarib;scan_for_terminated: job 144.nuvolarib task 1 terminated, sid 26198
> 08/16/2007 10:55:02;0008;   pbs_mom;Job;144.nuvolarib;job was terminated
> 08/16/2007 11:05:18;0001;   pbs_mom;Job;TMomFinalizeJob3;job 145.nuvolarib started, pid = 26435
> 08/16/2007 11:05:18;0008;   pbs_mom;Job;145.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
>
> 08/14/2007 06:55:33;000d;PBS_Server;Job;116.nuvolarib;Post job file processing error; job 116.nuvolarib on host nuvolari31/0+nuvolari30/0
> 08/14/2007 06:55:33;000d;PBS_Server;Job;116.nuvolarib;request to copy stageout files failed on node 'nuvolari31/0+nuvolari30/0' for job 116.nuvolarib
>
> Simple jobs like the following fail as well.
>
> #!/bin/bash
> hostname
> cat $PBS_NODEFILE
>
> Other info that may help.
>
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch resources_available.nodect = 14
> set queue batch keep_completed = 300
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server managers = root@nuvolarib
> set server operators = root@nuvolarib
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server log_level = 7
> set server pbs_version = 2.1.8
>
>
> Any ideas as to what to try next?
> I still think it's an error with data staging between the two hosts.
> I've tried $usecp with an NFS share, and that gives me errors as well.
>
>
> Thanks,
>
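
For reference on the $usecp attempt mentioned above, the directive goes in mom_priv/config under the Torque home directory on each compute node and maps a host pattern and path prefix on the submit side to a local path, so that pbs_mom copies output files with a plain cp instead of rcp/scp. The line below is a sketch only. It assumes /home is the same NFS export mounted at the same path on the server and on every compute node, which the post does not confirm, and the wildcard host pattern is illustrative.

```text
$usecp *:/home /home
```

pbs_mom has to re-read its configuration (for example by being restarted) before a change like this takes effect.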


-- 
Chris Vaughan
EMEA Systems Engineer
Cluster Resources, Ltd.
Direct - UK Office:  +44 (0)1223 437 132
Mobile - +44 (0)7800 973 062
US Headquarters:  +1 801 717 3700
Skype: supercomputer1
www.clusterresources.co.uk




