[torqueusers] Open MPI Torque Errors

Christopher Vaughan chris at clusterresources.com
Fri Aug 17 05:49:32 MDT 2007


Hi,

I'm having some issues getting Torque to play nicely with Open MPI.  I 
followed the Open MPI FAQ on how to build it with tm support and I'm 
still getting errors.  Open MPI jobs run fine on their own, but when I 
launch them through Torque they fail with the following errors:

[nuvolari31:13888] [0,0,0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 173
[nuvolari31:13888] pls:tm: failed to poll for a spawned proc, return status = 17002
[nuvolari31:13888] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[nuvolari31:13888] mpiexec: spawn failed with errno=-11
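
Since the errors above come from the tm components (ras_tm_module.c, pls:tm), one quick sanity check is whether the Open MPI install on each node actually lists tm components. The sketch below filters `ompi_info` output for any MCA component named `tm`. The helper name `has_tm_components` is made up for illustration, and the framework names that carry tm support differ between Open MPI versions, so the pattern accepts any framework.

```shell
#!/bin/bash
# has_tm_components (hypothetical helper): reads `ompi_info` output on stdin
# and succeeds if any MCA component called "tm" is listed.
has_tm_components() {
    grep -Eq 'MCA[[:space:]]+[a-z]+:[[:space:]]+tm[[:space:]]'
}

if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_components; then
        echo "tm components found"
    else
        echo "no tm components listed; Open MPI may have been built without Torque support"
    fi
else
    echo "ompi_info not found in PATH"
fi
```

Running this on both the submit host and the compute nodes would show whether they all pick up the same Open MPI build.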

Torque Mom Log:

08/16/2007 10:15:30;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:15:32;0080;   pbs_mom;Job;141.nuvolarib;scan_for_terminated: job 141.nuvolarib task 1 terminated, sid 25553
08/16/2007 10:15:32;0008;   pbs_mom;Job;141.nuvolarib;job was terminated
08/16/2007 10:37:40;0008;   pbs_mom;Job;142.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
08/16/2007 10:37:40;0001;   pbs_mom;Job;TMomFinalizeJob3;job 142.nuvolarib started, pid = 25808
08/16/2007 10:37:41;0008;   pbs_mom;Job;142.nuvolarib;kill_task: killing pid 25809 task 1 with sig 9
08/16/2007 10:37:41;0080;   pbs_mom;Job;142.nuvolarib;scan_for_terminated: job 142.nuvolarib task 1 terminated, sid 25808
08/16/2007 10:37:41;0008;   pbs_mom;Job;142.nuvolarib;job was terminated
08/16/2007 10:40:23;0008;   pbs_mom;Job;143.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
08/16/2007 10:40:24;0001;   pbs_mom;Job;TMomFinalizeJob3;job 143.nuvolarib started, pid = 25969
08/16/2007 10:42:40;0008;   pbs_mom;Job;143.nuvolarib;kill_task: killing pid 25972 task 1 with sig 9
08/16/2007 10:42:40;0080;   pbs_mom;Job;143.nuvolarib;scan_for_terminated: job 143.nuvolarib task 1 terminated, sid 25969
08/16/2007 10:42:40;0008;   pbs_mom;Job;143.nuvolarib;job was terminated
08/16/2007 10:52:06;0008;   pbs_mom;Job;144.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
08/16/2007 10:52:06;0001;   pbs_mom;Job;TMomFinalizeJob3;job 144.nuvolarib started, pid = 26198
08/16/2007 10:52:28;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:52:35;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:00;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:24;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:37;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:45;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:50;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:53:58;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:54:10;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:54:38;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
08/16/2007 10:55:02;0008;   pbs_mom;Job;144.nuvolarib;kill_task: killing pid 26201 task 1 with sig 9
08/16/2007 10:55:02;0080;   pbs_mom;Job;144.nuvolarib;scan_for_terminated: job 144.nuvolarib task 1 terminated, sid 26198
08/16/2007 10:55:02;0008;   pbs_mom;Job;144.nuvolarib;job was terminated
08/16/2007 11:05:18;0001;   pbs_mom;Job;TMomFinalizeJob3;job 145.nuvolarib started, pid = 26435
08/16/2007 11:05:18;0008;   pbs_mom;Job;145.nuvolarib;Job Modified at request of PBS_Server@nuvolarib
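
A log this repetitive is easier to read as a tally. The following is a rough sketch that counts the tm_request header errors and lists the jobs for which pbs_mom issued kill_task. The function names are invented for illustration, and the log file is simply whatever path is passed as the first argument.

```shell
#!/bin/bash
# count_tm_errors (hypothetical helper): reads a pbs_mom log on stdin and
# prints the number of lines reporting a bad tm_request header.
count_tm_errors() {
    grep -c 'tm_request, bad header'
}

# killed_jobs (hypothetical helper): reads a pbs_mom log on stdin and prints
# the unique job IDs that have a kill_task entry.
killed_jobs() {
    grep -o 'pbs_mom;Job;[^;]*;kill_task' | cut -d';' -f3 | sort -u
}

# Summarise the log file named on the command line, if one was given.
logfile="${1:-}"
if [ -n "$logfile" ] && [ -r "$logfile" ]; then
    echo "tm_request header errors: $(count_tm_errors < "$logfile")"
    echo "jobs with kill_task entries:"
    killed_jobs < "$logfile"
fi
```

If the error count rises only while a particular job is running, that points at that job's launcher rather than at pbs_mom in general.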

Torque Server Log:

08/14/2007 06:55:33;000d;PBS_Server;Job;116.nuvolarib;Post job file processing error; job 116.nuvolarib on host nuvolari31/0+nuvolari30/0
08/14/2007 06:55:33;000d;PBS_Server;Job;116.nuvolarib;request to copy stageout files failed on node 'nuvolari31/0+nuvolari30/0' for job 116.nuvolarib

Simple jobs like the following fail as well.

#!/bin/bash
# Print the name of the node running the job
hostname
# List the nodes Torque allocated to this job
cat "$PBS_NODEFILE"

Other info that may help.

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 14
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = root@nuvolarib
set server operators = root@nuvolarib
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 7
set server pbs_version = 2.1.8


Any ideas as to what to try next?

I still think it's a data staging error between the two hosts.  I've 
tried using $usecp with an NFS share, and that gives me errors as well.
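
For reference, a $usecp line in the pbs_mom config maps a `host:path` prefix to a local path so that stage-out becomes a plain local copy on the shared filesystem. Below is a small sketch that lists the well-formed $usecp lines in a mom config. The helper name `list_usecp` is invented, and the config path is an assumed default that may differ on your install.

```shell
#!/bin/bash
# list_usecp (hypothetical helper): reads a pbs_mom config on stdin and prints
# lines of the form "$usecp <host>:<source_prefix> <destination_prefix>".
list_usecp() {
    grep -E '^[[:space:]]*\$usecp[[:space:]]+[^[:space:]]+:[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]*$'
}

# Assumed default location; override by setting MOM_CONFIG.
MOM_CONFIG="${MOM_CONFIG:-/var/spool/torque/mom_priv/config}"

if [ -r "$MOM_CONFIG" ]; then
    list_usecp < "$MOM_CONFIG" || echo "no \$usecp mapping found in $MOM_CONFIG"
else
    echo "cannot read $MOM_CONFIG"
fi
```

An illustrative mapping (not taken from this cluster) would be `$usecp *:/home /home`, which assumes /home is mounted at the same path on every node. pbs_mom normally has to be restarted or signalled to reread its config after a change.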


Thanks,

-- 
Chris Vaughan
EMEA Systems Engineer
Cluster Resources, Ltd.
Direct - UK Office:  +44 (0)1223 437 132
Mobile - +44 (0)7800 973 062
US Headquarters:  +1 801 717 3700
Skype: supercomputer1
www.clusterresources.co.uk



