[torqueusers] Wrong Exec Host
Regina Guilabert Canals
regina.guilabert at uib.es
Tue Sep 25 01:31:17 MDT 2007
Dear TORQUE users,
TORQUE seems to ignore the number of nodes we request with -l nodes=3
(or any other number) and sets exec_host so that the job fills just one
node (both of its processors).
The particular problem is:
Script submitted via qsub un_proc.pbs:
#!/bin/tcsh
### Job name
#PBS -N TEST
### Declare job non-rerunable
##PBS -r n
### Max run time
#PBS -l walltime=00:20:30
### Mail to user
#PBS -m ae
### Queue name (small, medium, long, verylong)
#PBS -q batch
### Number of nodes (node property ev67 wanted)
#PBS -l nodes=3:ppn=2
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
### Run NON-PARALLEL PROGRAM
awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
###################### END of SCRIPT ######################
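To make the mismatch visible from inside the job itself, a check along the following lines could be added next to the `cat $PBS_NODEFILE` call. This is only a sketch in POSIX sh (the job script above is tcsh, so it would need to live in a separate helper script); the function name count_hosts is our own and not part of TORQUE, and the /dev/null fallback is just a placeholder so the snippet also runs outside a job.

```shell
#!/bin/sh
# Count the distinct hosts listed in a node file.
# The node file holds one host name per allocated processor slot,
# so a nodes=3:ppn=2 request should yield 6 lines and 3 distinct hosts.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# PBS_NODEFILE is set by TORQUE inside a running job.
# The /dev/null fallback is an assumption for running outside a job.
nodefile=${PBS_NODEFILE:-/dev/null}
echo "distinct hosts: $(count_hosts "$nodefile")"
```

For the job shown here we would expect 3 distinct hosts, whereas the exec_host reported below suggests the node file contains only cell8.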
Report on qstat:
megacelula:~> qstat -n1
megacelula:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
3105.megacelula      dfsvhs9  batch    TEST         6027     3  --     -- 00:20 R 00:05   cell8/1+cell8/0
Extended report on this job shows the wrong exec_host assigned:
megacelula:~> qstat -f 3105
Job Id: 3105.megacelula
Job_Name = TEST
Job_Owner = dfsvhs9 at megacelula
resources_used.cput = 00:05:52
resources_used.mem = 3204kb
resources_used.vmem = 13796kb
resources_used.walltime = 00:05:53
job_state = R
queue = batch
server = megacelula
Checkpoint = u
ctime = Mon Sep 24 09:48:26 2007
Error_Path = megacelula:/megadisk/people/dfsvhs9/TEST.e3105
exec_host = cell8/1+cell8/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = ae
mtime = Mon Sep 24 09:48:27 2007
Output_Path = megacelula:/megadisk/people/dfsvhs9/TEST.o3105
Priority = 0
qtime = Mon Sep 24 09:48:26 2007
Rerunable = True
Resource_List.nodect = 3
Resource_List.nodes = 3:ppn=2
Resource_List.walltime = 00:20:30
session_id = 6027
Variable_List = PBS_O_HOME=/megadisk/people/dfsvhs9,
PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=dfsvhs9,
PBS_O_PATH=/home/mcidas/bin:/usr/local/ncarg/bin:./:;:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:;:/megadisk/people/dfsvhs9/bin/bin:;:/usr/pgi/linux86/6.0/bin:;:/usr/local/mpich/bin,
PBS_O_MAIL=/var/mail/dfsvhs9,PBS_O_SHELL=/bin/tcsh,
PBS_O_HOST=megacelula,PBS_O_WORKDIR=/megadisk/people/dfsvhs9,
PBS_O_QUEUE=batch
comment = Job started on Mon Sep 24 at 09:48
etime = Mon Sep 24 09:48:26 2007
Has anybody encountered this weird behaviour before?
We cannot use more than one node unless explicitly requested with "-l
nodes=[list of nodes]". Can anyone provide a hint on how to fix or
diagnose this problem further?
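For comparing runs, the exec_host value can be reduced to a per-host slot count with a small helper like the one below. It is a sketch in POSIX sh only; the name summarize_exec_host is our own invention, not a TORQUE command, and it assumes the host/slot pairs are joined with "+" as in the qstat output above.

```shell
#!/bin/sh
# Summarize an exec_host string such as "cell8/1+cell8/0".
# Prints one line per host in the form "<host> <slots assigned>".
summarize_exec_host() {
    printf '%s\n' "$1" |
        tr '+' '\n' |        # one host/slot pair per line
        cut -d/ -f1 |        # keep only the host name
        sort | uniq -c |     # count slots per host
        awk '{print $2, $1}' # reorder to "host count"
}

# Value taken from the qstat -f output of job 3105.
summarize_exec_host "cell8/1+cell8/0"
```

With the value from job 3105 this prints a single line, "cell8 2", where a nodes=3:ppn=2 request should produce three hosts with two slots each.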
Thanks in advance.
PS: In case it helps, here are the server and queue configurations:
** Server configuration:
megacelula:~> qstat -Bf
Server: megacelula
server_state = Active
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
managers = torque at megacelula
operators = torque at megacelula
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_available.nodect = 28
resources_default.nodes = 2:ppn2
resources_assigned.nodect = 3
scheduler_iteration = 100
node_check_rate = 150
tcp_timeout = 6
node_pack = True
pbs_version = 2.1.8
** Queue configuration:
megacelula:~> qstat -Qf
Queue: debug
queue_type = Execution
Priority = 5
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 1
resources_max.walltime = 00:10:00
mtime = 1180373373
resources_available.nodect = 2
enabled = True
started = True
Queue: batch
queue_type = Execution
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
resources_max.walltime = 48:00:00
resources_default.walltime = 01:00:00
mtime = 1185525866
resources_available.nodect = 56
resources_assigned.nodect = 3
enabled = True
started = True
Regina Guilabert Canals
Grup de Meteorologia
Edif. Mateu Orfila Tel: +34 971 17 3213
Universitat de les Illes Balears Fax: +34 971 17 3426
07122 Palma de Mallorca (Spain) email: regina.guilabert at uib.es