[Mauiusers] Parallel Job with more than one compute node doesn't start!

Dr. Stephan Raub raub at uni-duesseldorf.de
Fri Dec 11 04:37:05 MST 2009


Hello everyone

I have a strange problem with jobs that request more than one compute
node. I tried a simple test script like this:


#PBS -l walltime=00:20:00
#PBS -l nodes=2:ppn=1
#PBS -N testjob5
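
For reference, a complete test script built around these directives might look like the sketch below. The body is an assumption (the original script body is not shown); list_job_hosts is a made-up helper name, and PBS_NODEFILE and PBS_JOBID are the variables pbs_mom sets at run time.

```shell
#!/bin/sh
#PBS -l walltime=00:20:00
#PBS -l nodes=2:ppn=1
#PBS -N testjob5

# Print the distinct hosts assigned to this job.
# PBS_NODEFILE is set by pbs_mom at run time; fall back to /dev/null so the
# script also runs harmlessly outside a job.
list_job_hosts() {
    sort -u "${PBS_NODEFILE:-/dev/null}"
}

echo "job ${PBS_JOBID:-unknown} running on $(uname -n)"
list_job_hosts
```

If the job ever starts, the output file should list both assigned nodes.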

The job stays in the queue with status "Q" indefinitely. qstat -f shows an
increasing "start_count" and "exit_status = -3". I found that the scheduler
(Maui) had already assigned the job to two nodes. I set $logevent=255 and
$loglevel=7 on these two nodes (node2 and node3); the relevant parts of
their logs are below.
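
The fields mentioned here can be pulled out of the qstat -f output with a small filter. This is only a sketch: job_retry_fields is a made-up helper name, and 78 is the job number that appears in the logs.

```shell
# Keep only the state and retry fields from `qstat -f` output read on stdin.
job_retry_fields() {
    grep -E '^[[:space:]]*(job_state|start_count|exit_status)[[:space:]]*='
}

# Usage on the cluster:
#   qstat -f 78 | job_retry_fields
```

Running it repeatedly makes it easy to see whether start_count keeps growing while the job remains queued.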

Jobs with #PBS -l nodes=1:ppn=4 start normally on 4 cores of one node.

I would really welcome any help you can give.

Thank you in advance.

Stephan


Log of Node2
------------
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;ready to commit job
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;ready to commit job completed
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
Commit from PBS_Server
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type Commit
from host .blabla.cluster received
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type Commit
from host .blabla.cluster allowed
12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
Commit on sd=10
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;committing job
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;starting job execution
12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;0: .blabla-2/0
12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;1: .blabla-3/0
12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2
numvnod=2
12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
pre-sigprocmask
12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
post-initgroups
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;job 78.xxx reported
successful start on 2 node(s)
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;job execution started
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
Disconnect from PBS_Server
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
StatusJob from PBS_Server
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
StatusJob from host .blabla.cluster received
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
StatusJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
StatusJob on sd=10
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
ModifyJob from PBS_Server
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host .blabla.cluster received
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
ModifyJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
ModifyJob on sd=14
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;modifying job
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;modifying type 6 attribute
session_id of job (value: '???')
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) entered
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
attribute 'ncpus'
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
attribute 'neednodes'
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
attribute 'nodes'
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
attribute 'walltime'
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) completed
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;Job Modified at request of
PBS_Server at .blabla.cluster
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
Disconnect from PBS_Server
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
Disconnect from PBS_Server
12/11/2009 14:03:53;0008;   pbs_mom;Job;do_rpp;got an internal task manager
request in do_rpp
12/11/2009 14:03:53;0002;   pbs_mom;Svr;im_request;connect from
192.168.1.63:15003
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;received request 'ERROR' for
job 78.xxx from 192.168.1.63:15003
12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad UID for job
execution (15023) in 78.xxx, job_start_error from node 192.168.1.63:15003 in
job_start_error
12/11/2009 14:03:53;0008;   pbs_mom;Req;send_sisters;sending command
ABORT_JOB for job 78.xxx (10)
12/11/2009 14:03:53;0008;   pbs_mom;Req;send_sisters;sending ABORT to
sisters
12/11/2009 14:03:53;0080;   pbs_mom;Svr;scan_for_exiting;searching for
exiting jobs
12/11/2009 14:03:53;0008;   pbs_mom;Job;kill_job;scan_for_exiting: sending
signal 9, "KILL" to job 78.xxx, reason: local task termination detected
12/11/2009 14:03:53;0002;   pbs_mom;n/a;run_pelog;userepilog script
'/var/spool/torque/mom_priv/epilogue.precancel' for job 78.xxx does not
exist (cwd: /var/spool/torque/mom_priv,pid: 12854)
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;kill_job done (killed 0
processes)
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;sending preobit jobstat
12/11/2009 14:03:53;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
12/11/2009 14:03:53;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
12/11/2009 14:03:53;0080;   pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;performing job clean-up
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;epilog subtask created with
pid 12858 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_close_poll;entered
12/11/2009 14:03:53;0008;   pbs_mom;Job;scan_for_terminated;entered
12/11/2009 14:03:53;0080;   pbs_mom;Svr;mom_get_sample;proc_array load
started
12/11/2009 14:03:53;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded -
nproc=194
12/11/2009 14:03:53;0080;   pbs_mom;n/a;cput_sum;proc_array loop start -
jobid = 78.xxx
12/11/2009 14:03:53;0080;   pbs_mom;n/a;mem_sum;proc_array loop start -
jobid = 78.xxx
12/11/2009 14:03:53;0080;   pbs_mom;n/a;resi_sum;proc_array loop start -
jobid = 78.xxx
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;checking job w/subtask
pid=12858 (child pid=12858)
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;checking job post-processing
routine
12/11/2009 14:03:53;0080;   pbs_mom;Req;post_epilogue;preparing obit message
for job 78.xxx
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;encoding "send flagged" attr:
Error_Path
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;obit sent to server
12/11/2009 14:03:53;0001;   pbs_mom;Job;78.xxx;setting job substate to
EXITED
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
DeleteJob from PBS_Server
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
DeleteJob from host .blabla.cluster received
12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
DeleteJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
DeleteJob on sd=10
12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;deleting job
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;deleting job 78.xxx in state
EXITED
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removing job
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removed job script
12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removed job file
12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
Disconnect from PBS_Server
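
In a log this verbose the error records are easy to miss. A sketch of a filter that reads a mom log on stdin and keeps only the error records for one job follows; job_errors is a hypothetical helper, and the log path in the usage comment is an assumption based on the /var/spool/torque prefix seen above and the usual mom_logs/YYYYMMDD naming.

```shell
# Print log records that mention the given job id and contain "ERROR"
# (this matches both LOG_ERROR records and 'ERROR' task manager requests).
job_errors() {
    jobid=$1
    grep -F "$jobid" | grep 'ERROR'
}

# Usage on a compute node (path is an assumption):
#   job_errors 78.xxx < /var/spool/torque/mom_logs/20091211
```

Applied to the Node2 log, this would single out the "received request 'ERROR'" record and the "Bad UID for job execution" record.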



Log of Node3
------------
12/11/2009 14:03:40;0008;   pbs_mom;Job;do_rpp;got an internal task manager
request in do_rpp
12/11/2009 14:03:40;0002;   pbs_mom;Svr;im_request;connect from
192.168.1.62:1022
12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;received request 'JOIN_JOB'
for job 78.xxx from 192.168.1.62:1022
12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;im_request: JOIN_JOB 78.xxx
node 1
12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;0: .blabla-2/0
12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;1: .blabla-3/0
12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2
numvnod=2
12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;no group entry for group
admin, user=raub, errno=0 (Success)
12/11/2009 14:03:40;0080;   pbs_mom;Job;78.xxx;removing job
12/11/2009 14:03:40;0008;   pbs_mom;Job;do_rpp;got an internal task manager
request in do_rpp
12/11/2009 14:03:40;0002;   pbs_mom;Svr;im_request;connect from
192.168.1.62:1022
12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;received request 'ABORT_JOB'
for job 78.xxx from 192.168.1.62:1022
12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;ERROR:    received request
'ABORT_JOB' from 192.168.1.62:1022 for job '78.xxx' (job does not exist
locally)
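
The Node3 record "no group entry for group admin, user=raub" appears just before the job is removed there, so one plausible reading (an assumption, not confirmed by the logs) is that the group does not resolve on that node. A minimal check that could be run on each node is sketched below; check_account is a made-up helper, and the host names in the usage comment are hypothetical.

```shell
# Report whether a user and a group resolve on the local host.
check_account() {
    user=$1
    group=$2
    if id "$user" >/dev/null 2>&1; then
        echo "user $user: uid=$(id -u "$user")"
    else
        echo "user $user: NOT FOUND"
    fi
    if getent group "$group" >/dev/null 2>&1; then
        echo "group $group: gid=$(getent group "$group" | cut -d: -f3)"
    else
        echo "group $group: NOT FOUND"
    fi
}

# Usage, assuming ssh access to the compute nodes (host names hypothetical):
#   for h in node2 node3; do
#       echo "== $h"; ssh "$h" 'id raub; getent group admin'
#   done
```

If the output differs between the two nodes, the account databases are not in sync.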


Output of qmgr -c "p s"
-----------------------
create queue rhel                                  
set queue rhel queue_type = Execution              
set queue rhel from_route_only = True              
set queue rhel resources_max.opsys = RHEL          
set queue rhel resources_max.walltime = 360:00:00  
set queue rhel resources_min.opsys = RHEL          
set queue rhel enabled = True
set queue rhel started = True

create queue default
set queue default queue_type = Route
set queue default resources_default.opsys = SL
set queue default route_destinations = SciLinux
set queue default route_destinations += rhel
set queue default enabled = True
set queue default started = True

set server acl_hosts = xxx
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server resources_default.ncpus = 1
set server resources_default.nodes = 1
set server resources_default.opsys = SL
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 79



Maui.cfg
--------
SERVERHOST xxx

ADMIN1 root
ADMINHOST localhost

RMTYPE[0] PBS
RMHOST[0] localhost
RMSERVER[0] localhost

RMPOLLINTERVAL 00:00:10

SERVERPORT 40559
SERVERMODE NORMAL

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

ENFORCERESOURCELIMITS ON

QUEUETIMEWEIGHT 1

BACKFILLDEPTH 0
BACKFILLMETRIC PROCS
BACKFILLPOLICY BESTFIT

QOSCFG[monopol] QFLAGS=DEDICATED

CLASSWEIGHT 10
CLASSCFG[cuda] QLIST=monopol QDEF=monopol

--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | Lehrstuhl für IT-Management / ZIM
| | Heinrich-Heine-Universität Düsseldorf Universitätsstr. 1 /
| | 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
---------------------------------------------------------



