[Mauiusers] Parallel Job with more than one compute node doesn't start!

Vamvakopoulos Manolis evamvak at cs.uoi.gr
Sun Dec 13 07:28:54 MST 2009


Dear Dr. Stephan Raub

You can check the values of the Maui configuration with the command:

showconfig | more
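
To check a specific parameter you can filter the output, e.g. (assuming
grep is available on the system):

showconfig | grep JOBNODEMATCHPOLICY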

If the following parameters are not already set, you can add them:

ENABLEMULTINODEJOBS           TRUE
JOBNODEMATCHPOLICY            EXACTNODE
NODEACCESSPOLICY              SHARED
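
Note that Maui reads maui.cfg only at startup, so after editing it you
need to restart the scheduler; depending on your installation this may
be something like:

service maui restart
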
Especially with

JOBNODEMATCHPOLICY            EXACTNODE

you can submit parallel jobs such as:
qsub -l nodes=4:ppn=1   (1 core on each of 4 nodes)
or
qsub -l nodes=1:ppn=3   (3 cores on 1 node)
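
As a quick sanity check, a minimal multi-node test script could look
like this (a sketch; $PBS_NODEFILE is the file in which Torque lists
the nodes allocated to the job):

#!/bin/bash
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=1
#PBS -N multinode-test

# print the nodes Torque allocated to this job
echo "Allocated nodes:"
cat $PBS_NODEFILE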

Also, with the command checkjob <jobid>, you can get a lot of
information about the job's status and allocated resources.
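
For example, for the job in your logs (id 78), something like:

checkjob 78
checkjob -v 78

should show in which state the job is and why Maui cannot start it
(-v gives more verbose output).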


best


E.V.
-- 
University of Ioannina
Department of Computer Science
P.O. BOX 1186 Ioannina, Greece
Tel: (+30)-26510-98864
Fax: (+30)-26510-98890


Quoting "Dr. Stephan Raub" <raub at uni-duesseldorf.de>:

> Hello everyone
>
> I have a strange problem with jobs that want to use more than one
> compute node. I tried a simple test script like:
>
>
> #PBS -l walltime=00:20:00
> #PBS -l nodes=2:ppn=1
> #PBS -N testjob5
>
> It stays in the queue with status 'Q' for eternity. qstat -f shows an
> increasing number in 'start_count' and 'exit_status = -3'. I found that
> the scheduler (maui) had already assigned this job to two nodes. I set
> $logevent=255 and $loglevel=7 for these two nodes (node2 and node3) and found
> the relevant parts, which you can find below.
>
> Jobs with #PBS -l nodes=1:ppn=4 start normally on 4 cores of one node.
>
> Please, I would really welcome any help you can give.
>
> Thank you in advance.
>
> Stephan
>
>
> Log of Node2
> ------------
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;ready to commit job
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;ready to commit job completed
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> Commit from PBS_Server
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type Commit
> from host .blabla.cluster received
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type Commit
> from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
> Commit on sd=10
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;committing job
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;starting job execution
> 12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;0: .blabla-2/0
> 12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;1: .blabla-3/0
> 12/11/2009 14:03:53;0001;   pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2
> numvnod=2
> 12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
> pre-sigprocmask
> 12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups,
> post-initgroups
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;job 78.xxx reported
> successful start on 2 node(s)
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;job execution started
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> Disconnect from PBS_Server
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> StatusJob from PBS_Server
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> StatusJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> StatusJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
> StatusJob on sd=10
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> ModifyJob from PBS_Server
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> ModifyJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
> ModifyJob on sd=14
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;modifying job
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;modifying type 6 attribute
> session_id of job (value: '???')
> 12/11/2009 14:03:53;0002;
> pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) entered
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
> attribute 'ncpus'
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
> attribute 'neednodes'
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
> attribute 'nodes'
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_set_limits;setting limit for
> attribute 'walltime'
> 12/11/2009 14:03:53;0002;
> pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) completed
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;Job Modified at request of
> PBS_Server at .blabla.cluster
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> Disconnect from PBS_Server
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> Disconnect from PBS_Server
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;do_rpp;got an internal task manager
> request in do_rpp
> 12/11/2009 14:03:53;0002;   pbs_mom;Svr;im_request;connect from
> 192.168.1.63:15003
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;received request 'ERROR' for
> job 78.xxx from 192.168.1.63:15003
> 12/11/2009 14:03:53;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad UID for job
> execution (15023) in 78.xxx, job_start_error from node 192.168.1.63:15003 in
> job_start_error
> 12/11/2009 14:03:53;0008;   pbs_mom;Req;send_sisters;sending command
> ABORT_JOB for job 78.xxx (10)
> 12/11/2009 14:03:53;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 12/11/2009 14:03:53;0080;   pbs_mom;Svr;scan_for_exiting;searching for
> exiting jobs
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;kill_job;scan_for_exiting: sending
> signal 9, "KILL" to job 78.xxx, reason: local task termination detected
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;run_pelog;userepilog script
> '/var/spool/torque/mom_priv/epilogue.precancel' for job 78.xxx does not
> exist (cwd: /var/spool/torque/mom_priv,pid: 12854)
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;kill_job done (killed 0
> processes)
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;sending preobit jobstat
> 12/11/2009 14:03:53;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 12/11/2009 14:03:53;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 12/11/2009 14:03:53;0080;   pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;performing job clean-up
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;epilog subtask created with
> pid 12858 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
> 12/11/2009 14:03:53;0002;   pbs_mom;n/a;mom_close_poll;entered
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;scan_for_terminated;entered
> 12/11/2009 14:03:53;0080;   pbs_mom;Svr;mom_get_sample;proc_array load
> started
> 12/11/2009 14:03:53;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded -
> nproc=194
> 12/11/2009 14:03:53;0080;   pbs_mom;n/a;cput_sum;proc_array loop start -
> jobid = 78.xxx
> 12/11/2009 14:03:53;0080;   pbs_mom;n/a;mem_sum;proc_array loop start -
> jobid = 78.xxx
> 12/11/2009 14:03:53;0080;   pbs_mom;n/a;resi_sum;proc_array loop start -
> jobid = 78.xxx
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;checking job w/subtask
> pid=12858 (child pid=12858)
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;checking job post-processing
> routine
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;post_epilogue;preparing obit message
> for job 78.xxx
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;encoding "send flagged" attr:
> Error_Path
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;obit sent to server
> 12/11/2009 14:03:53;0001;   pbs_mom;Job;78.xxx;setting job substate to
> EXITED
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> DeleteJob from PBS_Server
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> DeleteJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;process_request;request type
> DeleteJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;dispatch_request;dispatching request
> DeleteJob on sd=10
> 12/11/2009 14:03:53;0008;   pbs_mom;Job;78.xxx;deleting job
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;deleting job 78.xxx in state
> EXITED
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removing job
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removed job script
> 12/11/2009 14:03:53;0080;   pbs_mom;Job;78.xxx;removed job file
> 12/11/2009 14:03:53;0080;   pbs_mom;Req;dis_request_read;decoding command
> Disconnect from PBS_Server
>
>
>
> Log of Node3
> ------------
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;do_rpp;got an internal task manager
> request in do_rpp
> 12/11/2009 14:03:40;0002;   pbs_mom;Svr;im_request;connect from
> 192.168.1.62:1022
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;received request 'JOIN_JOB'
> for job 78.xxx from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;im_request: JOIN_JOB 78.xxx
> node 1
> 12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;0: .blabla-2/0
> 12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;1: .blabla-3/0
> 12/11/2009 14:03:40;0001;   pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2
> numvnod=2
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;no group entry for group
> admin, user=raub, errno=0 (Success)
> 12/11/2009 14:03:40;0080;   pbs_mom;Job;78.xxx;removing job
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;do_rpp;got an internal task manager
> request in do_rpp
> 12/11/2009 14:03:40;0002;   pbs_mom;Svr;im_request;connect from
> 192.168.1.62:1022
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;received request 'ABORT_JOB'
> for job 78.xxx from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008;   pbs_mom;Job;78.xxx;ERROR:    received request
> 'ABORT_JOB' from 192.168.1.62:1022 for job '78.xxx' (job does not exist
> locally)
>
>
> Output of qmgr 'ps'
> -------------------
> create queue rhel
> set queue rhel queue_type = Execution
> set queue rhel from_route_only = True
> set queue rhel resources_max.opsys = RHEL
> set queue rhel resources_max.walltime = 360:00:00
> set queue rhel resources_min.opsys = RHEL
> set queue rhel enabled = True
> set queue rhel started = True
>
> create queue default
> set queue default queue_type = Route
> set queue default resources_default.opsys = SL
> set queue default route_destinations = SciLinux
> set queue default route_destinations += rhel
> set queue default enabled = True
> set queue default started = True
>
> set server acl_hosts = xxx
> set server default_queue = default
> set server log_events = 511
> set server mail_from = adm
> set server resources_default.ncpus = 1
> set server resources_default.nodes = 1
> set server resources_default.opsys = SL
> set server resources_default.walltime = 01:00:00
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 79
>
>
>
> Maui.cfg
> --------
> SERVERHOST xxx
>
> ADMIN1 root
> ADMINHOST localhost
>
> RMTYPE[0] PBS
> RMHOST[0] localhost
> RMSERVER[0] localhost
>
> RMPOLLINTERVAL 00:00:10
>
> SERVERPORT 40559
> SERVERMODE NORMAL
>
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
>
> ENFORCERESOURCELIMITS ON
>
> QUEUETIMEWEIGHT 1
>
> BACKFILLDEPTH 0
> BACKFILLMETRIC PROCS
> BACKFILLPOLICY BESTFIT
>
> QOSCFG[monopol] QFLAGS=DEDICATED
>
> CLASSWEIGHT 10
> CLASSCFG[cuda] QLIST=monopol QDEF=monopol
>
> --
> ---------------------------------------------------------
> | | Dr. rer. nat. Stephan Raub
> | | Dipl. Chem.
> | | Lehrstuhl für IT-Management / ZIM
> | | Heinrich-Heine-Universität Düsseldorf Universitätsstr. 1 /
> | | 25.41.O2.25-2
> | | 40225 Düsseldorf / Germany
> | |
> | | Tel: +49-211-811-3911
> ---------------------------------------------------------
>
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>

