[torqueusers] job scheduling and queues

RB. Ezhilalan (Principal Physicist, CUH) RB.Ezhilalan at hse.ie
Thu Oct 11 09:58:50 MDT 2012


Hi All,

I am new to Torque and have been using it to schedule parallel Monte
Carlo calculations across multiple CPUs on different PCs.

I have set up two new queues, 'long' and 'short', besides the default
queue 'batch'. I noticed that when I submit multiple jobs (7 in my
test) without specifying any particular queue name, the pbs
server/scheduler starts them all immediately in the default 'batch'
queue. Running qstat -q straight after submission shows all 7 jobs
running (the number 7 under the 'R' column).

However, when I launch the same 7 jobs into either the 'short' or
'long' queue, the jobs sit in the queue instead: qstat -q then shows
1 job running and 6 jobs queued. The jobs do eventually get executed,
but they take longer to complete than the same jobs run through the
default 'batch' queue. Given that all CPUs were free in both cases,
why were the jobs scheduled immediately in the default queue but not
in the other queues? Could I get some advice on this?
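
For what it's worth, this is how I have been comparing the queue
definitions to look for a limit that might explain it (a rough sketch;
max_running and max_user_run are attributes I merely suspect could
differ between the queues, not something I have confirmed):

qmgr -c 'print queue batch'    # dump each queue definition and diff them
qmgr -c 'print queue short'
qmgr -c 'print queue long'
qstat -Qf short                # full queue status, including any limits
qstat -Qf long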

Many thanks,

Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel. 00353 21 4922533
Fax.00353 21 4921300
Email: rb.ezhilalan at hse.ie 

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of
torqueusers-request at supercluster.org
Sent: 11 October 2012 07:31
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 99, Issue 9

Send torqueusers mailing list submissions to
	torqueusers at supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
	torqueusers-request at supercluster.org

You can reach the person managing the list at
	torqueusers-owner at supercluster.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of torqueusers digest..."


Today's Topics:

   1. login shells not run in torque 4 and bash -l (Charles Henry)
   2. Maui is not submitting jobs to torque (Alessandra Forti)
   3. Re: Maui is not submitting jobs to torque (Bas van der Vlies)


----------------------------------------------------------------------

Message: 1
Date: Wed, 10 Oct 2012 14:30:27 -0500 (CDT)
From: Charles Henry <chenry at ittc.ku.edu>
Subject: [torqueusers] login shells not run in torque 4 and bash -l
To: torqueusers at supercluster.org
Message-ID: <135357755.673805.1349897427580.JavaMail.root at ittc.ku.edu>
Content-Type: text/plain; charset=utf-8

Hi list,

I have been following the torque 4 development, and I'm currently using
torque 4.1.2 on RHEL6.2.  I have found that I cannot get cluster jobs to
run correctly without using "#!/bin/bash -l" in each script.  A few
sites (academic and government) list this workaround in their cluster
FAQs.
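
For reference, the workaround amounts to forcing a login shell at the
top of every job script. A minimal sketch (the 'which mpirun' line is
just an illustrative check, not part of the workaround):

#!/bin/bash -l
# -l makes bash behave as a login shell, so /etc/profile is sourced
# and mpi-selector's environment ends up in the job's environment.
which mpirun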

Our site uses mpi-selector and needs to source /etc/profile for every
cluster job (interactive or not).  I have looked for settings and even
gone so far as reading the source code.  

The relevant settings are defined globally inside src/resmom/mom_main.c
... (line 205)
int      src_login_batch = TRUE;
int      src_login_interactive = TRUE;
...

and in src/resmom/start_exec.c
... (line 3736)
void source_login_shells_or_not(
...
  if (((TJE->is_interactive == TRUE) && (src_login_interactive == FALSE)) ||
      ((TJE->is_interactive != TRUE) && (src_login_batch == FALSE)))
...

Those values are declared as "extern int" in start_exec.c, so the
values from mom_main.c are accessible once the binaries are linked.
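
If I read the code right, these globals should correspond to MOM
config parameters. A sketch of what I would expect to be settable in
mom_priv/config (parameter names inferred from the variable names
above, not verified against the docs):

# mom_priv/config -- assumed names mirroring the globals above
$source_login_batch       true
$source_login_interactive true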

There's no error message from the source_login_shells_or_not function,
and the code looks very similar to the torque-3 code (except for being
wrapped up into functions).  Can anyone shed some light on the problem?

Chuck


------------------------------

Message: 2
Date: Wed, 10 Oct 2012 20:30:42 +0100
From: Alessandra Forti <Alessandra.Forti at cern.ch>
Subject: [torqueusers] Maui is not submitting jobs to torque
To: <torqueusers at supercluster.org>, <mauiusers at supercluster.org>
Message-ID: <5075CCE2.1090300 at cern.ch>
Keywords: CERN SpamKiller Note: -50
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I have installed a mini test cluster with torque and maui. We have used
maui/torque for years on our grid cluster and now we are upgrading to
torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination
maui doesn't seem to work correctly: when I submit jobs it behaves as
if there weren't any free resources. Even when I installed only torque
and maui with a bare-minimum configuration I got the same behaviour,
i.e.

1) When I submit, the jobs just remain queued:

[root@<server> maui]# qstat -an1

<server>:
                                                          Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS  TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ---- --- ------ ----- - -----
10.<server>          aforti   long     pbs-vm3.sh           --   --  --     --    --  Q   --
11.<server>          aforti   long     pbs-vm3.sh           --   --  --     --    --  Q   --

2) If I run qrun <jobid> the job runs, so I assume the problem is not
between torque server and torque mom (qrun bypasses the scheduler).
3) With the old versions showq displayed the WCLimit of the default
queue; now it displays 0 at first and then changes it by itself to
100 days:

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC REMAINING            STARTTIME


      0 Active Jobs       0 of   16 Processors Active (0.00%)
                          0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT              QUEUETIME

2                    aforti       Idle     1 99:23:59:59  Wed Oct 10 13:36:34
3                    aforti       Idle     1 99:23:59:59  Wed Oct 10 14:01:43
4                    aforti       Idle     1 99:23:59:59  Wed Oct 10 18:50:14
5                    aforti       Idle     1    00:00:00  Wed Oct 10 20:29:27

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT              QUEUETIME


Total Jobs: 4   Active Jobs: 0   Idle Jobs: 4   Blocked Jobs: 0
4) checkjob <jobid> just tells me the job cannot run in the default
partition, without giving any particular reason:

[.....]
PE:  1.00  StartPriority:  120
cannot select job 10 for partition DEFAULT (Class)

5) checknode sees the node as free, in case that wasn't clear from the
other commands:

[root@<server> maui]# checknode <node>

checking node <node>

State:      Idle  (in current state for 00:55:10)
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G  DISK: 1M
Utilized   Resources: SWAP: 202M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]

Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)

Reservations:
NOTE:  no reservations on node

6) When I use showbf -v, though, it says my nodes are blocked by
reservations, despite checknode clearly telling me there are no
reservations on that node. In our local maui.cfg there is a reservation
for 1 proc; I'm not sure why it blocks the whole node:

[root@<server2> server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59

  3 procs available with no timelimit

node <node2> is blocked by reservation sft.0.0 in INFINITY
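
For context, the sft.0.0 reservation comes from a standing reservation
in our maui.cfg, roughly like the following (a sketch from memory, so
the attributes in our real line may differ):

# one-proc standing reservation; 'sft' matches the reservation id above
SRCFG[sft]  RESOURCES=PROCS:1 PERIOD=INFINITY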
But to be sure I removed it, and even with the reservation gone and
maui.cfg reduced to the default version with nothing in it, it still
tells me the node is blocked by "reservation NONE in INFINITY":

[root@<server> maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58

 16 procs available with no timelimit

node <node> is blocked by reservation NONE in INFINITY
If I increase the maui loglevel to 9 I get hundreds of these messages:

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO:     no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO:     all clients connected. servicing requests
which leaves me perplexed, since elsewhere, at a different log level,
it does see the jobs waiting on the server; so some communication
happens and some doesn't:

10/10 20:27:24 INFO:     job '2' Priority: 410
10/10 20:27:24 INFO:     Cred: 0(00.0)  FS: 0(00.0)  Attr: 0(00.0)  Serv: 410(00.0)  Targ: 0(00.0)  Res: 0(00.0)  Us: 0(00.0)
10/10 20:27:24 INFO:     job '2'  priority: 410.30
10/10 20:27:24 MJobGetStartPriority(3,0,Priority,NULL)
10/10 20:27:24 INFO:     job '3' Priority: 385
10/10 20:27:24 INFO:     Cred: 0(00.0)  FS: 0(00.0)  Attr: 0(00.0)  Serv: 385(00.0)  Targ: 0(00.0)  Res: 0(00.0)  Us: 0(00.0)
10/10 20:27:24 INFO:     job '3'  priority: 385.30
10/10 20:27:24 MJobGetStartPriority(4,0,Priority,NULL)
10/10 20:27:24 INFO:     job '4' Priority: 97
10/10 20:27:24 INFO:     Cred: 0(00.0)  FS: 0(00.0)  Attr: 0(00.0)  Serv: 97(00.0)  Targ: 0(00.0)  Res: 0(00.0)  Us: 0(00.0)
10/10 20:27:24 INFO:     job '4'  priority: 97.17
Thanks for any help. Here are the RPMs I used:

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5
The maui.cfg:

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST              <server>
ADMIN1                  root
ADMINHOST               <server>
RMTYPE[0]               PBS
RMHOST[0]               <server>
RMSERVER[0]             <server>

SERVERPORT              40559
SERVERMODE              NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL          00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE                 /var/log/maui.log
LOGFILEMAXSIZE          10000000
LOGLEVEL                3

And the Torque config:

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts += <server>
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = <server>
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12
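
For completeness: this dump has the shape of qmgr's print output, so
it can be captured and replayed on a fresh server like this (standard
qmgr usage, as far as I know):

qmgr -c 'print server' > torque.cfg   # dump server and queue definitions
qmgr < torque.cfg                     # replay them elsewhere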

-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)





------------------------------

Message: 3
Date: Thu, 11 Oct 2012 06:31:11 +0000
From: Bas van der Vlies <basv at sara.nl>
Subject: Re: [torqueusers] Maui is not submitting jobs to torque
To: Torque Users Mailing List <torqueusers at supercluster.org>
Cc: "<mauiusers at supercluster.org>" <mauiusers at supercluster.org>
Message-ID:
	<74EB35DC444C754DA400390F673C23D6AF6F17 at sara-exch-3.ka.sara.nl>
Content-Type: text/plain; charset="iso-8859-1"


On 10 Oct. 2012, at 21:30, Alessandra Forti
<Alessandra.Forti at cern.ch> wrote:

node <node2> is blocked by reservation sft.0.0 in INFINITY

That message indicates that there is a reservation on the node. What is
the output of showres?
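
For example (a quick sketch; if I remember the flags right, -n breaks
the reservation information down per node):

showres        # list all reservations maui currently knows about
showres -n     # show reservation information per node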

--
Bas van der Vlies
basv at sara.nl





------------------------------

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


End of torqueusers Digest, Vol 99, Issue 9
******************************************

