[torqueusers] jobs stuck in Q
Adrian Sevcenco
Adrian.Sevcenco at cern.ch
Mon Dec 15 10:13:41 MST 2008
Greenseid, Joseph M. wrote:
> what scheduler are you using? are you using torque's scheduler, or
> maui, or something else?
Hi,
I am using maui. Do you think the problem could be there?
Now I see that when I try to restart maui I get:
ERROR: lost connection to server
ERROR: cannot request service (status)
This is my maui.cfg:
[root at grid01 maui]# cat maui.cfg
# MAUI configuration example
SERVERHOST grid01.spacescience.ro
ADMIN1 root
ADMIN3 edginfo rgma edguser
ADMINHOSTS grid01.spacescience.ro
RMCFG[base] TYPE=PBS
SERVERPORT 40559
SERVERMODE NORMAL
# Set PBS server polling interval. If you have short
# queues or/and jobs it is worth to set a short interval. (10 seconds)
RMPOLLINTERVAL 00:00:10
# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 1
# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.
DEFERTIME 00:01:00
# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS TRUE
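
One thing that could be ruled out first is whether anything is listening on the SERVERHOST/SERVERPORT pair from the file above. The following is only a rough sketch, assuming the two ERROR lines mean that the maui client commands cannot reach the maui daemon; `parse_maui_cfg` and `can_connect` are hypothetical helper names, not part of maui or torque.

```python
import socket


def parse_maui_cfg(text):
    """Return simple 'KEY value' settings, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Sample lines copied from the maui.cfg shown above.
SAMPLE = """\
# MAUI configuration example
SERVERHOST grid01.spacescience.ro
RMCFG[base] TYPE=PBS
SERVERPORT 40559
"""

cfg = parse_maui_cfg(SAMPLE)
host = cfg["SERVERHOST"]
port = int(cfg["SERVERPORT"])
print(host, port)
```

Calling `can_connect(host, port)` on the machine where the maui commands are run would then show whether a TCP connection to that port can be opened at all. It says nothing about why the daemon might be down.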
Any ideas, anyone?
Thanks,
Adrian
> --Joe
>
> ------------------------------------------------------------------------
> *From:* torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
> *Sent:* Mon 12/15/2008 10:40 AM
> *To:* torqueusers at supercluster.org
> *Subject:* [torqueusers] jobs stuck in Q
>
> Hi,
> I have a server on which jobs are stuck in the queue. I have this output
> from qstat -f:
> Job Id: 2.grid01.spacescience.ro
> Job_Name = STDIN
> Job_Owner = alice001 at grid01.spacescience.ro
> job_state = Q
> queue = alice
> server = grid01.spacescience.ro
> Checkpoint = u
> ctime = Mon Dec 15 17:19:39 2008
> Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Mon Dec 15 17:20:22 2008
> Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
> Priority = 0
> qtime = Mon Dec 15 17:20:49 2008
> Rerunable = True
> Resource_List.cput = 48:00:00
> Resource_List.walltime = 72:00:00
> Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=alice001,
>
> PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
> :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
> PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
> PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
> etime = Mon Dec 15 17:20:49 2008
> submit_args = -q alice
>
> and momctl on a worker node (wn) gives me this:
> [root at grid01 ~]# momctl -d 3 -h wn01
>
> Host: wn01.spacescience.ro/wn01.spacescience.ro Version: 2.3.0-snap.200801151629 PID: 7248
> Server[0]: grid01.spacescience.ro (172.16.0.254)
> Init Msgs Received: 0 hellos/1 cluster-addrs
> Init Msgs Sent: 1 hellos
> Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)
> Last Msg To Server: 21 seconds
> HomeDirectory: /var/spool/pbs/mom_priv
> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
> NOTE: syslog enabled
> MOM active: 284244 seconds
> Server Update Interval: 45 seconds
> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> MemLocked: TRUE (mlock)
> TCP Timeout: 20 seconds
> Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
> Copy Command: /usr/bin/scp -rpB
> NOTE: no local jobs detected
>
> diagnostics complete
>
> What could be wrong, and where should I look?
> Thanks for any help,
> Adrian
>
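
The timing fields in the momctl output quoted above can be pulled out and compared with a short script. This is only a sketch: the field labels are taken from that output, `parse_momctl` and `server_silent_since_start` are hypothetical helper names, and the 60 second slack is an arbitrary assumption. Whether a large "Last Msg From Server" value is actually a problem is not established in this thread.

```python
import re

# Labels as they appear in the momctl -d 3 output quoted above.
FIELDS = {
    "last_from_server": r"Last Msg From Server:\s+(\d+) seconds",
    "last_to_server": r"Last Msg To Server:\s+(\d+) seconds",
    "mom_active": r"MOM active:\s+(\d+) seconds",
    "update_interval": r"Server Update Interval:\s+(\d+) seconds",
}


def parse_momctl(text):
    """Extract the timing fields (in seconds) that are present in the text."""
    result = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            result[name] = int(match.group(1))
    return result


def server_silent_since_start(stats, slack=60):
    """True if the last server message is about as old as the MOM uptime."""
    return stats["mom_active"] - stats["last_from_server"] <= slack


# Values copied from the quoted output.
SAMPLE = """\
Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)
Last Msg To Server: 21 seconds
MOM active: 284244 seconds
Server Update Interval: 45 seconds
"""

stats = parse_momctl(SAMPLE)
print(stats)
print(server_silent_since_start(stats))
```

With the numbers quoted above, the last message from the server is only 2 seconds younger than the MOM uptime, so the check returns True for this sample.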