[torqueusers] jobs stuck in Q

rishi pathak mailmaverick666 at gmail.com
Mon Dec 15 23:20:52 MST 2008


Check whether pbs_sched is running. Also check that iptables is turned off
on the pbs_server host.
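
A rough sketch of those checks, to be run on the pbs_server host. It assumes
a Linux box where pgrep and iptables are installed; the helper names
is_running and report are made up for illustration, and maui is included
only because this thread says it is the scheduler in use.

```shell
#!/bin/sh
# Sketch only: daemon names come from this thread; adjust them for your site.

# Return 0 if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Print a one-line status for a daemon name.
report() {
    if is_running "$1"; then
        echo "$1: running"
    else
        echo "$1: NOT running"
    fi
}

report pbs_server
report pbs_sched   # TORQUE's bundled scheduler
report maui        # the scheduler mentioned in this thread

# List the active firewall rules (normally needs root). Chains with
# policy ACCEPT and no rules mean iptables is not filtering anything.
if command -v iptables >/dev/null 2>&1; then
    iptables -L -n 2>/dev/null || echo "iptables: could not list rules (run as root?)"
else
    echo "iptables: command not found"
fi
```

If a scheduler shows as NOT running, or iptables lists DROP or REJECT rules,
that is the first place to look.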

On Mon, Dec 15, 2008 at 10:43 PM, Adrian Sevcenco
<Adrian.Sevcenco at cern.ch> wrote:

> Greenseid, Joseph M. wrote:
> > what scheduler are you using?  are you using torque's scheduler, or
> > maui, or something else?
> Hi,
> I am using maui. Do you think the problem could be there?
> Now I see that when I try to restart maui I get:
> ERROR:    lost connection to server
> ERROR:    cannot request service (status)
> This is my maui.cfg:
> [root at grid01 maui]# cat maui.cfg
> # MAUI configuration example
>
> SERVERHOST              grid01.spacescience.ro
> ADMIN1                  root
> ADMIN3                  edginfo rgma edguser
> ADMINHOSTS              grid01.spacescience.ro
> RMCFG[base]             TYPE=PBS
> SERVERPORT              40559
> SERVERMODE              NORMAL
>
> # Set PBS server polling interval. If you have short
> # queues or/and jobs it is worth to set a short interval. (10 seconds)
>
> RMPOLLINTERVAL        00:00:10
>
> # a max. 10 MByte log file in a logical location
>
> LOGFILE               /var/log/maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              1
>
> # Set the delay to 1 minute before Maui tries to run a job again,
> # in case it failed to run the first time.
> # The default value is 1 hour.
>
> DEFERTIME       00:01:00
>
> # Necessary for MPI grid jobs
> ENABLEMULTIREQJOBS TRUE
>
> Any ideas, anyone?
> Thanks,
> Adrian
>
>
> > --Joe
> >
> > ------------------------------------------------------------------------
> > *From:* torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
> > *Sent:* Mon 12/15/2008 10:40 AM
> > *To:* torqueusers at supercluster.org
> > *Subject:* [torqueusers] jobs stuck in Q
> >
> > Hi,
> > I have a server on which jobs are stuck in the queue. I have this output
> > from qstat -f:
> > Job Id: 2.grid01.spacescience.ro
> >     Job_Name = STDIN
> >     Job_Owner = alice001 at grid01.spacescience.ro
> >     job_state = Q
> >     queue = alice
> >     server = grid01.spacescience.ro
> >     Checkpoint = u
> >     ctime = Mon Dec 15 17:19:39 2008
> >     Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
> >     Hold_Types = n
> >     Join_Path = n
> >     Keep_Files = n
> >     Mail_Points = a
> >     mtime = Mon Dec 15 17:20:22 2008
> >     Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
> >     Priority = 0
> >     qtime = Mon Dec 15 17:20:49 2008
> >     Rerunable = True
> >     Resource_List.cput = 48:00:00
> >     Resource_List.walltime = 72:00:00
> >     Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
> >         PBS_O_LOGNAME=alice001,
> >         PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
> >         :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
> >         PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
> >         PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
> >         PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
> >     etime = Mon Dec 15 17:20:49 2008
> >     submit_args = -q alice
> >
> > and momctl against a worker node (wn) gives me this:
> > [root at grid01 ~]# momctl -d 3 -h wn01
> >
> > Host: wn01.spacescience.ro/wn01.spacescience.ro   Version: 2.3.0-snap.200801151629   PID: 7248
> > Server[0]: grid01.spacescience.ro (172.16.0.254)
> >   Init Msgs Received:     0 hellos/1 cluster-addrs
> >   Init Msgs Sent:         1 hellos
> >   Last Msg From Server:   284242 seconds (CLUSTER_ADDRS)
> >   Last Msg To Server:     21 seconds
> > HomeDirectory:          /var/spool/pbs/mom_priv
> > stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
> > NOTE:  syslog enabled
> > MOM active:             284244 seconds
> > Server Update Interval: 45 seconds
> > LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> > Communication Model:    RPP
> > MemLocked:              TRUE  (mlock)
> > TCP Timeout:            20 seconds
> > Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
> > Alarm Time:             0 of 10 seconds
> > Trusted Client List:    172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
> > Copy Command:           /usr/bin/scp -rpB
> > NOTE:  no local jobs detected
> >
> > diagnostics complete
> >
> > What could be wrong, and where should I look?
> > Thanks for any help,
> > Adrian
> >
>


-- 
Regards--
Rishi Pathak
Pune-Maharastra