[torqueusers] jobs being held in substate 22 JOB_SUBSTATE_DEPNHOLD

Garrick Staples garrick at clusterresources.com
Wed Oct 25 20:52:22 MDT 2006


On Wed, Oct 25, 2006 at 12:23:37PM -0700, Marc Schraffenberger alleged:
> We are running Torque 2.0.0p8 

Lots of subnode counting bugs were fixed in the 2.1 line.  I'd imagine
that 2.1.6 will handle this correctly.
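
For checking which jobs are still stuck before and after an upgrade, here is a minimal sketch (not part of TORQUE) that scans saved `qstat -f` style text, laid out like the dump quoted below, for jobs in state H with substate 22. The function names are hypothetical, and it assumes attributes appear as `name = value` lines with wrapped values continued on the following lines.

```python
import re

# "name = value" attribute lines, as in the qstat -f dump quoted below.
# Wrapped continuation lines (e.g. inside Variable_List) have no " = ".
ATTR_RE = re.compile(r"^\s*([A-Za-z_][\w.]*) =\s*(.*)$")


def parse_qstat_full(text):
    """Parse qstat -f style text into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None   # attributes of the job currently being read
    last = None    # most recent attribute name, for continuation lines
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("Job Id:"):
            attrs = jobs.setdefault(line.split(":", 1)[1].strip(), {})
            last = None
            continue
        if attrs is None or not line.strip():
            continue
        match = ATTR_RE.match(line)
        if match:
            last = match.group(1)
            attrs[last] = match.group(2)
        elif last is not None:
            # Wrapped value: join it onto the previous attribute.
            attrs[last] += line.strip()
    return jobs


def dependency_held(jobs):
    """Return IDs of jobs in state H with substate 22 (JOB_SUBSTATE_DEPNHOLD)."""
    return [job_id for job_id, attrs in jobs.items()
            if attrs.get("job_state") == "H" and attrs.get("substate") == "22"]
```

Feeding it the text of a saved `qstat -f` run, as in `dependency_held(parse_qstat_full(text))`, gives the list of job IDs to compare across server versions.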

 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Garrick Staples
> Sent: Wed 10/25/2006 9:28 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] jobs being held in substate 22 JOB_SUBSTATE_DEPNHOLD
>  
> On Tue, Oct 24, 2006 at 11:49:46AM -0500, Marc Schraffenberger alleged:
> > I have a large number of jobs that are being held because of
> > dependencies (at least that is what I gather from the job substate)
> > but I don't see why, since the execution time has passed and there are
> > only beforeany dependencies. I was wondering if anyone could help
> > clarify this for me.
> 
> What version of TORQUE is this?  We fixed some bugs a long time ago with
> failed jobs not properly releasing their deps.
> 
>  
> > Here are some details on a particular job (some other jobs depend on
> > this one, but they declare it as an "afterany" dependency):
> > 
> > Job Id: 495325.localhost
> >    Job_Name = t1073
> >    Job_Owner = cdrone at localhost
> >    job_state = H
> >    queue = mediumpriority
> >    server = localhost
> >    Checkpoint = u
> >    ctime = Wed Sep 20 01:17:00 2006
> >    depend = 
> >    beforeany:495511.localhost at localhost:495655.localhost at localhost:49
> >        5823.localhost at localhost:496046.localhost at localhost:497005.localhost at lo
> >        calhost:497086.localhost at localhost:497256.localhost at localhost:497351.lo
> >        .......
> >        st:517616.localhost at localhost:517668.localhost at localhost:517806.localho
> >        st at localhost:518008.localhost at localhost:518104.localhost at localhost:5182
> >        69.localhost at localhost:518459.localhost at localhost:519822.localhost at loca
> >        lhost:519957.localhost at localhost
> >    Error_Path = localhost://t1073.e495325
> >    Hold_Types = u
> >    Join_Path = n
> >    Keep_Files = n
> >    Mail_Points = a
> >    mtime = Tue Sep 26 01:07:45 2006
> >    Output_Path = localhost://t1073.o495325
> >    Priority = 0
> >    qtime = Wed Sep 20 01:17:00 2006
> >    Rerunable = True
> >    Resource_List.db_free = 1
> >    Resource_List.mem = 319mb
> >    Resource_List.nice = 0
> >    substate = 22
> >    Variable_List = PBS_O_HOME=/root,PBS_O_LOGNAME=root,
> >        PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bi
> >        n,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,PBS_O_HOST=localhost,
> >        PBS_O_WORKDIR=/,
> >        PBS_ARGUMENTS=-d3 -P --distribution 4 --accountid 180 --update,
> >        PBS_FILENAME=/usr/local/tsa/bidmgr/sebidmgr.sh,PBS_RETRIES=0,
> >        PBS_O_QUEUE=mediumpriority
> >    euser = cdrone
> >    egroup = cdrone
> >    queue_rank = 452584
> >    queue_type = E
> >    comment = Not Running: Strict fifo order
> > 
> > 
> > 
> > Job: 495325.localhost
> > 
> > 09/20/2006 01:17:00  S    enqueuing into mediumpriority, state 3 hop 1
> > 09/20/2006 01:17:00  S    Job Queued at request of cdrone at localhost,
> > owner = cdrone at localhost, job name = t1073, queue = mediumpriority
> > 09/20/2006 01:17:00  S    Dependency request for job rejected by
> > 491698.localhost
> > 09/20/2006 01:17:00  A    queue=mediumpriority
> > 09/20/2006 01:17:27  S    Job Modified at request of Scheduler at localhost
> > 09/20/2006 01:18:02  S    Dependency on job 492090.localhost released.
> > 09/20/2006 01:18:04  S    Dependency on job 491837.localhost released.
> > 09/20/2006 05:53:32  S    Dependency on job 493304.localhost released.
> > 09/20/2006 05:53:33  S    Dependency on job 493171.localhost released.
> > 09/20/2006 05:53:33  S    Dependency on job 493021.localhost released.
> > 09/20/2006 05:53:33  S    Dependency on job 492983.localhost released.
> > 09/20/2006 07:38:16  S    Dependency on job 493513.localhost released.
> > 09/20/2006 07:38:16  S    Dependency on job 493376.localhost released.
> > 09/20/2006 10:01:47  S    Dependency on job 493902.localhost released.
> > 09/20/2006 10:01:48  S    Dependency on job 493782.localhost released.
> > 09/20/2006 10:01:48  S    Dependency on job 493614.localhost released.
> > 09/20/2006 14:28:23  S    Dependency on job 494601.localhost released.
> > 09/20/2006 14:28:23  S    Dependency on job 494428.localhost released.
> > 09/20/2006 14:28:24  S    Dependency on job 494180.localhost released.
> > 09/20/2006 14:28:24  S    Dependency on job 494061.localhost released.
> > 09/20/2006 19:27:41  S    Dependency on job 495254.localhost released.
> > 09/20/2006 19:27:42  S    Dependency on job 495042.localhost released.
> > 09/20/2006 19:27:42  S    Dependency on job 494975.localhost released.
> > 09/20/2006 19:27:42  S    Dependency on job 494818.localhost released.
> > 09/20/2006 19:27:42  S    Dependency on job 494703.localhost released.
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
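
The dependency events in the quoted tracejob excerpt can be tallied with a rough sketch like the one below. It assumes raw tracejob text (without the "> " quote markers added by mail clients) and matches only the two message wordings that appear in the excerpt; the function name is hypothetical.

```python
import re

# Message wordings taken from the tracejob excerpt quoted above.
RELEASED_RE = re.compile(r"Dependency on job (\S+?) released\.")
REJECTED_RE = re.compile(r"Dependency request for job rejected by\s+(\S+)")


def summarize_dependency_events(trace_text):
    """Collect job IDs from 'released' and 'rejected by' dependency messages."""
    # Long messages wrap onto the next line, so collapse all whitespace first.
    flat = " ".join(trace_text.split())
    return {
        "released": RELEASED_RE.findall(flat),
        "rejected_by": REJECTED_RE.findall(flat),
    }
```

Comparing the `released` list against the job's `depend` attribute shows which dependency events the server has logged so far, and any `rejected_by` entry points at a job worth inspecting on its own.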

