[torqueusers] jobs being held in substate 22 JOB_SUBSTATE_DEPNHOLD
Garrick Staples
garrick at clusterresources.com
Wed Oct 25 20:52:22 MDT 2006
On Wed, Oct 25, 2006 at 12:23:37PM -0700, Marc Schraffenberger alleged:
> We are running Torque 2.0.0p8
Lots of subnode counting bugs were fixed in the 2.1 line. I'd imagine
that 2.1.6 will handle this correctly.
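
For anyone who wants to count the affected jobs before upgrading, here is a
minimal Python sketch that picks out jobs held in substate 22 from
"qstat -f" text. It is not part of TORQUE, and the function names
(parse_qstat_full, held_on_dependencies) are made up for illustration. It
assumes the output is laid out like the listing quoted below: a "Job Id:"
line, then indented "name = value" attributes, with long values wrapped onto
continuation lines. The continuation handling is a heuristic and may need
adjusting for other versions.

```python
import re

# An attribute line looks like "    name = value"; wrapped continuation
# lines (e.g. the long depend list) do not match this pattern.
ATTR_RE = re.compile(r"^\s+([A-Za-z_][\w.]*)\s=\s?(.*)$")

# Substate 22 is JOB_SUBSTATE_DEPNHOLD, as named in the thread subject.
DEPNHOLD = "22"


def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None   # attributes of the job currently being read
    last = None    # name of the most recent attribute, for continuations
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs = {}
            jobs[line.split(":", 1)[1].strip()] = attrs
            last = None
            continue
        if attrs is None or not line.strip():
            continue
        match = ATTR_RE.match(line)
        if match:
            last = match.group(1)
            attrs[last] = match.group(2).strip()
        elif last is not None:
            # Wrapped value: glue it back onto the previous attribute.
            attrs[last] += line.strip()
    return jobs


def held_on_dependencies(jobs):
    """Return sorted ids of jobs in state H with substate 22."""
    return sorted(
        job_id
        for job_id, attrs in jobs.items()
        if attrs.get("job_state") == "H" and attrs.get("substate") == DEPNHOLD
    )
```

To try it, save the output of "qstat -f" to a file, read the file into a
string, and pass that string to parse_qstat_full(); the depend attribute of
each job returned by held_on_dependencies() then shows which jobs it is still
tied to.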
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Garrick Staples
> Sent: Wed 10/25/2006 9:28 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] jobs being held in substate 22 JOB_SUBSTATE_DEPNHOLD
>
> On Tue, Oct 24, 2006 at 11:49:46AM -0500, Marc Schraffenberger alleged:
> > I have a large number of jobs that are being held because of
> > dependencies (at least that is what I gather from the job substate)
> > but I don't see why since the execution time has passed and there are
> > only beforeany dependencies. I was wondering if anyone could help
> > clarify this for me.
>
> What version of TORQUE is this? We fixed some bugs a long time ago with
> failed jobs not properly releasing their deps.
>
>
> > Here are some details on a particular job (some other jobs have
> > dependencies on this one but have it in the "afterany" type):
> >
> > Job Id: 495325.localhost
> > Job_Name = t1073
> > Job_Owner = cdrone at localhost
> > job_state = H
> > queue = mediumpriority
> > server = localhost
> > Checkpoint = u
> > ctime = Wed Sep 20 01:17:00 2006
> > depend =
> > beforeany:495511.localhost at localhost:495655.localhost at localhost:49
> > 5823.localhost at localhost:496046.localhost at localhost:497005.localhost at lo
> > calhost:497086.localhost at localhost:497256.localhost at localhost:497351.lo
> > .......
> > st:517616.localhost at localhost:517668.localhost at localhost:517806.localho
> > st at localhost:518008.localhost at localhost:518104.localhost at localhost:5182
> > 69.localhost at localhost:518459.localhost at localhost:519822.localhost at loca
> > lhost:519957.localhost at localhost
> > Error_Path = localhost://t1073.e495325
> > Hold_Types = u
> > Join_Path = n
> > Keep_Files = n
> > Mail_Points = a
> > mtime = Tue Sep 26 01:07:45 2006
> > Output_Path = localhost://t1073.o495325
> > Priority = 0
> > qtime = Wed Sep 20 01:17:00 2006
> > Rerunable = True
> > Resource_List.db_free = 1
> > Resource_List.mem = 319mb
> > Resource_List.nice = 0
> > substate = 22
> > Variable_List = PBS_O_HOME=/root,PBS_O_LOGNAME=root,
> > PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bi
> > n,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,PBS_O_HOST=localhost,
> > PBS_O_WORKDIR=/,
> > PBS_ARGUMENTS=-d3 -P --distribution 4 --accountid 180 --update,
> > PBS_FILENAME=/usr/local/tsa/bidmgr/sebidmgr.sh,PBS_RETRIES=0,
> > PBS_O_QUEUE=mediumpriority
> > euser = cdrone
> > egroup = cdrone
> > queue_rank = 452584
> > queue_type = E
> > comment = Not Running: Strict fifo order
> >
> >
> >
> > Job: 495325.localhost
> >
> > 09/20/2006 01:17:00 S enqueuing into mediumpriority, state 3 hop 1
> > 09/20/2006 01:17:00 S Job Queued at request of cdrone at localhost,
> > owner = cdrone at localhost, job name = t1073, queue = mediumpriority
> > 09/20/2006 01:17:00 S Dependency request for job rejected by
> > 491698.localhost
> > 09/20/2006 01:17:00 A queue=mediumpriority
> > 09/20/2006 01:17:27 S Job Modified at request of Scheduler at localhost
> > 09/20/2006 01:18:02 S Dependency on job 492090.localhost released.
> > 09/20/2006 01:18:04 S Dependency on job 491837.localhost released.
> > 09/20/2006 05:53:32 S Dependency on job 493304.localhost released.
> > 09/20/2006 05:53:33 S Dependency on job 493171.localhost released.
> > 09/20/2006 05:53:33 S Dependency on job 493021.localhost released.
> > 09/20/2006 05:53:33 S Dependency on job 492983.localhost released.
> > 09/20/2006 07:38:16 S Dependency on job 493513.localhost released.
> > 09/20/2006 07:38:16 S Dependency on job 493376.localhost released.
> > 09/20/2006 10:01:47 S Dependency on job 493902.localhost released.
> > 09/20/2006 10:01:48 S Dependency on job 493782.localhost released.
> > 09/20/2006 10:01:48 S Dependency on job 493614.localhost released.
> > 09/20/2006 14:28:23 S Dependency on job 494601.localhost released.
> > 09/20/2006 14:28:23 S Dependency on job 494428.localhost released.
> > 09/20/2006 14:28:24 S Dependency on job 494180.localhost released.
> > 09/20/2006 14:28:24 S Dependency on job 494061.localhost released.
> > 09/20/2006 19:27:41 S Dependency on job 495254.localhost released.
> > 09/20/2006 19:27:42 S Dependency on job 495042.localhost released.
> > 09/20/2006 19:27:42 S Dependency on job 494975.localhost released.
> > 09/20/2006 19:27:42 S Dependency on job 494818.localhost released.
> > 09/20/2006 19:27:42 S Dependency on job 494703.localhost released.
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers