[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Arnau Bria arnaubria at pic.es
Thu Jun 3 10:34:18 MDT 2010


On Thu, 3 Jun 2010 18:14:12 +0200
Arnau Bria wrote:

Hi again,

> 
> After killing all jobs that had reached walltime and restarting
> pbs_server (because it hung for more than 5 minutes), qstat -r (or
> qstat) does not show usage time for any running job:

It has started showing _some_ entries, but not all. Apart from that, the
times make no sense. Could someone take a look at this qstat/tracejob output?

[root at pbs02 ~]# qstat  10593114.pbs02.pic.es
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
10593114.pbs02            STDIN            atpilot002      01:01:27 R glong_sl5      
[root at pbs02 ~]# qstat -r  10593114.pbs02.pic.es

pbs02.pic.es: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10593114.pbs02.p     atpilot0 glong_sl STDIN             11117     1   1    --    --  R 140:3

[root at pbs02 ~]# qstat -f 10593114.pbs02.pic.es
Job Id: 10593114.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = atpilot002 at ce07.pic.es
    resources_used.cput = 01:01:27
    resources_used.mem = 582176kb
    resources_used.vmem = 1384056kb
    resources_used.walltime = 140:36:50
    job_state = R
    queue = glong_sl5
    server = pbs02.pic.es
    Checkpoint = n
    ctime = Tue Jun  1 11:47:42 2010
    Error_Path = ce07.pic.es:/home/atpilot002/.lcgjm/globus-cache-export.B1089
	3/batch.err
    exec_host = td482.pic.es/1
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Tue Jun  1 11:49:39 2010
    Output_Path = ce07.pic.es:/home/atpilot002/.lcgjm/globus-cache-export.B108
	93/batch.out
    Priority = 0
    qtime = Tue Jun  1 11:47:42 2010
    Rerunable = False
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    session_id = 11117
    Shell_Path_List = /bin/sh
    stagein = globus-cache-export.B10893.gpg at ce07.pic.es:/home/atpilot002/.lcg
	jm/globus-cache-export.B10893/globus-cache-export.B10893.gpg
    substate = 42
    Variable_List = PBS_O_HOME=/home/atpilot002,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=atpilot002,
	PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
	glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
	in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
	PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=pbs02.pic.es,PBS_O_WORKDIR=/home/atpilot002,
	X509_USER_PROXY=/home/atpilot002/.globus/job/ce07.pic.es/9247.1275385
	581/x509_up,
	GLOBUS_REMOTE_IO_URL=/home/atpilot002/.lcgjm/.remote_io_ptr/remote_io
	_file-9247.1275385581,GLOBUS_LOCATION=/opt/globus,
	GLOBUS_GRAM_JOB_CONTACT=https://ce07.pic.es:20077/9247/1275385581/,
	GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ce07.pic.es:20078/,
	SCRATCH_DIRECTORY=/home/atpilot002/,HOME=/home/atpilot002,
	LOGNAME=atpilot002,PANDA_JSID=Xavier-ES,
	GTAG=http://vobox02.pic.es/PIC-Analysis-Factory/logs//2010-06-01/ANAL
	Y_PIC/1715236.3.out,FACTORYQUEUE=ANALY_PIC,
	GLOBUS_CE=ce07.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
	PBS_O_QUEUE=glong_sl5,PBS_O_HOST=ce07.pic.es
    euser = atpilot002
    egroup = atpilot
    hashname = 10593114.pbs02.pic.es
    queue_rank = 450537
    queue_type = E
    etime = Tue Jun  1 11:47:42 2010
    start_time = Tue Jun  1 11:49:39 2010
    start_count = 1
    fault_tolerant = False

[root at pbs02 ~]# tracejob -n4 10593114
Job: 10593114.pbs02.pic.es

06/01/2010 11:47:42  S    enqueuing into glong_sl5, state 1 hop 1
06/01/2010 11:47:42  S    Job Queued at request of atpilot002 at ce07.pic.es, owner = atpilot002 at ce07.pic.es, job name = STDIN, queue = glong_sl5
06/01/2010 11:47:42  A    queue=glong_sl5
06/01/2010 11:49:38  S    Job Modified at request of root at pbs02.pic.es
06/01/2010 11:49:38  S    Job Run at request of root at pbs02.pic.es
06/01/2010 11:49:38  S    Job Modified at request of root at pbs02.pic.es
06/01/2010 11:49:38  S    post_modify_req: PBSE_UNKJOBID for job 10593114.pbs02.pic.es in state RUNNING-STAGEGO, dest = td482.pic.es
06/01/2010 11:49:39  A    user=atpilot002 group=atpilot jobname=STDIN queue=glong_sl5 ctime=1275385662 qtime=1275385662 etime=1275385662 start=1275385779
                          owner=atpilot002 at ce07.pic.es exec_host=td482.pic.es/1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 
06/01/2010 13:03:12  S    enqueuing into glong_sl5, state 4 hop 1
06/01/2010 13:03:12  S    Requeueing job, substate: 42 Requeued in queue: glong_sl5
06/02/2010 10:48:25  S    enqueuing into glong_sl5, state 4 hop 1
06/02/2010 10:48:25  S    Requeueing job, substate: 42 Requeued in queue: glong_sl5
06/03/2010 18:03:27  S    enqueuing into glong_sl5, state 4 hop 1
06/03/2010 18:03:27  S    Requeueing job, substate: 42 Requeued in queue: glong_sl5


A few questions:
If it was submitted on the 1st, how is it possible that it has run for 140
hours?
If it has Rerunable = False, why has it been requeued 3 times?
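
For reference, here is a minimal sketch of the arithmetic behind the first question. It compares the walltime reported by qstat -f with the longest time the job could possibly have been running. The start epoch (1275385779) comes from the accounting record in the tracejob output and the walltime (140:36:50) from qstat -f. Using the list timestamp of this message (10:34:18 MDT, i.e. UTC-6) as an upper bound for "now" is an assumption, and the helper name parse_hms is only illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_hms(text):
    """Convert an 'HHH:MM:SS' string, as printed by qstat -f, to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# start=1275385779 from the tracejob accounting ('A') record.
start = datetime.fromtimestamp(1275385779, tz=timezone.utc)

# Assumption: the list timestamp of this message is a safe upper bound
# for the moment qstat was run (Thu Jun 3 10:34:18 MDT 2010, UTC-6).
now = datetime(2010, 6, 3, 10, 34, 18, tzinfo=timezone(timedelta(hours=-6)))

reported = parse_hms("140:36:50")  # resources_used.walltime
possible = now - start             # longest the job can have been running

print("possible :", possible)
print("reported :", reported)
print("excess   :", reported - possible)
```

With these inputs the job cannot have been running for more than about 54 hours 45 minutes, so the reported walltime is roughly 86 hours more than is possible.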


TIA,
Arnau

