[torqueusers] jobs queued with WN assigned

Arnau Bria arnaubria at pic.es
Thu Mar 4 09:41:10 MST 2010


Hi all,

I'm facing an old problem again.

I have some jobs in Q state but with a WN (worker node) already assigned:

# qstat -f 188497
Job Id: 188497.pbs01.pic.es
    Job_Name = STDIN
    Job_Owner = iatprd004 at ifaece01.pic.es
    job_state = Q
    queue = glong_sl5
    server = pbs01.pic.es
    Checkpoint = n
    ctime = Thu Mar  4 12:08:55 2010
    Error_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i1
	8276/batch.err
    exec_host = td133.pic.es/6
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Thu Mar  4 17:21:45 2010
    Output_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i
	18276/batch.out
    Priority = 0
    qtime = Thu Mar  4 12:08:55 2010
    Rerunable = False
    Resource_List.cput = 48:00:00
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 72:00:00
    Shell_Path_List = /bin/sh
    stagein = globus-cache-export.i18276.gpg at ifaece01.pic.es:/home/iatprd004/.
	lcgjm/globus-cache-export.i18276/globus-cache-export.i18276.gpg
    substate = 16
    Variable_List = PBS_O_HOME=/home/iatprd004,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=iatprd004,
	PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
	glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
	in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
	PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=ifaece01.pic.es,PBS_O_HOST=ifaece01.pic.es,
	PBS_O_WORKDIR=/home/iatprd004,
	X509_USER_PROXY=/home/iatprd004/.globus/job/ifaece01.pic.es/10553.126
	7701248/x509_up,
	GLOBUS_REMOTE_IO_URL=/home/iatprd004/.lcgjm/.remote_io_ptr/remote_io_
	file-10553.1267701248,GLOBUS_LOCATION=/opt/globus,
	GLOBUS_GRAM_JOB_CONTACT=https://ifaece01.pic.es:20016/10553/126770124
	8/,GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ifaece01.pic.es:20021/,
	SCRATCH_DIRECTORY=/home/iatprd004/,HOME=/home/iatprd004,
	LOGNAME=iatprd004,PANDA_JSID=Xavier-ES,
	GTAG=http://vobox02.pic.es/PIC-Production-Factory/logs//2010-03-04/if
	aece01.pic.es_2119_jobmanager-lcgpbs-glong_sl5/886439.0.out,
	FACTORYQUEUE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
	GLOBUS_CE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
	PBS_O_QUEUE=glong_sl5
    euser = iatprd004
    egroup = iatprd
    hashname = 188497.pbs0
    queue_rank = 187668
    queue_type = E
    etime = Thu Mar  4 12:08:55 2010
    start_time = Thu Mar  4 12:10:03 2010
    start_count = 1


They're at the top of the Maui queue, and it seems that Maui is not able to
schedule any other jobs.

checkjob complains about the stagein file:

# checkjob 188497


checking job 188497

State: Idle
Creds:  user:iatprd004  group:iatprd  class:glong_sl5  qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Mar  4 12:08:55
  (Time Queued  Total: 5:16:52  Eligible: 5:11:01)

StartDate: 00:00:01  Thu Mar  4 17:25:48
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc5_x64]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 157
PartitionMask: [ALL]
Reservation '188497' (00:00:01 -> 3:00:00:01  Duration: 3:00:00:00)
Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match input file stagein location'
PE:  1.00  StartPriority:  463
cannot select job 188497 for partition DEFAULT (startdate in '00:00:01')


Grepping the client (MOM) logs on the assigned node:
# grep 188497 /var/spool/pbs/mom_logs/20100304
03/04/2010 12:19:47;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td133.pic.es MSG=modify job failed, unknown job 188497.pbs01.pic.es), aux=0, type=ModifyJob, from PBS_Server at pbs01.pic.es

And grepping the server logs:

# grep 188497 /var/spool/pbs/server_logs/20100304
03/04/2010 12:08:55;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
03/04/2010 12:08:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Queued at request of iatprd004 at ifaece01.pic.es, owner = iatprd004 at ifaece01.pic.es, job name = STDIN, queue = glong_sl5
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Run at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;MOM rejected modify request, error: 15001
03/04/2010 12:10:09;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 188497.pbs01.pic.es
03/04/2010 12:10:09;0008;PBS_Server;Job;188497.pbs01.pic.es;unable to run job, MOM rejected/rc=1
03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:01:57;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
03/04/2010 16:01:57;0086;PBS_Server;Job;188497.pbs01.pic.es;Requeueing job, substate: 16 Requeued in queue: glong_sl5
03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es



It seems that the stagein was not completely done and the job is out of
control. Usually such jobs end up in W state, but now that does not seem
to happen; they remain in Q.
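To spot this condition across the whole queue, the `qstat -f` output can be scanned for jobs that report state Q but already carry an exec_host. This is just a sketch I use for illustration (the function name and awk logic are mine, not part of TORQUE); it reads `qstat -f` output on stdin:

```shell
# List job ids that are in state Q but already have an exec_host
# assigned (the "stuck" condition described above).
# Usage:  qstat -f | find_stuck_jobs
find_stuck_jobs() {
  awk '
    # A new "Job Id:" line starts a new record: flush the previous one.
    /^Job Id:/     { flush(); id = $3 }
    /job_state = / { state = $3 }
    /exec_host = / { host = $3 }
    END            { flush() }
    function flush() {
      # Print only jobs still queued (Q) that already have a host.
      if (id != "" && state == "Q" && host != "") print id
      id = ""; state = ""; host = ""
    }
  '
}
```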

When I kill them by hand and restart Maui, things start working again.
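Concretely, the manual recovery can be sketched as below. `qdel -p` force-purges the job on the server even though the MOM no longer knows about it (which is why a plain `qdel` can fail here); the `DRY_RUN` guard and function name are my own additions so the sketch only prints the commands unless you clear it on a real pbs_server host:

```shell
# Hedged sketch of the manual recovery: force-purge the stuck job,
# then restart Maui so scheduling resumes.
DRY_RUN=1
purge_stuck_job() {
  jobid="$1"
  if [ -n "$DRY_RUN" ]; then
    # Dry run: just show what would be executed.
    echo "qdel -p $jobid"          # -p purges even if the MOM lost the job
    echo "service maui restart"
  else
    qdel -p "$jobid" && service maui restart
  fi
}
purge_stuck_job 188497.pbs01.pic.es
```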

# rpm -qa|grep torque
torque-2.3.0-snap.200801151629.2cri.slc4
torque-server-2.3.0-snap.200801151629.2cri.slc4
torque-client-2.3.0-snap.200801151629.2cri.slc4


Has anyone faced this before? Any clue on what's happening and how to
solve it?

TIA,
Arnau
