[torqueusers] jobs queued with WN assigned
Arnau Bria
arnaubria at pic.es
Thu Mar 4 09:41:10 MST 2010
Hi all,
I'm facing an old problem again.
I have some jobs in Q state that already have a worker node (WN) assigned:
# qstat -f 188497
Job Id: 188497.pbs01.pic.es
Job_Name = STDIN
Job_Owner = iatprd004 at ifaece01.pic.es
job_state = Q
queue = glong_sl5
server = pbs01.pic.es
Checkpoint = n
ctime = Thu Mar 4 12:08:55 2010
Error_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i1
8276/batch.err
exec_host = td133.pic.es/6
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Thu Mar 4 17:21:45 2010
Output_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i
18276/batch.out
Priority = 0
qtime = Thu Mar 4 12:08:55 2010
Rerunable = False
Resource_List.cput = 48:00:00
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 72:00:00
Shell_Path_List = /bin/sh
stagein = globus-cache-export.i18276.gpg at ifaece01.pic.es:/home/iatprd004/.
lcgjm/globus-cache-export.i18276/globus-cache-export.i18276.gpg
substate = 16
Variable_List = PBS_O_HOME=/home/iatprd004,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=iatprd004,
PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
PBS_SERVER=ifaece01.pic.es,PBS_O_HOST=ifaece01.pic.es,
PBS_O_WORKDIR=/home/iatprd004,
X509_USER_PROXY=/home/iatprd004/.globus/job/ifaece01.pic.es/10553.126
7701248/x509_up,
GLOBUS_REMOTE_IO_URL=/home/iatprd004/.lcgjm/.remote_io_ptr/remote_io_
file-10553.1267701248,GLOBUS_LOCATION=/opt/globus,
GLOBUS_GRAM_JOB_CONTACT=https://ifaece01.pic.es:20016/10553/126770124
8/,GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ifaece01.pic.es:20021/,
SCRATCH_DIRECTORY=/home/iatprd004/,HOME=/home/iatprd004,
LOGNAME=iatprd004,PANDA_JSID=Xavier-ES,
GTAG=http://vobox02.pic.es/PIC-Production-Factory/logs//2010-03-04/if
aece01.pic.es_2119_jobmanager-lcgpbs-glong_sl5/886439.0.out,
FACTORYQUEUE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
GLOBUS_CE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
PBS_O_QUEUE=glong_sl5
euser = iatprd004
egroup = iatprd
hashname = 188497.pbs0
queue_rank = 187668
queue_type = E
etime = Thu Mar 4 12:08:55 2010
start_time = Thu Mar 4 12:10:03 2010
start_count = 1
They are at the top of the Maui queue, and it seems that Maui is not able to
schedule other jobs.
checkjob complains about the stagein (input) file:
# checkjob 188497
checking job 188497
State: Idle
Creds: user:iatprd004 group:iatprd class:glong_sl5 qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Mar 4 12:08:55
(Time Queued Total: 5:16:52 Eligible: 5:11:01)
StartDate: 00:00:01 Thu Mar 4 17:25:48
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc5_x64]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 157
PartitionMask: [ALL]
Reservation '188497' (00:00:01 -> 3:00:00:01 Duration: 3:00:00:00)
Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match input file stagein location'
PE: 1.00 StartPriority: 463
cannot select job 188497 for partition DEFAULT (startdate in '00:00:01')
Grepping the client (pbs_mom) logs:
# grep 188497 /var/spool/pbs/mom_logs/20100304
03/04/2010 12:19:47;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td133.pic.es MSG=modify job failed, unknown job 188497.pbs01.pic.es), aux=0, type=ModifyJob, from PBS_Server at pbs01.pic.es
And grepping the server logs:
# grep 188497 /var/spool/pbs/server_logs/20100304
03/04/2010 12:08:55;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
03/04/2010 12:08:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Queued at request of iatprd004 at ifaece01.pic.es, owner = iatprd004 at ifaece01.pic.es, job name = STDIN, queue = glong_sl5
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Run at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;MOM rejected modify request, error: 15001
03/04/2010 12:10:09;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 188497.pbs01.pic.es
03/04/2010 12:10:09;0008;PBS_Server;Job;188497.pbs01.pic.es;unable to run job, MOM rejected/rc=1
03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:01:57;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
03/04/2010 16:01:57;0086;PBS_Server;Job;188497.pbs01.pic.es;Requeueing job, substate: 16 Requeued in queue: glong_sl5
03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
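The repeated "Job Modified" entries in the server log can be tallied with a small script. This is only a minimal sketch: it assumes the semicolon-separated layout seen in the lines above (timestamp;code;daemon;class;name;message), and the function name summarize_server_log is made up for illustration.

```python
from collections import Counter


def summarize_server_log(lines, job_id):
    """Count how often each message appears for one job in a server log.

    Assumes six semicolon-separated fields per line, as in the excerpt:
    timestamp;code;daemon;class;name;message
    """
    counts = Counter()
    for line in lines:
        # Split at most 5 times so semicolons inside the message survive.
        parts = line.rstrip("\n").split(";", 5)
        if len(parts) != 6:
            continue  # not a log record in the assumed layout
        _timestamp, _code, _daemon, _klass, name, message = parts
        # Records logged against the server itself (name "PBS_Server")
        # are skipped even if their message mentions the job.
        if name == job_id:
            counts[message] += 1
    return counts
```

Calling it on the lines of /var/spool/pbs/server_logs/20100304 with the job id 188497.pbs01.pic.es would return a Counter keyed by message text.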
It seems that the stagein was not completely done and the job is out of control.
Usually those jobs go into W state, but now that does not seem to
happen; they remain in Q.
When I kill them by hand and restart Maui, things start working again.
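To spot the affected jobs before killing them by hand, something like the following could list every job that is in state Q but already has an exec_host. It is a rough sketch, not a tested tool: it assumes the "name = value" layout of the qstat -f output shown earlier, and the function name find_queued_with_exec_host is hypothetical.

```python
import re

# Matches attribute lines such as "job_state = Q"; wrapped continuation
# lines (e.g. inside Variable_List) have no " = " and are ignored.
ATTR_RE = re.compile(r"^\s*([A-Za-z_][\w.]*) = (.*)$")


def _is_stuck(attrs):
    """True if the job is queued but already has a node assigned."""
    return attrs.get("job_state") == "Q" and "exec_host" in attrs


def find_queued_with_exec_host(qstat_text):
    """Return (job_id, exec_host) pairs parsed from `qstat -f` style text."""
    stuck = []
    job_id, attrs = None, {}
    for line in qstat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            # A new job record starts: evaluate the previous one first.
            if job_id is not None and _is_stuck(attrs):
                stuck.append((job_id, attrs["exec_host"]))
            job_id = stripped.split(":", 1)[1].strip()
            attrs = {}
            continue
        match = ATTR_RE.match(line)
        if match:
            attrs[match.group(1)] = match.group(2).strip()
    # Evaluate the last record in the text.
    if job_id is not None and _is_stuck(attrs):
        stuck.append((job_id, attrs["exec_host"]))
    return stuck
```

The input text could be captured with subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout; what to do with the resulting job ids (for example deleting them and restarting Maui) is left to the administrator.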
# rpm -qa|grep torque
torque-2.3.0-snap.200801151629.2cri.slc4
torque-server-2.3.0-snap.200801151629.2cri.slc4
torque-client-2.3.0-snap.200801151629.2cri.slc4
Has anyone faced this before? Any clue about what's happening and how to solve
it?
TIA,
Arnau