[torqueusers] jobs queued with WN assigned
Ramon Bastiaans
ramon.bastiaans at sara.nl
Tue Mar 16 04:25:57 MDT 2010
I have seen this before. For us it happened when the scp of job
input files failed during the stagein phase.
Because this happens during stagein, a node is already assigned to the
job, but the job never successfully enters the Running state. The job is
then left in a corrupt state: Queued, but with an exec_host assigned. Any
later stagein attempt fails as well, because an scp on a new/other
node does not match the originally assigned host, so the stagein fails
on every subsequent try.
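
To spot jobs in this state, something along these lines could be run over
the output of "qstat -f". It is only a minimal sketch: the function name
find_stuck_jobs is made up for illustration, and it assumes the
"key = value" layout shown in the qstat output quoted below.

```python
import re

# Matches attribute lines such as "    job_state = Q" in `qstat -f` output.
ATTR_RE = re.compile(r"^\s*([A-Za-z_][\w.]*) = (.*)$")


def find_stuck_jobs(qstat_text):
    """Return (job_id, exec_host) for jobs that are Queued but already
    have an exec_host assigned (hypothetical helper, not part of TORQUE)."""
    jobs = []
    current = None
    for line in qstat_text.splitlines():
        if line.strip().startswith("Job Id:"):
            # A new job record starts here.
            current = {"Job Id": line.split(":", 1)[1].strip()}
            jobs.append(current)
            continue
        match = ATTR_RE.match(line)
        if current is not None and match:
            # Keep the first value seen for each attribute of this job.
            current.setdefault(match.group(1), match.group(2).strip())
    return [
        (job["Job Id"], job["exec_host"])
        for job in jobs
        if job.get("job_state") == "Q" and job.get("exec_host")
    ]
```

The text can come from a saved file or from
subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout
on a host where qstat is available.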
I would suggest going through the syslog on the mom nodes and looking for
pbs_mom LOG_ERROR entries such as these:
pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB xxxx@xxxx:/data/home/xx/xxx xxx' failed with status=1, giving up after 4 attempts
pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file xxx@xxx:/data/home/xx/xxx to xxx
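
If there are many nodes to check, a small filter like the one below may
help. It is a rough sketch, not a definitive tool: the function names are
invented for illustration, and the only thing it relies on is the
"pbs_mom: LOG_ERROR::sys_copy" / "LOG_ERROR::req_cpyfile" text shown above.
Where syslog lives on your nodes (for example /var/log/messages) is an
assumption you will need to adapt.

```python
import re
from collections import Counter

# The two pbs_mom error sources that show up when a stagein copy fails.
STAGEIN_ERROR_RE = re.compile(r"pbs_mom: LOG_ERROR::(sys_copy|req_cpyfile)\b")


def stagein_errors(lines):
    """Yield (function_name, line) for each pbs_mom copy error found."""
    for line in lines:
        match = STAGEIN_ERROR_RE.search(line)
        if match:
            yield match.group(1), line.rstrip("\n")


def count_stagein_errors(lines):
    """Count matching errors per reporting function (sys_copy, req_cpyfile)."""
    return Counter(name for name, _ in stagein_errors(lines))
```

An open file object can be passed directly as "lines", for example
count_stagein_errors(open("/var/log/messages")) if that is where your
syslog ends up. A node with a non-zero count is a good candidate for a
broken scp setup.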
Other than making sure your scp/stagein setup works for each host, I
know of no other solution. It usually means there is something wrong
with (the configuration of) one of your nodes.
Kind regards,
- Ramon.
On 03/04/2010 05:41 PM, Arnau Bria wrote:
> Hi all,
>
> I'm facing an old problem again.
>
> I have some jobs in Q status but with a wn already assigned:
>
> # qstat -f 188497
> Job Id: 188497.pbs01.pic.es
> Job_Name = STDIN
> Job_Owner = iatprd004 at ifaece01.pic.es
> job_state = Q
> queue = glong_sl5
> server = pbs01.pic.es
> Checkpoint = n
> ctime = Thu Mar 4 12:08:55 2010
> Error_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i1
> 8276/batch.err
> exec_host = td133.pic.es/6
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Thu Mar 4 17:21:45 2010
> Output_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i
> 18276/batch.out
> Priority = 0
> qtime = Thu Mar 4 12:08:55 2010
> Rerunable = False
> Resource_List.cput = 48:00:00
> Resource_List.neednodes = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 72:00:00
> Shell_Path_List = /bin/sh
> stagein = globus-cache-export.i18276.gpg at ifaece01.pic.es:/home/iatprd004/.
> lcgjm/globus-cache-export.i18276/globus-cache-export.i18276.gpg
> substate = 16
> Variable_List = PBS_O_HOME=/home/iatprd004,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=iatprd004,
> PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
> glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
> in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
> PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
> PBS_SERVER=ifaece01.pic.es,PBS_O_HOST=ifaece01.pic.es,
> PBS_O_WORKDIR=/home/iatprd004,
> X509_USER_PROXY=/home/iatprd004/.globus/job/ifaece01.pic.es/10553.126
> 7701248/x509_up,
> GLOBUS_REMOTE_IO_URL=/home/iatprd004/.lcgjm/.remote_io_ptr/remote_io_
> file-10553.1267701248,GLOBUS_LOCATION=/opt/globus,
> GLOBUS_GRAM_JOB_CONTACT=https://ifaece01.pic.es:20016/10553/126770124
> 8/,GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ifaece01.pic.es:20021/,
> SCRATCH_DIRECTORY=/home/iatprd004/,HOME=/home/iatprd004,
> LOGNAME=iatprd004,PANDA_JSID=Xavier-ES,
> GTAG=http://vobox02.pic.es/PIC-Production-Factory/logs//2010-03-04/if
> aece01.pic.es_2119_jobmanager-lcgpbs-glong_sl5/886439.0.out,
> FACTORYQUEUE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
> GLOBUS_CE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
> PBS_O_QUEUE=glong_sl5
> euser = iatprd004
> egroup = iatprd
> hashname = 188497.pbs0
> queue_rank = 187668
> queue_type = E
> etime = Thu Mar 4 12:08:55 2010
> start_time = Thu Mar 4 12:10:03 2010
> start_count = 1
>
>
> They are at the top of the maui queue, and it seems that maui is not able to
> schedule other jobs.
>
> Checkjob complains about input file:
>
> # checkjob 188497
>
>
> checking job 188497
>
> State: Idle
> Creds: user:iatprd004 group:iatprd class:glong_sl5 qos:ilhcatlas
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Thu Mar 4 12:08:55
> (Time Queued Total: 5:16:52 Eligible: 5:11:01)
>
> StartDate: 00:00:01 Thu Mar 4 17:25:48
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 0
> Opsys: [NONE] Arch: [NONE] Features: [slc5_x64]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 157
> PartitionMask: [ALL]
> Reservation '188497' (00:00:01 -> 3:00:00:01 Duration: 3:00:00:00)
> Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match input file stagein location'
> PE: 1.00 StartPriority: 463
> cannot select job 188497 for partition DEFAULT (startdate in '00:00:01')
>
>
> grepping client logs:
> # grep 188497 /var/spool/pbs/mom_logs/20100304
> 03/04/2010 12:19:47;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td133.pic.es MSG=modify job failed, unknown job 188497.pbs01.pic.es), aux=0, type=ModifyJob, from PBS_Server at pbs01.pic.es
>
> and grepping server logs:
>
> # grep 188497 /var/spool/pbs/server_logs/20100304
> 03/04/2010 12:08:55;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
> 03/04/2010 12:08:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Queued at request of iatprd004 at ifaece01.pic.es, owner = iatprd004 at ifaece01.pic.es, job name = STDIN, queue = glong_sl5
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Run at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;MOM rejected modify request, error: 15001
> 03/04/2010 12:10:09;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 188497.pbs01.pic.es
> 03/04/2010 12:10:09;0008;PBS_Server;Job;188497.pbs01.pic.es;unable to run job, MOM rejected/rc=1
> 03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:01:57;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
> 03/04/2010 16:01:57;0086;PBS_Server;Job;188497.pbs01.pic.es;Requeueing job, substate: 16 Requeued in queue: glong_sl5
> 03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
>
>
>
> It seems that the stagein was not completely done and the job is out of control.
> Usually those jobs go into W status, but now this does not seem to
> happen; they remain in Q.
>
> When I kill them by hand and restart maui, things start working again.
>
> # rpm -qa|grep torque
> torque-2.3.0-snap.200801151629.2cri.slc4
> torque-server-2.3.0-snap.200801151629.2cri.slc4
> torque-client-2.3.0-snap.200801151629.2cri.slc4
>
>
> Has anyone faced this before? Any clue about what's happening and how to
> solve it?
>
> TIA,
> Arnau
--
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V
SARA - Computing& Networking Services
Science Park 121 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167