[torqueusers] jobs queued with WN assigned

Ramon Bastiaans ramon.bastiaans at sara.nl
Tue Mar 16 04:25:57 MDT 2010


I have seen this before. For us it happened when the stagein/scp of the job 
input files failed during the stagein phase.

Because this happens during stagein, a node has already been assigned to the 
job, but the job never successfully enters the Running state. The job is then 
left in a corrupt state: Queued, but with an exec_host assigned. Any future 
scp/stagein attempt will then fail, because the scp runs from a new/other 
node that no longer matches the originally assigned host, so the stagein keeps 
failing on every consecutive retry.
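
A quick way to spot jobs stuck in this state is to filter the full qstat 
output for jobs that are still Queued but already carry an exec_host. This is 
just a sketch (it assumes a standard Torque qstat and awk on the server, and 
relies on the same qstat -f field layout shown in Arnau's output below):

qstat -f | awk '
    /^Job Id:/     { id = $3; state = ""; host = "" }  # start of a job record
    /job_state =/  { state = $3 }
    /exec_host =/  { host = $3 }
    /^$/           { if (state == "Q" && host != "") print id, host; state = "" }
    END            { if (state == "Q" && host != "") print id, host }
'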

I would suggest going through the syslogs on the mom nodes for any 
pbs_mom LOG_ERROR entries such as these:

pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB xxxx@xxxx:/data/home/xx/xxx xxx' failed with status=1, giving up after 4 attempts
pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file xxx@xxx:/data/home/xx/xxx to xxx
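
Depending on how your pbs_mom logging is set up, these may end up in syslog 
or in the mom_logs directory. A rough way to scan for them (the path below 
matches the one Arnau greps further down; adjust to your install):

grep -E 'sys_copy|req_cpyfile' /var/spool/pbs/mom_logs/2010*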

Other than making sure your scp/stagein setup works for each host, I 
know of no other solution. It usually means there is something wrong 
with (the configuration of) one of your nodes.
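
It can also help to reproduce by hand what pbs_mom does during stagein, from 
the worker node and as the job owner. A minimal sketch, using the user and 
hosts from Arnau's job and a hypothetical test file (substitute your own):

# on the worker node (e.g. td133), as root:
su - iatprd004 -c \
    'scp -rpB ifaece01.pic.es:/home/iatprd004/.lcgjm/testfile /tmp/'
# -B (batch mode) makes scp fail instead of prompting for a password,
# which is what pbs_mom needs; if this command prompts or fails, the
# stagein will fail as well.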


Kind regards,
- Ramon.

On 03/04/2010 05:41 PM, Arnau Bria wrote:
> Hi all,
>
> I'm facing an old problem again.
>
> I have some jobs in Q status but with a wn already assigned:
>
> # qstat -f 188497
> Job Id: 188497.pbs01.pic.es
>      Job_Name = STDIN
>      Job_Owner = iatprd004 at ifaece01.pic.es
>      job_state = Q
>      queue = glong_sl5
>      server = pbs01.pic.es
>      Checkpoint = n
>      ctime = Thu Mar  4 12:08:55 2010
>      Error_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i1
> 	8276/batch.err
>      exec_host = td133.pic.es/6
>      Hold_Types = n
>      Join_Path = n
>      Keep_Files = n
>      Mail_Points = n
>      mtime = Thu Mar  4 17:21:45 2010
>      Output_Path = ifaece01.pic.es:/home/iatprd004/.lcgjm/globus-cache-export.i
> 	18276/batch.out
>      Priority = 0
>      qtime = Thu Mar  4 12:08:55 2010
>      Rerunable = False
>      Resource_List.cput = 48:00:00
>      Resource_List.neednodes = 1
>      Resource_List.nodect = 1
>      Resource_List.nodes = 1
>      Resource_List.walltime = 72:00:00
>      Shell_Path_List = /bin/sh
>      stagein = globus-cache-export.i18276.gpg at ifaece01.pic.es:/home/iatprd004/.
> 	lcgjm/globus-cache-export.i18276/globus-cache-export.i18276.gpg
>      substate = 16
>      Variable_List = PBS_O_HOME=/home/iatprd004,PBS_O_LANG=en_US.UTF-8,
> 	PBS_O_LOGNAME=iatprd004,
> 	PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
> 	glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
> 	in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
> 	PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
> 	PBS_SERVER=ifaece01.pic.es,PBS_O_HOST=ifaece01.pic.es,
> 	PBS_O_WORKDIR=/home/iatprd004,
> 	X509_USER_PROXY=/home/iatprd004/.globus/job/ifaece01.pic.es/10553.126
> 	7701248/x509_up,
> 	GLOBUS_REMOTE_IO_URL=/home/iatprd004/.lcgjm/.remote_io_ptr/remote_io_
> 	file-10553.1267701248,GLOBUS_LOCATION=/opt/globus,
> 	GLOBUS_GRAM_JOB_CONTACT=https://ifaece01.pic.es:20016/10553/126770124
> 	8/,GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ifaece01.pic.es:20021/,
> 	SCRATCH_DIRECTORY=/home/iatprd004/,HOME=/home/iatprd004,
> 	LOGNAME=iatprd004,PANDA_JSID=Xavier-ES,
> 	GTAG=http://vobox02.pic.es/PIC-Production-Factory/logs//2010-03-04/if
> 	aece01.pic.es_2119_jobmanager-lcgpbs-glong_sl5/886439.0.out,
> 	FACTORYQUEUE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
> 	GLOBUS_CE=ifaece01.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
> 	PBS_O_QUEUE=glong_sl5
>      euser = iatprd004
>      egroup = iatprd
>      hashname = 188497.pbs0
>      queue_rank = 187668
>      queue_type = E
>      etime = Thu Mar  4 12:08:55 2010
>      start_time = Thu Mar  4 12:10:03 2010
>      start_count = 1
>
>
> They are at the top of the maui queue, and it seems that maui is not able
> to schedule other jobs.
>
> Checkjob complains about the input file:
>
> # checkjob 188497
>
>
> checking job 188497
>
> State: Idle
> Creds:  user:iatprd004  group:iatprd  class:glong_sl5  qos:ilhcatlas
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Thu Mar  4 12:08:55
>    (Time Queued  Total: 5:16:52  Eligible: 5:11:01)
>
> StartDate: 00:00:01  Thu Mar  4 17:25:48
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory>= 0  Disk>= 0  Swap>= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [slc5_x64]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 157
> PartitionMask: [ALL]
> Reservation '188497' (00:00:01 ->  3:00:00:01  Duration: 3:00:00:00)
> Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match input file stagein location'
> PE:  1.00  StartPriority:  463
> cannot select job 188497 for partition DEFAULT (startdate in '00:00:01')
>
>
> grepping client logs:
> # grep 188497 /var/spool/pbs/mom_logs/20100304
> 03/04/2010 12:19:47;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td133.pic.es MSG=modify job failed, unknown job 188497.pbs01.pic.es), aux=0, type=ModifyJob, from PBS_Server at pbs01.pic.es
>
> and grepping server logs:
>
> # grep 188497 /var/spool/pbs/server_logs/20100304
> 03/04/2010 12:08:55;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
> 03/04/2010 12:08:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Queued at request of iatprd004 at ifaece01.pic.es, owner = iatprd004 at ifaece01.pic.es, job name = STDIN, queue = glong_sl5
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Run at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 12:10:03;0008;PBS_Server;Job;188497.pbs01.pic.es;MOM rejected modify request, error: 15001
> 03/04/2010 12:10:09;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 188497.pbs01.pic.es
> 03/04/2010 12:10:09;0008;PBS_Server;Job;188497.pbs01.pic.es;unable to run job, MOM rejected/rc=1
> 03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:37:39;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:39:40;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:41:42;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:43:53;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:45:54;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:47:55;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:49:56;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:51:57;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:55:59;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 15:58:05;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:00:06;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:01:57;0100;PBS_Server;Job;188497.pbs01.pic.es;enqueuing into glong_sl5, state 1 hop 1
> 03/04/2010 16:01:57;0086;PBS_Server;Job;188497.pbs01.pic.es;Requeueing job, substate: 16 Requeued in queue: glong_sl5
> 03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
> 03/04/2010 16:02:02;0008;PBS_Server;Job;188497.pbs01.pic.es;Job Modified at request of root at pbs01.pic.es
>
>
>
> It seems that the stagein was not completely done and the job is out of
> control. Usually those jobs end up in W status, but now that does not seem
> to happen; they remain in Q.
>
> When I kill them by hand and restart maui, things start working again.
>
> # rpm -qa|grep torque
> torque-2.3.0-snap.200801151629.2cri.slc4
> torque-server-2.3.0-snap.200801151629.2cri.slc4
> torque-client-2.3.0-snap.200801151629.2cri.slc4
>
>
> Has anyone faced this before? Any clue on what's happening and how to
> solve it?
>
> TIA,
> Arnau
>    
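
For the manual cleanup you describe (killing the job by hand and restarting 
maui), something along these lines should do it. Just a sketch: qdel's purge 
option removes a job that no mom will accept anymore, and the maui init 
script name may differ on your installation:

qdel -p 188497            # purge the stuck job from the server
/etc/init.d/maui restart  # or however maui is (re)started on your system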


-- 
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V

SARA - Computing&  Networking Services
Science Park 121     PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167

