[torqueusers] node bad state

Garrick Staples garrick at usc.edu
Wed Nov 30 03:43:56 MST 2005


On Wed, Nov 30, 2005 at 05:18:46AM -0500, Ghislain ESCORNE alleged:
> Hello,
> I have a problem when I try to submit many jobs which need to run on 
> more than one node.

What exactly is the problem?  These emails have had a wealth of
information, but I'm having trouble grasping the actual observed
problem.


> ---------------------------script--------------------------------------
> #PBS -l nodes=2:ppn=2,walltime=00:05:00
> ###PBS -m abe
> # This works
> echo `cat $PBS_NODEFILE`
> # This does not work
> echo test : $PBS_NODEFILE
> echo $PBS_O_WORKDIR
> #echo $PBS_WORKDIR
> echo $PBS_JOBID
> 
> ----------------------------------------------------------------
> [root at rock-lgit server_logs]# pbsnodes -a
> compute-0-0.local
>     state = free
>     np = 2
>     ntype = cluster
>     status = opsys=linux,uname=Linux compute-0-0.local 2.6.9-5.0.5.ELsmp #1 SMP Wed Apr 20 00:16:40 BST 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=498839,totmem=8250696kb,availmem=8090004kb,physmem=4154132kb,ncpus=4,loadave=0.00,netload=4095556878,state=free,jobs=? 0,rectime=1133345212
> 
> compute-0-1.local
>     state = free
>     np = 2
>     ntype = cluster
>     status = opsys=linux,uname=Linux compute-0-1.local 2.6.9-5.0.5.ELsmp #1 SMP Wed Apr 20 00:16:40 BST 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=61132,totmem=8250696kb,availmem=8136580kb,physmem=4154132kb,ncpus=4,loadave=0.04,netload=4168023640,state=free,jobs=? 0,rectime=1133345189
> 
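
Since the node listing is long, a condensed view can make it easier to
see at a glance whether any node is in an unexpected state.  A rough
sketch, assuming the layout shown above (node name on an unindented
line, attributes indented as "name = value"); node_states is a
hypothetical helper name, not a TORQUE command:

```shell
# Read "pbsnodes -a" style text on stdin and print "<node> <state>" per node.
# Assumes node names start in column one and attribute lines are indented.
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                        # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 }  # indented "state = ..." line
    '
}
```

It would be used as "pbsnodes -a | node_states".  The status line also
contains "state=free", but it is skipped because its first field is
"status".
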
> ------------------------------------------ log pbs_server-------------------------------------
> 11/30/2005 11:06:46;0009;PBS_Server;Job;7.rock-lgit.obs.ujf-grenoble.fr;obit received for job 7.rock-lgit.obs.ujf-grenoble.fr from host compute-0-1.local with bad state (state: QUEUED)
> 11/30/2005 11:06:46;0080;PBS_Server;Req;req_reject;Reject reply code=15016(Request invalid for state of job), aux=0, type=JobObituary, from pbs_mom at compute-0-1.local
> 11/30/2005 11:06:46;0008;PBS_Server;Job;7.rock-lgit.obs.ujf-grenoble.fr;MOM rejected modify request, error: 15001
> 11/30/2005 11:06:46;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ModifyJob, from root at rock-lgit.obs.ujf-grenoble.fr
> ------------------------------------------------------------
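
To follow a single job through the server log, it may help to pull out
only the records filed under that job.  A minimal sketch, assuming the
semicolon-separated layout seen in the excerpt above (timestamp; event
code; server; object type; object name; message); job_log_lines is a
hypothetical helper name and the log file path is whatever your site
uses:

```shell
# Print the "Job" records for one job id from a pbs_server log file.
# Assumed field layout: timestamp;event code;server;object type;object name;message
job_log_lines() {
    jobid=$1      # full job id, e.g. 7.rock-lgit.obs.ujf-grenoble.fr
    logfile=$2    # path to a server log file (site-specific)
    awk -F';' -v id="$jobid" '$4 == "Job" && $5 == id' "$logfile"
}
```

Note that this only matches records whose object type is "Job"; the
req_reject lines are logged as "Req" records and do not carry the job
id in that field, so they would have to be correlated by timestamp.
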
> 
> [root at rock-lgit server_logs]# checkjob 7
> 
> 
> checking job 7
> 
> State: Running
> Creds:  user:gescorne  group:1110  class:short_mpi  qos:DEFAULT
> WallTime: 00:00:00 of 00:05:00
> SubmitTime: Wed Nov 30 11:05:27
>  (Time Queued  Total: 00:05:51  Eligible: 00:05:51)
> 
> StartTime: Wed Nov 30 11:11:18
> Total Tasks: 4
> 
> Req[0]  TaskCount: 4  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Allocated Nodes:
> [compute-0-1.local:2][compute-0-0.local:2]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 351
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '7' (00:00:00 -> 00:05:00  Duration: 00:05:00)
> PE:  4.00  StartPriority:  5
> 

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California