[torqueusers] node bad state
Garrick Staples
garrick at usc.edu
Wed Nov 30 03:43:56 MST 2005
On Wed, Nov 30, 2005 at 05:18:46AM -0500, Ghislain ESCORNE alleged:
> Hello,
> I have a problem when I try to submit many jobs which need to run on
> more than one node.
What exactly is the problem? These emails have had a wealth of
information, but I'm having trouble grasping the actual observed
problem.
> ---------------------------script--------------------------------------
> #PBS -l nodes=2:ppn=2,walltime=00:05:00
> ###PBS -m abe
> # This works
> echo `cat $PBS_NODEFILE`
> # This does not work
> echo test : $PBS_NODEFILE
> echo $PBS_O_WORKDIR
> #echo $PBS_WORKDIR
> echo $PBS_JOBID
>
> ----------------------------------------------------------------
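A test job along these lines might make the symptom easier to pin down.
This is only a sketch: summarize_nodefile is a hypothetical helper, not
part of TORQUE, and I am assuming that with nodes=2:ppn=2 the node file
lists each host once per allocated processor. It prints one line per
host with its slot count, and complains on stderr if the node file is
missing, so the job output shows exactly what the job was given.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2,walltime=00:05:00

# Hypothetical helper (not part of TORQUE): print "host slots" for each
# distinct host listed in the node file passed as $1.
summarize_nodefile() {
    if [ -z "$1" ] || [ ! -r "$1" ]; then
        echo "node file missing or unreadable: ${1:-<unset>}" >&2
        return 1
    fi
    # One input line per allocated slot; count the repeats per host.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Only report when running inside a job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "job id:   ${PBS_JOBID:-<unset>}"
    echo "work dir: ${PBS_O_WORKDIR:-<unset>}"
    echo "nodefile: $PBS_NODEFILE"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If the job really gets both nodes, the output should have one line for
each compute node; anything else narrows down where things go wrong.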
> [root at rock-lgit server_logs]# pbsnodes -a
> compute-0-0.local
> state = free
> np = 2
> ntype = cluster
> status = opsys=linux,uname=Linux compute-0-0.local 2.6.9-5.0.5.ELsmp #1 SMP Wed Apr 20 00:16:40 BST 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=498839,totmem=8250696kb,availmem=8090004kb,physmem=4154132kb,ncpus=4,loadave=0.00,netload=4095556878,state=free,jobs=? 0,rectime=1133345212
>
> compute-0-1.local
> state = free
> np = 2
> ntype = cluster
> status = opsys=linux,uname=Linux compute-0-1.local 2.6.9-5.0.5.ELsmp #1 SMP Wed Apr 20 00:16:40 BST 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=61132,totmem=8250696kb,availmem=8136580kb,physmem=4154132kb,ncpus=4,loadave=0.04,netload=4168023640,state=free,jobs=? 0,rectime=1133345189
>
> ---------------------------log pbs_server------------------------------
> 11/30/2005 11:06:46;0009;PBS_Server;Job;7.rock-lgit.obs.ujf-grenoble.fr;obit received for job 7.rock-lgit.obs.ujf-grenoble.fr from host compute-0-1.local with bad state (state: QUEUED)
> 11/30/2005 11:06:46;0080;PBS_Server;Req;req_reject;Reject reply code=15016(Request invalid for state of job), aux=0, type=JobObituary, from pbs_mom at compute-0-1.local
> 11/30/2005 11:06:46;0008;PBS_Server;Job;7.rock-lgit.obs.ujf-grenoble.fr;MOM rejected modify request, error: 15001
> 11/30/2005 11:06:46;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ModifyJob, from root at rock-lgit.obs.ujf-grenoble.fr
> ------------------------------------------------------------
>
> [root at rock-lgit server_logs]# checkjob 7
>
>
> checking job 7
>
> State: Running
> Creds: user:gescorne group:1110 class:short_mpi qos:DEFAULT
> WallTime: 00:00:00 of 00:05:00
> SubmitTime: Wed Nov 30 11:05:27
> (Time Queued Total: 00:05:51 Eligible: 00:05:51)
>
> StartTime: Wed Nov 30 11:11:18
> Total Tasks: 4
>
> Req[0] TaskCount: 4 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Allocated Nodes:
> [compute-0-1.local:2][compute-0-0.local:2]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 351
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '7' (00:00:00 -> 00:05:00 Duration: 00:05:00)
> PE: 4.00 StartPriority: 5
>
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California