[torqueusers] No such process found

Alexander Fitterling alexander.fitterling at rrz.uni-hamburg.de
Thu Jul 8 09:21:14 MDT 2010


Dear torque users,

I searched the archive for a similar case but could not find one, at
least not in the last 10 months.
I am puzzled by Torque's behavior (version 2.3.6) when running job
arrays. Running large arrays, e.g. qsub -t1-200 batchfile, produces an
error (see below) in some of the subjobs. For a single subjob the error
is only occasional, but for the array job as a whole it is
reproducible: every such array job shows it in a larger or smaller
number of its subjobs.

Below is some output from one such array job. Each subjob simply runs
one command (e.g. sleep XXm), then "hostname", and prints
$PBS_ARRAYID at the end:

............
node84
123
node50
124
32310: No such process
32312: No such process
32315: No such process
32316: No such process
32318: No such process
32320: No such process
32372: No such process
32375: No such process
32383: No such process
32387: No such process
node50
125
32310: No such process
32312: No such process
32372: No such process
node50
...............
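
For reference, a batch file of the kind described above might look
roughly like the sketch below. The real script is not shown in this
message, so the #PBS directives, the SLEEP_SECONDS placeholder (standing
in for the unspecified "sleep XXm") and the "unset" fallback are
assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical batch file for: qsub -t1-200 batchfile
#PBS -q serial
#PBS -l nodes=1:ppn=1

# Placeholder workload; the real duration ("XXm") is not given in the post.
SLEEP_SECONDS="${SLEEP_SECONDS:-1}"
sleep "$SLEEP_SECONDS"

# Print the execution host, then the array index of this subjob.
hostname
# PBS_ARRAYID is set by Torque for array subjobs; fall back to "unset"
# so the script can also be run by hand outside the batch system.
array_id="${PBS_ARRAYID:-unset}"
echo "$array_id"
```

Run by hand, the script prints the local hostname followed by "unset";
under Torque each subjob would print its own array index instead.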

We use 8 cores per machine. A minimal view of the setup of my serial
queue (where jobs should run serially on one core) is as follows:

master:~ # qstat -Qf serial
Queue: serial
     queue_type = Execution
     max_queuable = 20
     total_jobs = 7
     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:7 Exiting:0
     resources_max.nodect = 4
     resources_max.walltime = 01:12:00
     resources_min.nodect = 1
     mtime = 1278436057
     resources_assigned.nodect = 28
     max_user_run = 4
     enabled = True

I wonder whether anyone else has experienced this behavior and can tell
me something about it. I am currently checking different nodes to see
whether the errors depend on the node, e.g. whether they occur only on
the same nodes. Unfortunately, I have not finished that check yet.
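
One way to do this per-node check is to tally the error lines by host
from the collected job output. The sketch below is only an illustration:
it assumes that a subjob's "No such process" lines appear before its
hostname line (as the excerpt above suggests) and that host names match
node<number>. The function name and the abbreviated sample are made up
for the example.

```shell
#!/bin/bash
# count_errors_per_node: read concatenated job output on stdin and print
# "<node> <count of 'No such process' lines>" for each affected node.
count_errors_per_node() {
    awk '
        # Error line: remember it until the next hostname line appears.
        /: No such process$/ { pending++; next }
        # Hostname line: attribute the pending errors to this node.
        /^node[0-9]+$/ { if (pending) errors[$0] += pending; pending = 0 }
        END { for (n in errors) print n, errors[n] }
    ' | sort
}

# Abbreviated sample in the same layout as the excerpt above.
sample='node84
123
32310: No such process
32312: No such process
32372: No such process
node50
125'

printf '%s\n' "$sample" | count_errors_per_node
```

If the counts concentrate on a few hosts, that would point to a
node-specific cause rather than to the array mechanism itself.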

Using tracejob I found out that some jobs exit abnormally:

..... Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:01 session_id=2698 ....

Jobs that exit with status 0 differ in their memory usage, as the line
  .... resources_used.vmem=28324kb ... shows.
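
To compare such records more easily, the fields can be pulled out of the
tracejob line with a small helper. The sketch below is hedged: the
helper names are hypothetical, and the decoding assumes the commonly
described Torque convention that an Exit_status above 256 means the job
was ended by a signal (status minus 256). I have not verified that
convention for version 2.3.6.

```shell
#!/bin/bash
# Accounting fragment taken from the tracejob output quoted above.
line='Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:01 session_id=2698'

# get_field NAME RECORD: print the value of NAME=value from a
# space-separated record; return 1 if the field is absent.
get_field() {
    local name="$1" record="$2" token
    for token in $record; do
        case "$token" in
            "$name"=*) printf '%s\n' "${token#*=}"; return 0 ;;
        esac
    done
    return 1
}

# decode_exit STATUS: ASSUMED convention, values above 256 are read as
# 256 + signal number; anything else is treated as a plain exit code.
decode_exit() {
    local status="$1"
    if [ "$status" -gt 256 ]; then
        echo "signal $((status - 256))"
    else
        echo "exit code $status"
    fi
}

status="$(get_field Exit_status "$line")"
vmem="$(get_field resources_used.vmem "$line")"
echo "Exit_status=$status ($(decode_exit "$status")), vmem=$vmem"
```

Under that assumption, 265 would correspond to signal 9 (SIGKILL), which
would fit a job that ends after about one second with no memory usage
recorded.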


I would really appreciate any help or comments.

Sincerely,
Alexander Fitterling



--
High Performance - / Scientific Computing (HPSC)
Gruppe Zentrale Dienste (ZD)
Regionales Rechenzentrum der Universität Hamburg (RRZ)
Schlüterstr. 70
20146 Hamburg

E-Mail: alexander.fitterling at rrz.uni-hamburg.de
Tel.  : +49-(0)40-42838-6355
Fax   : 040-42838-6270

http://www.rrz.uni-hamburg.de/serversysteme/unix-server/hpc.html

