[torqueusers] No such process found

Alexander Fitterling alexander.fitterling at rrz.uni-hamburg.de
Tue Jul 13 02:49:49 MDT 2010


It sounds weird, but the termination of one sub-job kills the other sub-jobs
remaining on the same node(s). Has anyone found a workaround for this?

Alexander Fitterling



Quoting Alexander Fitterling <alexander.fitterling at rrz.uni-hamburg.de>:

> Dear torque users,
>
> I searched the archive for a similar case but couldn't find one, at
> least not within the last 10 months...
> I am puzzled by Torque's behaviour (version 2.3.6) when running job
> arrays. Submitting large arrays, e.g. qsub -t1-200 batchfile,
> occasionally produces an error in the sub-jobs (see below), and it is
> reproducible if the array job is regarded as one big job: in every
> such array job the described error shows up to some extent.
>
> Here is some output from one job that simply runs a single command
> (e.g. sleep XXm), then "hostname", and finally prints $PBS_ARRAYID
> (a sketch of such a batch script follows the output below):
>
> ............
> node84
> 123
> node50
> 124
> 32310: No such process
> 32312: No such process
> 32315: No such process
> 32316: No such process
> 32318: No such process
> 32320: No such process
> 32372: No such process
> 32375: No such process
> 32383: No such process
> 32387: No such process
> node50
> 125
> 32310: No such process
> 32312: No such process
> 32372: No such process
> node50
> ...............
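>
> For reference, a minimal sketch of such a batch script; the #PBS
> directives and the sleep duration are placeholders, not our exact
> settings:
>
>   #!/bin/bash
>   #PBS -q serial               # our serial queue, see qstat -Qf below
>   #PBS -l nodes=1:ppn=1        # each sub-job should use a single core
>   #PBS -l walltime=01:00:00
>
>   # keep the sub-job busy for a while, then report where it ran
>   sleep 30m
>   hostname
>   echo $PBS_ARRAYID
>
> submitted as an array with
>
>   qsub -t1-200 batchfile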
>
> We use 8 cores per machine. A minimal setup of my serial queue (where
> jobs should run serially, one per core) is as follows:
>
> master:~ # qstat -Qf serial
> Queue: serial
>      queue_type = Execution
>      max_queuable = 20
>      total_jobs = 7
>      state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:7 Exiting:0
>      resources_max.nodect = 4
>      resources_max.walltime = 01:12:00
>      resources_min.nodect = 1
>      mtime = 1278436057
>      resources_assigned.nodect = 28
>      max_user_run = 4
>      enabled = True
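>
> For completeness, the same attributes could be set via qmgr, roughly
> like this (untested, typed from memory):
>
>   qmgr -c "create queue serial queue_type=execution"
>   qmgr -c "set queue serial max_queuable = 20"
>   qmgr -c "set queue serial resources_max.nodect = 4"
>   qmgr -c "set queue serial resources_max.walltime = 01:12:00"
>   qmgr -c "set queue serial resources_min.nodect = 1"
>   qmgr -c "set queue serial max_user_run = 4"
>   qmgr -c "set queue serial enabled = True"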
>
> I wonder if anyone else has experienced such behaviour and can tell
> me something about it. I am currently checking different nodes to see
> whether the problem depends on particular nodes, e.g. whether the
> errors occur only on the same nodes... unfortunately, I have not
> completed that check yet.
>
> Using tracejob I found that some jobs exit abnormally:
>
> ..... Exit_status=265 resources_used.cput=00:00:00
> resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:01 session_id=2698 ....
>
> Jobs exiting with status 0, by contrast, show a different memory
> usage, as the line
>   .... resources_used.vmem=28324kb ...
> shows.
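>
> If I understand the Torque exit codes correctly, an Exit_status above
> 256 means the job was killed by a signal, namely (Exit_status - 256);
> 265 would therefore be signal 9 (SIGKILL). In shell terms:
>
>   status=265
>   if [ $status -gt 256 ]; then
>       # 265 - 256 = 9, i.e. SIGKILL
>       echo "killed by signal $((status - 256))"
>   fi
>
> That would fit the picture of sub-jobs being killed from outside
> rather than exiting on their own.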
>
>
> I would really appreciate any help or comments.
>
> Sincerely,
> Alexander Fitterling
>
>
>
> --
> High Performance - / Scientific Computing (HPSC)
> Gruppe Zentrale Dienste (ZD)
> Regionales Rechenzentrum der Universität Hamburg (RRZ)
> Schlüterstr. 70
> 20146 Hamburg
>
> E-Mail: alexander.fitterling at rrz.uni-hamburg.de
> Tel.  : +49-(0)40-42838-6355
> Fax   : 040-42838-6270
>
> http://www.rrz.uni-hamburg.de/serversysteme/unix-server/hpc.html
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



--
High Performance - / Scientific Computing (HPSC)
Gruppe Zentrale Dienste (ZD)
Regionales Rechenzentrum der Universität Hamburg (RRZ)
Schlüterstr. 70
20146 Hamburg

E-Mail: alexander.fitterling at rrz.uni-hamburg.de
Tel.  : +49-(0)40-42838-6355
Fax   : 040-42838-6270

http://www.rrz.uni-hamburg.de/serversysteme/unix-server/hpc.html

