[Mauiusers] jobs assigned to same cores in the same node?

Fernando Caba fcaba at uns.edu.ar
Tue Nov 8 15:50:59 MST 2011


Dear Hector, thanks for your prompt reply.
The problem is that while the cluster was busy with several jobs, a
user launched a calculation in one job and about 5 hours later launched
another. Both jobs ended up on the same node and the same cores.

Part of the qstat output:

[root at fe ~]# qstat -f 477
Job Id: 477.fe
     Job_Name = job_gr_PBE
     Job_Owner = matias at fe
     job_state = Q
     queue = batch
     server = fe
     Checkpoint = u
     ctime = Tue Nov  8 11:58:16 2011
     Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e477
     exec_host = n10/3+n10/2+n10/1+n10/0
     exec_port = 15003+15003+15003+15003

[root at fe ~]# qstat -f 480
Job Id: 480.fe
     Job_Name = job_gr_PBE
     Job_Owner = matias at fe
     job_state = Q
     queue = batch
     server = fe
     Checkpoint = u
     ctime = Tue Nov  8 17:26:09 2011
     Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e480
     exec_host = n10/3+n10/2+n10/1+n10/0
     exec_port = 15003+15003+15003+15003
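
To make the overlap explicit, here is a minimal sketch that compares the two exec_host values above. The helper name parse_exec_host is hypothetical, not a Torque or Maui API; it only assumes the node/core+node/core layout seen in the qstat output.

```python
# Hypothetical helper (not part of Torque/Maui): split an exec_host
# string such as "n10/3+n10/2" into a set of (node, core) pairs.
def parse_exec_host(value):
    slots = set()
    for item in value.split("+"):
        node, core = item.split("/")
        slots.add((node, int(core)))
    return slots

# exec_host values copied from the qstat -f output for jobs 477 and 480.
job_477 = parse_exec_host("n10/3+n10/2+n10/1+n10/0")
job_480 = parse_exec_host("n10/3+n10/2+n10/1+n10/0")

# Slots that appear in both jobs.
shared = sorted(job_477 & job_480)
print(shared)
```

For these two values the intersection contains all four slots, n10 cores 0 to 3.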

This is what the tracejob command gives me for both jobs:
[root at fe ~]# tracejob 480
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory

Job: 480.fe

11/08/2011 17:26:09  S    enqueuing into batch, state 1 hop 1
11/08/2011 17:26:09  S    Job Queued at request of matias at fe, owner = matias at fe,
                           job name = job_gr_PBE, queue = batch
11/08/2011 17:26:09  A    queue=batch
11/08/2011 17:26:10  S    Job Run at request of root at fe
11/08/2011 17:26:12  S    unable to run job, MOM rejected/rc=2
11/08/2011 18:26:36  S    Job Run at request of root at fe
11/08/2011 18:26:38  S    unable to run job, MOM rejected/rc=2
11/08/2011 19:26:45  S    Job Run at request of root at fe
11/08/2011 19:26:45  S    Not sending email: User does not want mail of this
                           type.
11/08/2011 19:26:45  A    user=matias group=matias jobname=job_gr_PBE
                           queue=batch ctime=1320783969 qtime=1320783969
                           etime=1320783969 start=1320791205 owner=matias at fe
                           exec_host=n11/7+n11/6+n11/5+n11/4
                           Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                           Resource_List.nodes=1:ppn=4
                           Resource_List.walltime=2400:00:00
11/08/2011 19:26:53  S    Not sending email: User does not want mail of this
                           type.
11/08/2011 19:26:53  S    Exit_status=0 resources_used.cput=00:00:27
                           resources_used.mem=0kb resources_used.vmem=0kb
                           resources_used.walltime=00:00:09
11/08/2011 19:26:53  A    user=matias group=matias jobname=job_gr_PBE
                           queue=batch ctime=1320783969 qtime=1320783969
                           etime=1320783969 start=1320791205 owner=matias at fe
                           exec_host=n11/7+n11/6+n11/5+n11/4
                           Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                           Resource_List.nodes=1:ppn=4
                           Resource_List.walltime=2400:00:00 session=8035
                           end=1320791213 Exit_status=0
                           resources_used.cput=00:00:27 resources_used.mem=0kb
                           resources_used.vmem=0kb
                           resources_used.walltime=00:00:09
11/08/2011 19:31:53  S    dequeuing from batch, state COMPLETE
[root at fe ~]#
[root at fe ~]# tracejob 477
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory

Job: 477.fe

11/08/2011 11:58:16  S    enqueuing into batch, state 1 hop 1
11/08/2011 11:58:16  S    Job Queued at request of matias at fe, owner = matias at fe,
                           job name = job_gr_PBE, queue = batch
11/08/2011 11:58:16  A    queue=batch
11/08/2011 11:58:17  S    Job Run at request of root at fe
11/08/2011 11:58:19  S    unable to run job, MOM rejected/rc=2
11/08/2011 12:58:34  S    Job Run at request of root at fe
11/08/2011 12:58:36  S    unable to run job, MOM rejected/rc=2
11/08/2011 13:58:37  S    Job Run at request of root at fe
11/08/2011 13:58:39  S    unable to run job, MOM rejected/rc=2
11/08/2011 14:58:43  S    Job Run at request of root at fe
11/08/2011 14:58:45  S    unable to run job, MOM rejected/rc=2
11/08/2011 15:59:09  S    Job Run at request of root at fe
11/08/2011 15:59:11  S    unable to run job, MOM rejected/rc=2
11/08/2011 16:59:30  S    Job Run at request of root at fe
11/08/2011 16:59:32  S    unable to run job, MOM rejected/rc=2
11/08/2011 17:59:50  S    Job Run at request of root at fe
11/08/2011 17:59:52  S    unable to run job, MOM rejected/rc=2
11/08/2011 19:00:02  S    Job Run at request of root at fe
11/08/2011 19:00:02  S    Not sending email: User does not want mail of this
                           type.
11/08/2011 19:00:02  A    user=matias group=matias jobname=job_gr_PBE
                           queue=batch ctime=1320764296 qtime=1320764296
                           etime=1320764296 start=1320789602 owner=matias at fe
                           exec_host=n11/7+n11/6+n11/5+n11/4
                           Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                           Resource_List.nodes=1:ppn=4
                           Resource_List.walltime=2400:00:00
11/08/2011 19:00:10  S    Not sending email: User does not want mail of this
                           type.
11/08/2011 19:00:10  S    Exit_status=0 resources_used.cput=00:00:27
                           resources_used.mem=0kb resources_used.vmem=0kb
                           resources_used.walltime=00:00:09
11/08/2011 19:00:10  A    user=matias group=matias jobname=job_gr_PBE
                           queue=batch ctime=1320764296 qtime=1320764296
                           etime=1320764296 start=1320789602 owner=matias at fe
                           exec_host=n11/7+n11/6+n11/5+n11/4
                           Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                           Resource_List.nodes=1:ppn=4
                           Resource_List.walltime=2400:00:00 session=7936
                           end=1320789610 Exit_status=0
                           resources_used.cput=00:00:27 resources_used.mem=0kb
                           resources_used.vmem=0kb
                           resources_used.walltime=00:00:09
11/08/2011 19:05:11  S    dequeuing from batch, state COMPLETE
[root at fe ~]#
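
The accounting records above also carry start= and end= epoch timestamps, so the runs that tracejob reports on n11 can be checked for overlap in time. This is only a sketch: intervals_overlap is a hypothetical helper, and it covers just the runs recorded in this tracejob output, not the n10 assignment shown by qstat.

```python
# Hypothetical helper: True if two half-open time intervals intersect.
def intervals_overlap(start_a, end_a, start_b, end_b):
    return start_a < end_b and start_b < end_a

# (start, end) epoch seconds copied from the tracejob accounting records.
run_477 = (1320789602, 1320789610)
run_480 = (1320791205, 1320791213)

overlap = intervals_overlap(*run_477, *run_480)
gap_seconds = run_480[0] - run_477[1]
print(overlap, gap_seconds)
```

With these values the two recorded runs on n11 do not overlap: job 480 started 1595 seconds after job 477 ended.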

I don't understand why the logs are missing from the directories
/var/spool/torque/mom_logs and /var/spool/torque/sched_logs.

Regards

        Fernando

----------------------------------------------------

Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
----------------------------------------------------


On 08/11/2011 07:17 PM, Hector Oliver wrote:
> What is the state of the jobs (tracejob #job)?
> Do both of them show up in qstat?
> Does your configuration allow several jobs at once?
>
> On Tue, Nov 8, 2011 at 3:58 PM, Fernando Caba <fcaba at uns.edu.ar 
> <mailto:fcaba at uns.edu.ar>> wrote:
>
>     Hi mauiusers, I have a job assigned to node10, cores 0 to 3, and
>     another job assigned to the same node and the very same cores
>     (0 to 3).
>     Does anybody have any idea what is happening? I have torque-3.0.1 and
>     maui-3.3.1.
>     Thanks
>