[torquedev] File not found with heavy PBS use

Luiz Angelo Daros de Luca luizluca at gmail.com
Wed Jul 15 09:14:46 MDT 2009


Hello List,

I have been a Torque PBS user for many years. Currently, I'm using torque 2.1.7
shipped with Mandriva Linux.
I have several processes (between 5 and 10) that submit jobs in cycles of 50
jobs, and each job takes about 30s to run. After submitting them, those
processes query the server to check whether the jobs have already run.

My pbs_server is dying from time to time. The messages I got are:

07/15/2009 01:50:14;0010;PBS_Server;Job;2915530.servidor.pcarga.local;Exit_status=0 resources_used.cput=00:00:09 resources_used.mem=47976kb resources_used.vmem=89560kb resources_used.walltime=00:00:11
07/15/2009 01:50:14;0008;PBS_Server;Job;2915534.servidor.pcarga.local;Job Modified at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:14;0008;PBS_Server;Job;2915534.servidor.pcarga.local;Job Run at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:22;0010;PBS_Server;Job;2915532.servidor.pcarga.local;Exit_status=0 resources_used.cput=00:00:08 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:10
07/15/2009 01:50:22;0008;PBS_Server;Job;2915535.servidor.pcarga.local;Job Modified at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:22;0008;PBS_Server;Job;2915535.servidor.pcarga.local;Job Run at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:22;0010;PBS_Server;Job;2915534.servidor.pcarga.local;Exit_status=0 resources_used.cput=00:00:06 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:08
07/15/2009 01:50:22;0008;PBS_Server;Job;2915536.servidor.pcarga.local;Job Modified at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:22;0008;PBS_Server;Job;2915536.servidor.pcarga.local;Job Run at request of Scheduler at servidor.pcarga.local
07/15/2009 01:50:23;0010;PBS_Server;Job;2915531.servidor.pcarga.local;Exit_status=0 resources_used.cput=00:00:11 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:12
07/15/2009 01:50:23;0008;PBS_Server;Job;2915531.servidor.pcarga.local;purging job without checking MOM
07/15/2009 01:50:23;0001;PBS_Server;Svr;PBS_Server;No such file or directory (2) in job_save, cannot open file '/var/spool/torque/server_priv/jobs/2915531.ser.JB' for job 2915531.servidor.pcarga.local in state STAGEDEL (quick)
07/15/2009 01:50:23;0001;PBS_Server;Svr;PBS_Server;No such file or directory (2) in job_save, cannot open file '/var/spool/torque/server_priv/jobs/2915531.ser.JB' for job 2915531.servidor.pcarga.local in state EXITED (quick)
07/15/2009 01:50:23;0001;PBS_Server;Svr;PBS_Server;No such file or directory (2) in job_save, cannot open file '/var/spool/torque/server_priv/jobs/2915531.ser.JB' for job 2915531.servidor.pcarga.local in state COMPLETE (quick)


Isn't it strange that the job was purged after it had finished?
If it is finished, why bother writing the JB file?
These are the last lines in the log. Each time this happens, pbs_server dies.
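
To make the order of events for a single job easier to follow, the log can be grouped by job id. The following is only a rough sketch, not part of Torque: it assumes each entry sits on a single line in the server's log file and follows the `date time;code;server;type;object;message` layout seen above. The names `parse_line`, `events_by_job` and `SAMPLE` are made up for this illustration.

```python
import re
from collections import defaultdict

JOB = "2915531.servidor.pcarga.local"
JB_FILE = "/var/spool/torque/server_priv/jobs/2915531.ser.JB"

# Entries for one job, copied from the log above (one entry per line).
SAMPLE = [
    "07/15/2009 01:50:23;0010;PBS_Server;Job;" + JOB + ";Exit_status=0 "
    "resources_used.cput=00:00:11 resources_used.mem=0kb "
    "resources_used.vmem=0kb resources_used.walltime=00:00:12",
    "07/15/2009 01:50:23;0008;PBS_Server;Job;" + JOB
    + ";purging job without checking MOM",
] + [
    "07/15/2009 01:50:23;0001;PBS_Server;Svr;PBS_Server;"
    "No such file or directory (2) in job_save, cannot open file "
    "'" + JB_FILE + "' for job " + JOB + " in state " + state + " (quick)"
    for state in ("STAGEDEL", "EXITED", "COMPLETE")
]


def parse_line(line):
    """Split one log entry into its six semicolon-separated fields."""
    stamp, code, server, kind, obj, message = line.rstrip("\n").split(";", 5)
    return {"time": stamp, "code": code, "server": server,
            "type": kind, "object": obj, "message": message}


def events_by_job(lines):
    """Group messages by job id, including 'Svr' entries that name a job."""
    jobs = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec["type"] == "Job":
            job_id = rec["object"]
        else:
            # Server-level entries mention the job inside the message text.
            match = re.search(r"for job (\S+)", rec["message"])
            if match is None:
                continue
            job_id = match.group(1)
        jobs[job_id].append((rec["time"], rec["message"]))
    return jobs


if __name__ == "__main__":
    for stamp, message in events_by_job(SAMPLE)[JOB]:
        print(stamp, message[:60])
```

Grouped this way, the entries for job 2915531 appear in the order exit, purge, then three failed saves, all within the same second.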

I checked the job_recov.c file, but it hasn't changed between my version and
the latest release. However, I think that the problem occurs before those calls.
It seems that some part of the code did not notice that the job was already done.
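
The errno in the log is consistent with a save attempted after the file was already removed. Here is a minimal sketch of that sequence, assuming (not verified against job_recov.c) that a "quick" save opens the existing job file for writing without creating it. `quick_save` and `demo` are hypothetical names and do not reproduce Torque's code.

```python
import errno
import os
import tempfile


def quick_save(path, data):
    """Rewrite an existing file; without O_CREAT a missing file is an error."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Return the errno raised when saving after the file has been removed."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "2915531.ser.JB")
    try:
        with open(path, "wb") as handle:
            handle.write(b"initial state")
        quick_save(path, b"EXITED")        # succeeds while the file exists
        os.unlink(path)                    # stands in for the purge
        try:
            quick_save(path, b"COMPLETE")  # late state change after the purge
        except OSError as exc:
            return exc.errno
        return None
    finally:
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(workdir)


if __name__ == "__main__":
    print(demo() == errno.ENOENT)
```

If that assumption holds, any state transition saved after the purge would fail with ENOENT, which would match the three job_save errors above. It would not by itself explain why pbs_server dies afterwards.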

Any clues? If this is known, is there a patch for this?

Thanks,

---
    Luiz Angelo Daros de Luca, Me.
           luizluca at gmail.com