[torqueusers] qstat: End of file
Rudge, Chris M. (Dr.)
cmr9 at leicester.ac.uk
Mon Nov 28 14:45:56 MST 2011
We had a problem today running v2.5.5, which has been running without issue for some time. A user submitted a very large number of jobs (more than 40,000 over the course of the day), many of which have dependencies. We use Moab as the scheduler and have maxjobs set to 32768, which has previously been OK; my understanding is that any jobs in excess of that limit simply remain in torque, and I have seen this happen in the past.
When torque crashed, an attempt was made to clear out the queued jobs by deleting from server_priv/jobs the .SC and .JB files belonging to the jobs that user had submitted during the day.
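A manual cleanup like this can leave a job with only one of its two files. The following is a minimal sketch of how that could be checked. It is not a TORQUE tool and not a diagnosis of the crash. The function name find_unpaired is hypothetical, and the path in DEFAULT_JOBS_DIR is an assumption about the spool location that should be adjusted for the local install. A job with a .JB file but no .SC file may be legitimate for some job types, so the result is only a list of candidates to inspect.

```python
import os
from collections import defaultdict

# Assumption: a common default spool location; change it to match the local install.
DEFAULT_JOBS_DIR = "/var/spool/torque/server_priv/jobs"


def find_unpaired(jobs_dir):
    """Map each job file stem to the suffixes present, for jobs missing .SC or .JB.

    Files with any other suffix are ignored.
    """
    seen = defaultdict(set)
    for name in os.listdir(jobs_dir):
        # "1234.server.JB" splits into ("1234.server", ".JB")
        stem, ext = os.path.splitext(name)
        if ext in (".SC", ".JB"):
            seen[stem].add(ext)
    # Keep only the stems for which one of the two files is absent.
    return {stem: sorted(exts) for stem, exts in seen.items() if len(exts) < 2}
```

Calling find_unpaired(DEFAULT_JOBS_DIR) as a user able to read server_priv, ideally against a copy of the directory taken while pbs_server is stopped, returns a dictionary keyed by job file name with the single suffix that was found for each.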
The pbs_server process starts correctly and reads in the jobs, and targeted qstat commands such as "qstat -B" or "qstat -u <uid>" all work correctly. However, running "qstat -u ..." for the user who submitted the large number of jobs, or simply running qstat to display all jobs, produces the output "qstat: End of file" and crashes the pbs_server process. Letting Moab run as the scheduler also crashes pbs_server when it queries torque for all of the job info.
I'm guessing that there is some corruption in the serverdb (though I could be wrong about this), since there is evidently still a record of jobs for the "bad" user, given the crash when querying his jobs. Is this a reasonable conclusion to reach, or are there other known reasons for encountering this issue?
If it is a corrupted serverdb, given that there are many jobs running and others queued, is there a way to recover the db without losing the running/queued jobs?
Dr Chris Rudge - Research Computing Services Manager
IT Services, University of Leicester, LE1 7RH
Tel: 0116 2522223
Times Higher Education University of the year 2008/9