[torqueusers] jobs completing with processes still running

Michael Robbert mrobbert at mines.edu
Wed May 7 11:09:56 MDT 2008


Some good suggestions so far, but I should clarify that the problem is 
intermittent and does not happen on every run. This morning the user 
started two jobs and then had the next three fail. All of them ran the 
same code with slightly different parameters. Each job ran in its own 
directory, and the submission script was altered only slightly to 
account for the new directory. I was able to reproduce the situation 
with an interactive job: I used msub -I to get onto a node, changed 
directories as in the submission script, and ran their code. After a 
few seconds it returned "Done.". So the problem does seem to be that 
their code returns but keeps running. Right now I'm guessing that 
something in the user's environment is getting messed up, since he 
seems to be indicating that once it starts failing it continues to fail 
until he logs out and back in again. I'm going to ask for his command 
history the next time he sees this.
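
A rough sketch of one way to test the environment theory: save the 
output of `env` in a session where jobs work and again in one where 
they fail, then compare the two dumps. The Python below is only an 
illustration. The function names parse_env and diff_env and the sample 
values are made up for this example, and it assumes one NAME=value pair 
per line (multi-line values would confuse it).

```python
def parse_env(text):
    """Parse saved `env` output into a dict (assumes one NAME=value per line)."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(good_text, bad_text):
    """Return {name: (good_value, bad_value)} for variables that differ.

    A value of None means the variable was absent in that session.
    """
    good, bad = parse_env(good_text), parse_env(bad_text)
    changes = {}
    for name in sorted(set(good) | set(bad)):
        if good.get(name) != bad.get(name):
            changes[name] = (good.get(name), bad.get(name))
    return changes


# Hypothetical sample dumps, only to show the call.
good_dump = "PATH=/usr/bin\nHOME=/home/user"
bad_dump = "PATH=/usr/local/bin:/usr/bin\nHOME=/home/user"
for name, (old, new) in diff_env(good_dump, bad_dump).items():
    print(f"{name}: {old!r} -> {new!r}")
```

Anything that shows up only in the failing session (a changed PATH or 
LD_LIBRARY_PATH, for instance) would be a place to start looking.
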
I would also like to figure out why these processes continue to run 
after this false exit or after a canceljob. The code is run with mpirun 
and the user is using mvapich. We do not have an mpiexec in our mvapich 
path. I know that mpirun works fine for OpenMPI, and OpenMPI has an 
mpiexec. They are currently seeing huge speed advantages with mvapich, 
so until we work out the issues with OpenMPI and their code I can't 
tell them to use OpenMPI. The command line is not backgrounded, but it 
does redirect output to a file in the working directory. That shouldn't 
affect how the job runs, should it?
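
As a starting point for spotting the leftover processes on a node, here 
is a minimal sketch that scans the output of `ps -eo pid,ppid,user,comm` 
for processes owned by a given user whose parent is PID 1. This rests on 
an assumption: a process that outlives its launcher is normally 
re-parented to init, so a parent PID of 1 is a hint and not proof, since 
daemons legitimately have PID 1 as parent too. The function name 
find_orphans and the sample data are hypothetical.

```python
def find_orphans(ps_text, user):
    """Return (pid, command) for processes owned by `user` whose parent is PID 1.

    Expects text in the column order produced by `ps -eo pid,ppid,user,comm`.
    """
    orphans = []
    for line in ps_text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, ppid, owner, comm = fields
        if owner == user and ppid == "1":
            orphans.append((int(pid), comm.strip()))
    return orphans


# Hypothetical sample output, only to show the call.
sample = """\
  PID  PPID USER     COMMAND
    1     0 root     init
 4321     1 jdoe     charmm
 4400  4399 jdoe     bash
"""
for pid, comm in find_orphans(sample, "jdoe"):
    print(pid, comm)
```

On a live node the text could come from something like 
subprocess.run(["ps", "-eo", "pid,ppid,user,comm"], capture_output=True, 
text=True).stdout. Note that ps may show a numeric UID in place of a 
long user name, in which case the comparison would need adjusting.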

Thanks,
Mike Robbert
Colorado School of Mines

Smith, Jerry Don II wrote:
> What job launch mechanism is the cluster using, mpirun or mpiexec?  You want to make sure it is one the resource manager can track processes with, to ensure cleanup.
> You can look in $PBS_HOME/server_priv/accounting and grep for the node in order to get the job_id. There is also a setting in moab and torque to hold completed job info for X time, so you can use checkjob and/or qstat -f to do further diagnosis. I am not at my desk so I can't tell you the parameter offhand.
>
> Good luck,
>
> Jerry Smith
>
> ----- Original Message -----
> From: torqueusers-bounces at supercluster.org <torqueusers-bounces at supercluster.org>
> To: Michael Robbert <mrobbert at mines.edu>
> Cc: Torque Users <torqueusers at supercluster.org>
> Sent: Wed May 07 09:11:36 2008
> Subject: Re: [torqueusers] jobs completing with processes still running
>
> Hi Mike,
>         If it were me I would try running the program on one of the nodes
> that isn't allocated. You could even create a system reservation so
> no jobs were to get scheduled on the node while you test the
> application. By running the program yourself you eliminate pbs/moab
> from the problem. Once you are certain the application works as
> expected then I would start using the queue system again to see if
> your problem gets introduced there. Hope this helps =).
>
> -Steve
>
>
>
> On May 6, 2008, at 11:49 AM, Michael Robbert wrote:
>
>   
>> I am a new Torque user so be gentle. We are running Moab 5.2.1 and
>> Torque 2.3.0 and we have a user that is submitting jobs (user
>> compiled CHARMM if that matters) and often their jobs are returning
>> within a few seconds and the only data in their Output/Error file
>> is "Done.". The job disappears from the queue as would be expected,
>> but the problem is that their code is still running on all cores of
>> all nodes that they started it on. I run "mdiag -n" and see these
>> nodes show up as idle but load is HIGH.
>> Are there any ideas of what could cause this to happen? Should I be
>> looking at their code? We only have a few users so far, but so far
>> theirs is the only code doing this. What commands in Moab or Torque
>> should I be using to detect and solve this issue? So far it is just
>> mdiag and communication with the user. Specifically how can I find
>> out the jobid of a job that has completed given that I know what
>> nodes it was running on? And then how can I peer into the guts of
>> that job to find out what it was doing?
>> I don't expect to get an answer to the problem, but hope I can find
>> out how to research it. I have searched the list archives and have
>> been trying to read as much documentation as I can, but I'm still
>> stumped. I just signed up for MoabCon so hopefully I'll be an
>> expert after that.
>>
>> Thanks for any suggestions,
>> Mike Robbert
>> Colorado School of Mines
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>     
>

