[torqueusers] momctl error - A description

michael young mhyoung at valdosta.edu
Tue Mar 6 17:28:05 MST 2007


Garrick Staples wrote:

>On Tue, Mar 06, 2007 at 06:14:18PM -0500, michael young alleged:
>  
>
>>The thing that bothers me is that there is a job running currently.
>>Should it show up with 'qstat'?
>>    
>>
>
>Yes, if there is a job running then it shows up in 'qstat'.  If it
>doesn't, then there isn't a job.
>
>  
>
>>>The jobid is first printed to the user when running qsub, and running
>>>'qstat' with no arguments lists all jobs with their ids.
>>>
>>>      
>>>
>>Our users use a GUI interface to submit jobs from Spartan '04. It does 
>>not return the jobid.
>>    
>>
>
>Then I suspect it is just directly running processes and not actually
>talking to TORQUE.
>
>Do you see any activity in pbs_server's logfile indicating job
>submissions?
>  
>

I submitted a job from Spartan and the logs did not show anything.
Then I ran 'echo "sleep 30" | qsub'. It gave me the same error message,
but this time it did show up in the logs.
Here are the log entries.

######start logs##########
03/06/2007 18:51:41;0100;PBS_Server;Req;;Type AuthenticateUser request received from root at cluster.chemistry.valdosta.edu, sock=11
03/06/2007 18:51:41;0100;PBS_Server;Req;;Type QueueJob request received from root at cluster.chemistry.valdosta.edu, sock=10
03/06/2007 18:51:41;0080;PBS_Server;Req;req_reject;Reject reply code=15023(Bad UID for job execution), aux=0, type=QueueJob, from root at cluster.chemistry.valdosta.edu
########end logs###########
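
For picking the rejections out of a busy server log, something like the following might help. It is only a sketch: the function name pbs_rejects is made up for illustration, and it assumes each entry sits on a single line in the log file itself and follows the "Reject reply code=N(reason), aux=N, type=T" layout seen in the entries above.

```shell
# Print "code type reason" for every rejected request read from stdin.
# pbs_rejects is a hypothetical helper name, not a TORQUE command.
# Assumes one log entry per line, in the layout shown in the entries above.
pbs_rejects() {
    sed -n 's/.*Reject reply code=\([0-9][0-9]*\)(\([^)]*\)), aux=[0-9-]*, type=\([A-Za-z]*\).*/\1 \3 \2/p'
}
```

It reads from standard input, so it can be fed a log file with a redirect (pbs_rejects < logfile, where logfile is whichever server log is being inspected). Lines that are not rejections produce no output.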

How can I run a job like 'echo "sleep 30" | qsub' that will work?


>You might also be doing this backwards.  Try running an interactive job
>first, and then running Spartan from within the job.
>
>
>  
>
>>>>>Since your master node isn't running pbs_mom, this implies that the
>>>>>problem is in your job script.  Is your job script using $PBS_NODEFILE
>>>>>to spawn the processes?
>>>>>
>>>>>
>>>>>          
>>>>>
>>>>Where do I find the job script?
>>>>I did a 'env' and there is no "$PBS_NODEFILE"
>>>>  
>>>>
>>>>        
>>>>
>>>Inside of the job environment, the job will have the list of nodes
>>>assigned to the job in the file named in $PBS_NODEFILE.
>>>      
>>>
>
>A job consists of a user-written job script that does what the user
>wants and is submitted to TORQUE with qsub.  TORQUE will then send the
>job to a node where the job script is executed.
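
As a rough illustration of the kind of job script described here, a minimal one might look like the sketch below. The job name, the resource requests, and the helper function assigned_slots are made up for the example and are not taken from this cluster. The only TORQUE-specific parts are the #PBS directive lines and the $PBS_NODEFILE variable, which is set only inside a running job.

```shell
#!/bin/sh
# Hypothetical job script; the job name and resource values are illustrative.
#PBS -N nodecheck
#PBS -l nodes=1
#PBS -l walltime=00:05:00

# Print how many node slots TORQUE assigned to this job.
# PBS_NODEFILE is set by TORQUE inside a job; outside a job it is unset,
# so report 0 instead of failing.
assigned_slots() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        echo 0
    fi
}

echo "running on $(hostname), slots assigned: $(assigned_slots)"
```

Saved under some name of your choosing and passed to qsub as a regular (non-root) user, a script like this would run on whichever node TORQUE picks. Run directly from a login shell, it just reports 0 slots, because no job environment exists there.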
>
>
>  
>
>>How do I get to this job environment?
>>
>>In my reading on this, a doc said to run 'echo "sleep 30" | qsub' to
>>give me a second job.
>>It returns "qsub: Bad UID for job execution".
>>    
>>
>
>Are you running qsub on the same host that is running pbs_server, or a
>different host?  You aren't running it as root, right?
>
>  
>
Yes, on the same host as pbs_server.
I was trying to run it as root. I looked in /home and saw a user called
'spartan', so I su'd to it and reran the command.
This time it returned "253.cluster.chemistry.valdosta.edu".
Then I ran 'qstat'. Still nothing.
Then I ran 'qstat -n 253' and it returned "stat: Unknown Job Id
253.cluster.chemistry.valdosta.edu".
Running 'ps ax' shows no job running either.
Here are the logs from it.

##########start logs###########
03/06/2007 19:24:42;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan at cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:42;0100;PBS_Server;Req;;Type StatusJob request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan at cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type StatusServer request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type StatusJob request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from spartan at cluster.chemistry.valdosta.edu
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan at cluster.chemistry.valdosta.edu, sock=12
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type LocateJob request received from spartan at cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:48;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from spartan at cluster.chemistry.valdosta.edu
#############end logs############


The log files also show this every 32 seconds:
#########start logs#########
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusNode request received from root at cluster.chemistry.valdosta.edu, sock=9
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusQueue request received from root at cluster.chemistry.valdosta.edu, sock=9
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusJob request received from root at cluster.chemistry.valdosta.edu, sock=9
#########end logs##################

> 
>  
>
>>>For example, if you are launching an MPI program with mpirun, then you
>>>would pass the nodes with something like:
>>>
>>>np=`wc -l < $PBS_NODEFILE`
>>>mpirun -machinefile $PBS_NODEFILE -np $np ./command
>>>
>>>
>>>      
>>>
>>Does Linux come with an MPI program I can run, or do I download one or
>>build one?
>>Sorry, I'm really new to this whole clustering business.
>>I do know Linux fairly well though.
>>    
>>
>
>If you aren't currently running MPI programs, then there is no point in
>getting an MPI implementation.  I was just using it as an example of a
>common type of job that could be run within PBS.
>

