[torqueusers] momctl error - A description
michael young
mhyoung at valdosta.edu
Tue Mar 6 17:28:05 MST 2007
Garrick Staples wrote:
>On Tue, Mar 06, 2007 at 06:14:18PM -0500, michael young alleged:
>
>
>>The thing that bothers me is that there is a job running currently.
>>Should it show up with 'qstat'?
>>
>>
>
>Yes, if there is a job running then it shows up in 'qstat'. If it
>doesn't, then there isn't a job.
>
>
>
>>>The jobid is first printed to the user when running qsub, and running
>>>'qstat' with no arguments lists all jobs with their ids.
>>>
>>>
>>>
>>Our users use a GUI interface to submit jobs from Spartan '04. It does
>>not return the jobid.
>>
>>
>
>Then I suspect it is just directly running processes and not actually
>talking to TORQUE.
>
>Do you see any activity in pbs_server's logfile indicating job
>submissions?
>
>
I submitted a job from Spartan and the logs did not show anything.
Then I ran 'echo "sleep 30" | qsub'. It gave me the same error message,
but this time it did show up in the logs.
Here are the log entries:
######start logs##########
03/06/2007 18:51:41;0100;PBS_Server;Req;;Type AuthenticateUser request received from root@cluster.chemistry.valdosta.edu, sock=11
03/06/2007 18:51:41;0100;PBS_Server;Req;;Type QueueJob request received from root@cluster.chemistry.valdosta.edu, sock=10
03/06/2007 18:51:41;0080;PBS_Server;Req;req_reject;Reject reply code=15023(Bad UID for job execution), aux=0, type=QueueJob, from root@cluster.chemistry.valdosta.edu
########end logs###########
How can I run a job like 'echo "sleep 30" | qsub' that will work?
>You might also be doing this backwards. Try running an interactive job
>first, and then running Spartan from within the job.
>
>
>
>
>>>>>Since your master node isn't running pbs_mom, this implies that the
>>>>>problem is in your job script. Is your job script using $PBS_NODEFILE
>>>>>to spawn the processes?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>Where do I find the job script?
>>>>I ran 'env' and there is no $PBS_NODEFILE in the output.
>>>>
>>>>
>>>>
>>>>
>>>Inside of the job environment, the job will have the list of nodes
>>>assigned to the job in the file named in $PBS_NODEFILE.
>>>
>>>
>
>A job consists of a user-written job script that does what the user
>wants and is submitted to TORQUE with qsub. TORQUE then sends the
>job to a node, where the job script is executed.
>
>
>
>
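(For reference, a job script of the kind described above can be very
small. The following is only a sketch: the job name and the file name
nodecheck.sh are made up for illustration, and the /dev/null fallback is
my own addition so that the script also runs by hand outside a job.)

```shell
#!/bin/sh
#PBS -N nodecheck
#PBS -l nodes=1
# Sketch of a job script; the job name and the file name (nodecheck.sh)
# are hypothetical.

# Inside a job, TORQUE sets PBS_NODEFILE to a file listing the assigned
# nodes. The /dev/null fallback only exists so the script can be run
# outside a job without failing.
nodefile="${PBS_NODEFILE:-/dev/null}"

# One line per assigned slot; tr strips any padding some wc versions add.
np=$(wc -l < "$nodefile" | tr -d ' ')

echo "host: $(uname -n)"
echo "assigned slots: $np"
cat "$nodefile"
```

Submitted as a non-root user with 'qsub nodecheck.sh', this should report
the execution host and the node list once the job has run. Run directly
from a login shell it reports zero slots, which matches 'env' showing no
$PBS_NODEFILE outside a job.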
>>How do I get to this job environment?
>>
>>In my reading on this, a doc said to run 'echo "sleep 30" | qsub' to
>>give me a second job.
>>It returns "qsub: Bad UID for job execution".
>>
>>
>
>Are you running qsub on the same host that is running pbs_server, or a
>different host? You aren't running it as root, right?
>
>
>
Yes, on the same host as pbs_server.
I was trying to run it as root. I looked in /home and saw a user called
'spartan', so I su'd to that user and reran the command.
This time it returned "253.cluster.chemistry.valdosta.edu"
Then, I ran 'qstat'. Still nothing.
Then, I ran 'qstat -n 253' and it returned "stat: Unknown Job Id
253.cluster.chemistry.valdosta.edu"
Running 'ps ax' shows no job running either.
Here are the logs from it.
##########start logs###########
03/06/2007 19:24:42;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan@cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:42;0100;PBS_Server;Req;;Type StatusJob request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan@cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type StatusServer request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type StatusJob request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/06/2007 19:24:48;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from spartan@cluster.chemistry.valdosta.edu
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan@cluster.chemistry.valdosta.edu, sock=12
03/06/2007 19:24:48;0100;PBS_Server;Req;;Type LocateJob request received from spartan@cluster.chemistry.valdosta.edu, sock=11
03/06/2007 19:24:48;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from spartan@cluster.chemistry.valdosta.edu
#############end logs############
The log file also shows this every 32 seconds:
#########start logs#########
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusNode request received from root@cluster.chemistry.valdosta.edu, sock=9
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusQueue request received from root@cluster.chemistry.valdosta.edu, sock=9
03/06/2007 19:24:36;0100;PBS_Server;Req;;Type StatusJob request received from root@cluster.chemistry.valdosta.edu, sock=9
#########end logs##################
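In case it helps anyone reading along, the rejections can be picked out
of a server log with a small filter. This is only a sketch: the function
name rejects is made up, and it assumes each record sits on one line with
six ';'-separated fields, with req_reject in the fifth field as in the
entries quoted in this message.

```shell
# Print the timestamp and message of every rejected request.
# Reads the files given as arguments, or standard input if there are none.
# "rejects" is a hypothetical helper name, not a TORQUE command.
rejects() {
    awk -F';' '$5 == "req_reject" { print $1 ": " $6 }' "$@"
}

# Example using two records from this message, each joined onto one line.
printf '%s\n' \
  '03/06/2007 18:51:41;0100;PBS_Server;Req;;Type QueueJob request received from root@cluster.chemistry.valdosta.edu, sock=10' \
  '03/06/2007 18:51:41;0080;PBS_Server;Req;req_reject;Reject reply code=15023(Bad UID for job execution), aux=0, type=QueueJob, from root@cluster.chemistry.valdosta.edu' \
  | rejects
```

Only the second record is printed, so the "Bad UID for job execution"
rejection stands out from the routine request lines.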
>
>
>
>>>For example, if you are launching an MPI program with mpirun, then you
>>>would pass the nodes with something like:
>>>
>>>np=`wc -l < $PBS_NODEFILE`
>>>mpirun -machinefile $PBS_NODEFILE -np $np ./command
>>>
>>>
>>>
>>>
>>Does Linux come with an MPI program I can run, or do I download one or
>>build one?
>>Sorry, I'm really new to this whole clustering business.
>>I do know Linux fairly well though.
>>
>>
>
>If you aren't currently running MPI programs, then there is no point in
>getting an MPI implementation. I was just using it as an example of a
>common type of job that could be run within PBS.
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>