[torqueusers] Checkpoint script failed with return value of 13

Sreedhar Manchu sm4082 at nyu.edu
Tue Jan 31 08:32:40 MST 2012


Hi,

When I try to checkpoint a simple job, I see this error in the qstat -f output:

Checkpoint script failed with return value of 13

I also see this in the system messages:

Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13 
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13

I found this comment in checkpoint_script:

# Note also that a request was made to identify whether this script was invoked
# by the job's owner or by a system administrator.  While this information is
# known to pbs_server, it is not propagated to pbs_mom and thus it is not
# possible to pass this to the script.  Therefore, a workaround is to invoke
# qmgr and attempt to set a trivial variable. This will fail if the invoker is
# not a manager.
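
As far as I can tell, the check that comment describes would look roughly like the sketch below. This is only my guess at its shape, not code taken from checkpoint_script. The function name is_manager and the QMGR_CMD override are hypothetical names I made up for illustration, and the choice of the scheduling attribute (re-set to the value it already has on our server) is an assumption; I do not know which attribute the script authors had in mind.

```shell
#!/bin/sh
# Hypothetical sketch of the "set a trivial variable" check.
# QMGR_CMD is an illustrative override so the check can be exercised
# without a real pbs_server; it defaults to the real qmgr client.

is_manager() {
    # Re-set an attribute to the value it already has, so a successful
    # call changes nothing. Using "scheduling" here is an assumption.
    # qmgr should exit non-zero when the caller lacks the privilege.
    ${QMGR_CMD:-qmgr} -c "set server scheduling = True" >/dev/null 2>&1
}

if is_manager; then
    echo "invoked by a manager"
else
    echo "invoked by a non-manager (or qmgr is unavailable)"
fi
```

One caveat with this sketch: a failure only tells you that qmgr refused or could not run, so it cannot distinguish "not a manager" from "pbs_server unreachable".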

Does anyone know exactly what I need to do here? I am not sure which trivial variable I need to set with qmgr.

Our Server Attributes:

# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = crunch.its.nyu.edu
set server acl_hosts += crunch.local
set server managers = root@crunch.local
set server operators = root@crunch.local
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server submit_hosts = login-0-1
set server submit_hosts += login-0-0
set server submit_hosts += login-0-3
set server submit_hosts += login-0-2
set server allow_node_submit = False
set server next_job_number = 139165

If anyone knows how to get around this error, please let me know. I'd appreciate your help.

Thanks,
Sreedhar.

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
