[torqueusers] Checkpoint script failed with return value of 13
sm4082 at nyu.edu
Tue Jan 31 08:32:40 MST 2012
When I try to checkpoint a simple job, I see this error in the qstat -f output:
Checkpoint script failed with return value of 13
I see this in the system messages:
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
I found this comment in checkpoint_script:
# Note also that a request was made to identify whether this script was invoked
# by the job's owner or by a system administrator. While this information is
# known to pbs_server, it is not propagated to pbs_mom and thus it is not
# possible to pass this to the script. Therefore, a workaround is to invoke
# qmgr and attempt to set a trivial variable. This will fail if the invoker is
# not a manager.
Does anyone know exactly what I need to do here? I am not sure which trivial variable I need to set with qmgr.
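One possible reading of that comment, sketched in shell: ask qmgr to re-apply a server attribute and treat a failure as "not a manager". This is only a guess at what the script authors meant. The QMGR and PROBE variables and the invoker_is_manager function are hypothetical names, the choice of query_other_jobs is an assumption (it is already True in our configuration, so re-setting it should change nothing), and the check relies on qmgr exiting non-zero when the request is refused.

```shell
#!/bin/sh
# Sketch only: a guess at the "trivial variable" workaround described in the
# checkpoint_script comment. QMGR and PROBE are hypothetical knobs.

# Command used to talk to pbs_server; overridable so the check can be stubbed.
QMGR="${QMGR:-qmgr}"

# Directive to attempt. Assumption: re-setting an attribute to the value it
# already has is harmless (query_other_jobs is already True on our server).
PROBE="${PROBE:-set server query_other_jobs = True}"

# Returns 0 if qmgr accepted the directive (invoker is presumably a manager),
# non-zero if it was refused or qmgr could not be run.
invoker_is_manager() {
    "$QMGR" -c "$PROBE" >/dev/null 2>&1
}

if invoker_is_manager; then
    echo "invoked by a manager"
else
    echo "invoked by a non-manager (or qmgr unavailable)"
fi
```

Run as root at crunch.local (our only manager), this should take the first branch; run as an ordinary user, the second. It does not explain the "cannot change checkpoint directory owner" message, which is a separate failure.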
Our Server Attributes:
# Set server attributes.
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = crunch.its.nyu.edu
set server acl_hosts += crunch.local
set server managers = root@crunch.local
set server operators = root@crunch.local
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server submit_hosts = login-0-1
set server submit_hosts += login-0-0
set server submit_hosts += login-0-3
set server submit_hosts += login-0-2
set server allow_node_submit = False
set server next_job_number = 139165
If anyone knows how to get around this error, please let me know. I'd appreciate your help.
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110