Container Troubleshooting¶

This document summarizes common issues, error messages, and solutions for Container Support.

Logs and Diagnostics¶

If you run into issues with Container Support, collect the following logs and diagnostic information to assist troubleshooting:

View Supervisor Logs¶

ls /var/crane/supervisor/
cat /var/crane/supervisor/JOBID.STEPID.log
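The log path can also be built from the job and step IDs. The sketch below assumes the JOBID.STEPID.log naming shown above; the helper names `supervisor_log_path` and `show_supervisor_log` are hypothetical and not part of the Crane tooling.

```shell
# Hypothetical helpers (not part of Crane): locate and print a step's Supervisor log.
# Assumes logs are named /var/crane/supervisor/JOBID.STEPID.log, as listed above.
supervisor_log_path() {
  printf '/var/crane/supervisor/%s.%s.log\n' "$1" "$2"
}

show_supervisor_log() {
  log=$(supervisor_log_path "$1" "$2")
  if [ -r "$log" ]; then
    # Print the last N lines (default: 50)
    tail -n "${3:-50}" "$log"
  else
    echo "No readable Supervisor log at $log" >&2
    return 1
  fi
}
```

For example, `show_supervisor_log 123 1` would print the tail of the log for step 1 of job 123, and an optional third argument changes the number of lines.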

Verify CRI Connection¶

Note

Administrator access only.

# Test with crictl
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock version
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock images

Check Container Runtime Status¶

systemctl status containerd
journalctl -u containerd -f
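The two checks above can be combined into a quick pass/fail summary. This is a minimal sketch for administrators, assuming the containerd endpoint used in the commands above; `check` and `runtime_health` are hypothetical names.

```shell
# Hypothetical helper: run a command quietly and report OK/FAIL with a label.
check() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
  fi
}

# Assumes the containerd endpoint shown in the crictl commands above.
runtime_health() {
  endpoint="unix:///run/containerd/containerd.sock"
  check "containerd service is active" systemctl is-active --quiet containerd
  # sudo -n fails instead of prompting when a password would be required
  check "CRI endpoint responds" sudo -n crictl --runtime-endpoint "$endpoint" version
}
```

Calling `runtime_health` prints one line per check. A FAIL line only narrows down where to look; the journalctl output remains the place to find the actual cause.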

Common Errors and Solutions¶

Container Feature Not Enabled¶

Symptom

Error: Failed to pull image "myimage:latest" for ...

Cause

The cluster administrator has not enabled the container feature in config.yaml.

Solution

Contact the administrator to enable the container feature as described in Container Deployment:

Container:
  Enabled: true
  RuntimeEndpoint: /run/containerd/containerd.sock

Invalid Step ID¶

Symptom

Error: Invalid container ID format

Possible Cause

The container ID is not in the expected format.

Solution

Use the correct format, JOBID.STEPID:

Note

Use ccon ps -a to list container IDs. Step ID 0 is reserved for the Daemon Step (Pod), which users cannot operate on.

# View container IDs
ccon ps -a

# Correct format
ccon logs 123.1
ccon attach 123.1
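A script that passes IDs to ccon can validate them first. The sketch below assumes both JOBID and STEPID are plain decimal integers, as in the example 123.1, and rejects step 0 because it is the Daemon Step; `is_valid_container_id` is a hypothetical name.

```shell
# Hypothetical helper: succeed only for IDs of the form JOBID.STEPID
# (assumed to be decimal integers) whose step is not 0.
is_valid_container_id() {
  id="$1"
  case "$id" in
    # Reject empty input, non-digit characters, leading/trailing dots, extra dots
    ""|*[!0-9.]*|.*|*.|*.*.*) return 1 ;;
  esac
  case "$id" in
    *.*) ;;          # exactly one dot remains possible at this point
    *) return 1 ;;   # no dot: the step ID is missing
  esac
  step="${id#*.}"
  # Step 0 is the Daemon Step (Pod) and cannot be operated by users
  [ "$step" -ne 0 ]
}
```

For example, `is_valid_container_id 123.1 && ccon logs 123.1` runs the command only when the ID passes the check.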

Cannot Connect to Container¶

Symptom

Error: Failed to attach to container 123.1: connection refused

Possible Causes

The container has exited
The job was queued for too long and the CLI connection timed out
The CRI runtime (e.g., containerd) is not configured to allow remote connections
Network issues

Solutions

Check container status:
```
ccon ps -a
```
If the container has exited, check the exit reason. If the job is still queued, wait until it has been scheduled before retrying.
Verify that containerd is configured to allow remote container connections.

Empty Container Logs¶

Symptom

ccon logs 123.1
# No output

Possible Causes

The process inside the container is configured to write its output to a file
The container has not produced any output yet
Logs have been manually cleaned
The log folder is not on shared storage, so the current node cannot access it

Solutions

Check whether the container failed to start
Check whether log files exist on the node where the container is running
Verify that the process inside the container produces output, and that the output has not been manually cleaned

exec Command Failed¶

Symptom

Error: Command not found in container

Cause

The specified command does not exist in the container image.

Solution

Take Bash as an example: some container images do not include Bash and provide only sh.

Check available shells:

ccon exec 123.1 /bin/sh -c "which bash sh"

Use /bin/sh instead of /bin/bash:
```
ccon exec -it 123.1 /bin/sh
```

Image Pull Failed¶

Symptom

Error: Failed to pull image "myimage:latest" for ...

Possible Causes

Typo in the image name
The network cannot reach the image registry
The private registry requires authentication

Solutions

For private registries, authenticate with ccon login:
```
ccon login registry.example.com
```
If the image already exists locally on the node, use --pull-policy Never to skip pulling:
```
ccon -p CPU run --pull-policy Never myimage:latest -- cmd
```

Administrators can use ctr or crictl to check the image pull status in the CRI runtime, for example:

crictl images

Permission Denied Inside Container¶

Symptom

Permission denied: '/data/file'

Possible Cause

A user namespace mapping failure prevents the container user from accessing mounted host files.

Solutions

Disable user namespace with --userns=false:

ccon -p CPU run --userns=false -v /data:/data myimage -- cmd

If the kernel version is too low, or the mounted directory's filesystem does not support ID-Mapped Mounts:
- Ask the administrator to upgrade the kernel.
- Ensure the mounted directory is on a filesystem that supports ID-Mapped Mounts, or ask the administrator to configure the BindFs solution.

GPU Device Unavailable¶

Symptom

The step requested GPU resources and the GPU works on bare metal outside the container, but GPU devices are not accessible inside the container.

Possible Cause

The administrator has not correctly configured the container runtime to enable GPU support.

Solution

Refer to Container Deployment and contact the GPU vendor for the correct configuration method (e.g., NVIDIA Container Toolkit, Ascend Docker Toolkit).

Port Binding Failed¶

Symptom

Error: Port 8080 is already in use

Cause

The specified host port is already occupied.

Solution

Use a different port:

ccon -p CPU run -p 8081:80 nginx:latest
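Before choosing a replacement port, it can help to see which ports already have listeners. This is a minimal sketch, assuming the iproute2 `ss` utility is available and that you can run it on the node where the container will be placed; `port_in_listing` is a hypothetical helper.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds if a
# listening socket's local address (4th column) ends in :PORT.
port_in_listing() {
  awk -v p="$1" '
    { n = split($4, parts, ":"); if (parts[n] == p) found = 1 }
    END { exit !found }
  '
}

# Example (run on the target node):
#   if ss -ltnH | port_in_listing 8080; then echo "port 8080 is taken"; fi
```

The check only reflects the moment it runs, so a port that looks free may still be taken by the time the step starts.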

Error Code Reference¶

Container operations may return these error codes. See Error Code Reference for the complete list.

| Error Code | Description | Possible Causes | Solution |
|---|---|---|---|
| `ERR_CRI_GENERIC` | CRI runtime error | Internal error from CRI runtime (containerd/CRI-O) | Check container runtime logs, verify image exists and container config is correct |
| `ERR_CRI_DISABLED` | Container support disabled | Cluster has not configured container support | Contact administrator to enable container support (see Container Deployment) |
| `ERR_CRI_CONTAINER_NOT_READY` | Container not ready | Job is pending or container has not finished starting | Wait for job to enter Running state, check container status with `ccon ps` |
| `ERR_CRI_MULTIPLE_NODES` | Multi-node operation unsupported | Attempted unsupported container operation on multi-node step | Some container operations only support single-node steps |

Getting Help¶

If the above solutions don't resolve your issue:

Collect diagnostic information:
```
ccon --debug-level trace run ...
```
View detailed errors:
```
ccon inspect 123.1
ccon inspectp 123
```
Check system status:
```
cinfo -N
cqueue -j <job_id>
```
Contact the administrator with:
- Command used
- Job ID
- Complete error output
- Relevant Supervisor log excerpts
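The status commands above can be gathered into a single file to attach to a report. This is a sketch only: `capture` and `collect_report` are hypothetical helpers, the default output file name is an assumption, and the job ID is taken to be the part of JOBID.STEPID before the dot.

```shell
# Hypothetical helper: append a command line and its output to OUTFILE.
capture() {  # capture OUTFILE CMD [ARGS...]
  out="$1"; shift
  {
    printf '$ %s\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
      "$@" 2>&1 || echo "(exit status $?)"
    else
      echo "(command not found: $1)"
    fi
    echo
  } >> "$out"
}

# Hypothetical helper: run the diagnostic commands from this section.
collect_report() {  # collect_report JOBID.STEPID [OUTFILE]
  id="$1"
  job="${id%%.*}"                           # part before the dot
  report="${2:-container-report-$id.txt}"   # assumed default file name
  : > "$report"                             # start with an empty file
  capture "$report" ccon ps -a
  capture "$report" ccon inspect "$id"
  capture "$report" ccon inspectp "$job"
  capture "$report" cinfo -N
  capture "$report" cqueue -j "$job"
  echo "$report"
}
```

Running `collect_report 123.1` writes the outputs to one file and prints its path. The command used, the complete error output, and the Supervisor log excerpts still need to be added by hand.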
