Container Troubleshooting
This document summarizes common issues, error messages, and solutions for Container Support.
Logs and Diagnostics
When you run into a problem with Container Support, collect the following logs and diagnostic information to assist troubleshooting:
View Supervisor Logs
Verify CRI Connection
Note
Administrator access only.
# Test with crictl
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock version
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock images
Check Container Runtime Status
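A minimal sketch of a status check, assuming containerd is the CRI runtime and is managed by systemd under the unit name containerd (both are assumptions; substitute your runtime's unit name). The helper name runtime_status and the SYSTEMCTL override are illustrative and not part of Container Support.

```bash
# Assumption: the runtime is a systemd unit; "containerd" is only the default guess.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print the unit state reported by `systemctl is-active` (e.g. active, inactive, failed),
# or "unknown" when systemctl prints nothing.
runtime_status() {
    unit="${1:-containerd}"
    state="$($SYSTEMCTL is-active "$unit" 2>/dev/null || true)"
    echo "${state:-unknown}"
}
```

Calling runtime_status prints the state of the default unit. If it is anything other than active, the unit's recent journal (for example sudo journalctl -u containerd --since "10 minutes ago") is usually the next place to look.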
Common Errors and Solutions
Container Feature Not Enabled
Symptom
Cause
The cluster administrator has not enabled the container feature in config.yaml.
Solution
Ask the administrator to enable the container feature as described in Container Deployment.
Invalid Step ID
Symptom
Possible Cause
The container ID is not in the expected format.
Solution
Use the format JOBID.STEPID.
Note
Run ccon ps -a to list container IDs. Step ID 0 belongs to the Daemon Step (Pod), which users cannot operate on.
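The format rule can be checked before an ID is passed to any command. The sketch below assumes both JOBID and STEPID are non-negative decimal integers, which the text does not state explicitly; check_container_id is a hypothetical helper, not part of ccon.

```bash
# Validate a container ID of the form JOBID.STEPID.
# Assumption: both parts are non-negative decimal integers.
check_container_id() {
    id="$1"
    job="${id%%.*}"    # text before the first "."
    step="${id#*.}"    # text after the first "."
    # With no "." both expansions return the whole ID, so the rebuilt string differs.
    if [ "$job.$step" != "$id" ]; then
        echo "invalid: expected JOBID.STEPID"
        return 1
    fi
    case "$job" in
        ''|*[!0-9]*) echo "invalid: expected JOBID.STEPID"; return 1 ;;
    esac
    case "$step" in
        ''|*[!0-9]*) echo "invalid: expected JOBID.STEPID"; return 1 ;;
    esac
    # Step 0 is the Daemon Step (Pod) and cannot be operated on by users.
    if [ "$step" -eq 0 ]; then
        echo "reserved: step 0 is the Daemon Step (Pod)"
        return 1
    fi
    echo "ok"
}
```

For example, check_container_id 1234.1 prints ok, while 1234 and 1234.0 are both rejected with a non-zero exit status.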
Cannot Connect to Container
Symptom
Possible Causes
- The container has exited
- The job was queued for too long and the CLI connection timed out
- The CRI runtime (e.g., containerd) is not configured to allow remote connections
- Network issues
Solutions
- Check container status:
  If the container has exited, check the exit reason. If the job is still queued, wait until it has been scheduled before retrying.
- Verify that containerd has the remote container connection feature enabled.
Empty Container Logs
Symptom
Possible Causes
- The process inside the container is configured to write its output to a file
- The container has not produced any output yet
- The logs have been manually cleaned
- The log folder is not on shared storage, so the current node cannot access it
Solutions
- Check whether the container failed to start
- Check whether log files exist on the node where the container is running
- Verify that the process inside the container produces output and that the output has not been manually cleaned
exec Command Failed
Symptom
Cause
The specified command does not exist in the container image.
Solution
Taking Bash as an example: some container images do not include Bash, only sh:
- Check available shells:
- Use /bin/sh instead of /bin/bash:
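The shell check can be sketched as a small POSIX snippet intended to be run inside the container. It assumes /bin/sh is present to run it at all, which holds for most but not all images; pick_shell is a hypothetical helper name.

```bash
# Print the first candidate shell that exists and is executable.
# Bash is preferred; /bin/sh is the fallback.
pick_shell() {
    for candidate in /bin/bash /bin/sh; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

If pick_shell prints /bin/sh, use that path in place of /bin/bash when executing commands in the container.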
Image Pull Failed
Symptom
Possible Causes
- Typo in the image name
- The network cannot reach the image registry
- The private registry requires authentication
Solutions
- For private registries, authenticate with ccon login:
- Use --pull-policy Never to skip pulling (if the image already exists locally on the node):
Administrators can use the ctr/crictl tools to check the CRI runtime's image pull status, for example:
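A minimal sketch of such a check with crictl, assuming containerd's CRI socket is at the path used earlier in this document. The helper name image_status and the CRICTL override are illustrative. The image reference has to be spelled the way the runtime lists it, which is typically fully qualified (for example docker.io/library/ubuntu:22.04).

```bash
# Assumption: same runtime endpoint as in "Verify CRI Connection".
CRICTL="${CRICTL:-sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock}"

# Print "present" if REPOSITORY:TAG appears in `crictl images`, otherwise "missing".
image_status() {
    # Skip the header row, then join the IMAGE and TAG columns.
    if $CRICTL images 2>/dev/null | awk 'NR > 1 { print $1 ":" $2 }' | grep -Fxq "$1"; then
        echo "present"
    else
        echo "missing"
    fi
}
```

When an image is reported as missing, running the crictl pull subcommand with the same endpoint and image name on the node surfaces the registry or authentication error directly from the runtime.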
Permission Denied Inside Container
Symptom
Possible Cause
A user namespace mapping failure prevents the container user from accessing mounted host files.
Solutions
- Disable the user namespace with --userns=false:
- If the kernel version is too low, or the mounted directory's filesystem does not support ID-Mapped Mounts:
  - Ask the administrator to upgrade the kernel.
  - Ensure the mounted directory is on a filesystem that supports ID-Mapped Mounts, or ask the administrator to configure the BindFs solution.
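The kernel part of this check can be sketched as follows. It relies on ID-mapped mounts having been introduced upstream in Linux 5.12; support for individual filesystems arrived in later releases, so a passing check is necessary but not sufficient. Distribution kernels with backports may also behave differently. kernel_supports_idmapped_mounts is a hypothetical helper name.

```bash
# Succeed if the kernel release (default: output of `uname -r`) is 5.12 or newer.
# This checks only the kernel version, not the filesystem of the mounted directory.
kernel_supports_idmapped_mounts() {
    release="${1:-$(uname -r)}"
    major="${release%%.*}"        # e.g. "5" from "5.15.0-91-generic"
    rest="${release#*.}"          # e.g. "15.0-91-generic"
    minor="${rest%%[!0-9]*}"      # leading digits of the remainder, e.g. "15"
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -ge 12 ]; }
}
```

Run without arguments on the compute node to test the running kernel, or pass a release string such as 4.18.0-513.el8.x86_64 to evaluate a node remotely.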
GPU Device Unavailable
Symptom
The step requested GPU resources and the GPU works on bare metal outside the container, but GPU devices are not accessible inside the container.
Possible Cause
The administrator has not correctly configured the container runtime to enable GPU support.
Solution
Refer to Container Deployment and contact the GPU vendor for the correct configuration method (e.g., NVIDIA Container Toolkit, Ascend Docker Toolkit).
Port Binding Failed
Symptom
Cause
The specified host port is already in use.
Solution
Use a different port:
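Before choosing a replacement port, it can help to confirm which ports are already taken on the node. The sketch below filters the output of ss -ltn (listening TCP sockets, numeric ports) and assumes ss from iproute2 is installed; port_listed is a hypothetical helper name.

```bash
# Read `ss -ltn` output on stdin; succeed if PORT appears as a local listening port.
port_listed() {
    awk -v port="$1" '
        NR > 1 {
            # Column 4 is the local address, e.g. 0.0.0.0:8080 or [::]:8080.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, ss -ltn | port_listed 8080 && echo "8080 is already in use" reports a conflict on the node where it runs. Run it on the node that hosts the container, since port usage differs per node.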
Error Code Reference
Container operations may return these error codes. See Error Code Reference for the complete list.
| Error Code | Description | Possible Causes | Solution |
|---|---|---|---|
| ERR_CRI_GENERIC | CRI runtime error | Internal error from CRI runtime (containerd/CRI-O) | Check container runtime logs, verify image exists and container config is correct |
| ERR_CRI_DISABLED | Container support disabled | Cluster has not configured container support | Contact administrator to enable container support (see Container Deployment) |
| ERR_CRI_CONTAINER_NOT_READY | Container not ready | Job is pending or container has not finished starting | Wait for job to enter Running state, check container status with ccon ps |
| ERR_CRI_MULTIPLE_NODES | Multi-node operation unsupported | Attempted unsupported container operation on multi-node step | Some container operations only support single-node steps |
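For wrapper scripts, the table can be turned into a lookup. This is only a sketch: error_hint is a hypothetical helper, and how the error code is extracted from command output is left to the caller because the exact output format is not specified here.

```bash
# Map a container error code to the suggested action from the error code table.
error_hint() {
    case "$1" in
        ERR_CRI_GENERIC)
            echo "Check container runtime logs; verify the image exists and the container config is correct." ;;
        ERR_CRI_DISABLED)
            echo "Contact the administrator to enable container support." ;;
        ERR_CRI_CONTAINER_NOT_READY)
            echo "Wait for the job to enter Running state; check container status with ccon ps." ;;
        ERR_CRI_MULTIPLE_NODES)
            echo "Run the operation on a single-node step; some container operations do not support multi-node steps." ;;
        *)
            echo "Unknown code; see Error Code Reference."
            return 1 ;;
    esac
}
```

Unknown codes return a non-zero status so a calling script can fall back to printing the raw error.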
Getting Help
If the above solutions don't resolve your issue:
- Collect diagnostic information:
- View detailed errors:
- Check system status:
- Contact administrator with:
  - Command used
  - Job ID
  - Complete error output
  - Relevant Supervisor log excerpts
See Also
- Container Deployment - Administrator configuration guide
- Core Concepts - Understanding Pods and container steps
- ccon Command Manual - Complete command reference
- Error Code Reference - System error codes