CraneSched¶
A next-generation open-source compute scheduler for unified HPC and AI workloads.
CraneSched is a distributed job scheduling system jointly developed by Peking University, the Changsha Institute of Computing and Digital Economy of Peking University, and Changsha Jianshan Tatu Technology Co., Ltd. Targeting HPC and AI workloads, it unifies supercomputing and AI computing resources, breaks down traditional compute barriers, and delivers efficient, stable, and reliable computing services across education, industry, meteorology, defense, and more.
Why CraneSched?¶
- **Industry-Leading Performance**
  Scheduling throughput 5–20x that of Slurm: real-time scheduling of over 10,000 jobs per second, 2,000,000+ concurrent jobs, job dispatch latency under 10 ms, and support for clusters of 100,000+ nodes.
- **Deep HPC+AI Convergence**
  One cluster handles both HPC and AI workloads, covering HTC, HPC, and AI computing scenarios. Compute, storage, and data resources are pooled for efficient sharing, eliminating resource silos.
- **Unified Heterogeneous Hardware Management**
  Supports x86, ARM, and RISC-V architectures; adapts to Intel, AMD, Phytium, and Kunpeng CPUs; supports Nvidia and AMD GPUs as well as Huawei Ascend, Cambricon, and Kunlunxin accelerators.
- **Intelligent Algorithms for Efficiency and Energy Savings**
  The in-house ORA job runtime prediction algorithm (published at ICS, a CCF-B conference) improves prediction accuracy by 41%; the in-house TSMF algorithm significantly improves resource utilization; the in-house EcoSched algorithm reduces cluster energy consumption by 78.64% under low load.
- **Full Slurm/LSF Command Compatibility**
  The in-house Slurm and LSF wrapper enables zero-cost, transparent migration: users need not modify any scripts or workflows.
- **CraneSched + SCOW Integrated Solution**
  Deeply integrated with the SCOW computing platform, providing closed-loop lifecycle management spanning resource management, job scheduling, monitoring, and billing, for an all-in-one "HPC·AI·Quantum·Cloud" computing service.
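The transparent migration described above means an existing Slurm batch script should submit without modification. As an illustrative sketch (the script name, resource sizes, and application binary are hypothetical, not taken from any real deployment):

```shell
#!/bin/bash
# An unmodified Slurm batch script (hypothetical example job).
# With the Slurm wrapper installed, this submits to CraneSched as-is.
#SBATCH --job-name=wrf-demo
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00

srun ./wrf.exe
```

With the wrapper on the user's PATH, `sbatch` accepts this script exactly as Slurm would; no `#SBATCH` directives need to be rewritten.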
Solutions¶
- **Storage·Compute·Usage Convergence for HPC+AI**
  Traditional approaches deploy HPC and AI clusters independently, making compute, storage, and data sharing difficult and creating resource silos. CraneSched's HPC+AI convergence solution delivers:
    - Compute convergence: one cluster handles both HPC and AI workloads
    - Storage convergence: unified storage with pooled data resources
    - Usage convergence: a unified platform that simplifies operations, with unified user authentication and resource management
- **CraneSched + SCOW Integrated Computing Solution**
  CraneSched is deeply integrated with SCOW (Super Computing On Web) to form a complete computing-center solution:
    - Operations management: billing, user management, account management, identity authentication, permission management
    - Resource usage: online job submission, shell platform, visual desktop, cross-cluster file transfer
    - Resource management: resource virtualization, resource authorization, resource configuration
Feature Completeness¶
CraneSched matches Slurm feature-for-feature in scheduling capability, and surpasses it in several key areas.
| Feature | Description |
|---|---|
| Backfill Scheduling | Run short jobs in idle time windows to improve utilization |
| Fair-Share Scheduling | Fair scheduling policy based on historical usage |
| Preemption | High-priority jobs preempt resources from lower-priority ones |
| Reservation | Reserve resource time windows for specific users or jobs |
| Power-Saving Scheduling | Automatically shut down idle nodes under low load |
| TRES Fine-Grained Tracking | Trackable resource types (CPU, memory, GPU, etc.) |
| Job Dependencies | Control dependency relationships between jobs |
| Job Arrays | Batch submission of parameterized jobs |
| QOS Management | Differentiated service-level control |
| Native Container Orchestration (CRI/CNI) | Native container orchestration based on the CRI/CNI standards |
| Multi-Tenant Container Network Isolation | CNI-based multi-tenant network isolation (Calico Underlay) |
| Container RDMA Network Support | Supports SR-IOV shared RNICs and direct passthrough |
| Extended Hardware Compatibility | Supports diverse CPU architectures (x86, ARM, RISC-V) and accelerators from multiple vendors including Nvidia, AMD, Huawei Ascend, and more |
| HPC+AI Converged Scheduling | One cluster handles both HPC and AI workloads |
| AI Job Runtime Prediction | LLM-based job runtime prediction with 41% accuracy improvement |
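The job-array and dependency features above can be sketched from the command line. This is a hypothetical example: the flag spellings assume the Slurm-style interface implied by the compatibility layer, and the exact output format of `cbatch` (and hence the job-ID capture) is an assumption, as are the script names.

```shell
# Submit a 10-task parameter sweep as a job array, then a collection job
# that runs only if every array task succeeds.
# NOTE: flag names follow Slurm conventions; the job-ID parsing below
# assumes cbatch prints the new job's numeric ID, which is not guaranteed.
jobid=$(cbatch --array=1-10 sweep.sh | grep -o '[0-9]\+')
cbatch --dependency=afterok:${jobid} collect.sh
```

The `afterok` dependency keeps `collect.sh` pending until all array tasks exit successfully, so the post-processing step never sees partial results.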
Use Cases¶
CraneSched is suitable for a wide variety of computing scenarios:
- **Traditional HPC**
  Aerodynamics simulation, atmospheric modeling, high-energy physics research, and more. Supports mainstream applications including WRF, OpenFOAM, CMAQ, CESM, ABAQUS, and GROMACS.
- **AI Computing**
  Efficient training and inference for large models including DeepSeek, Qwen, Llama, CPMBee, and ChatGLM. Supports multiple container environments, including Docker and Singularity.
- **Chip Design**
  Supports EDA chip design and other high-throughput computing workloads with extremely demanding scheduling requirements. Deeply adapted for mainstream EDA tools from Cadence, Synopsys, and others.
- **Scientific Research**
  Big data analytics, biopharmaceutical design, battery-materials research, medical large models, and other research scenarios.
Deployment Cases¶
CraneSched is deployed in production at more than 10 computing centers across 8 provinces and municipalities nationwide.
- **Peking University Weiming Teaching Cluster No.2**
  Launched in June 2024. Supports real-time online teaching and research for faculty and students, over 300 credit-hours of online instruction, and hundreds of user software packages. Transparently migrated from Slurm to CraneSched with no user disruption, and has run stably ever since.
- **Peking University Weiming Excellence No.1 Cluster**
  Launched in November 2024. Built entirely on domestic Huawei Ascend and Kunpeng hardware, it is the first university-level cluster in China to adopt an all-domestic HPC+AI converged solution. Runs large-model training and inference for DeepSeek, Qwen, Llama, and more.
Additional deployments: Institute of Software, Chinese Academy of Sciences; Tianjin University; Beijing Union University; Nanjing University of Aeronautics and Astronautics; Guizhou University of Finance and Economics; Ocean University of China; and others.
Awards: Selected for MIIT "Typical Application Cases" and "Key Recommended Application Cases" in 2024; selected for the "2024 Education Information Technology Application Innovation Outstanding Case Collection."
Technical Highlights¶
- **High Performance**
  Over 100,000 scheduling decisions per second with fast job–resource matching.
- **Scalability**
  Proven design for million-core clusters and large-scale deployments.
- **Usability**
  Clean, consistent CLI for users and admins (`cbatch`, `cqueue`, `crun`, `calloc`, `cinfo`, etc.).
- **Security**
  Built-in RBAC and encrypted communication; fully open source and self-controlled; compliant with domestic technology security standards.
- **Resilience**
  Automatic job recovery, no single point of failure, fast state restoration. Distributed fault-tolerant design for stable and reliable operation.
- **Open Source**
  Licensed under AGPLv3; community-driven and extensible with a pluggable architecture.
CLI Reference¶
- User commands: `cbatch`, `cqueue`, `crun`, `calloc`, `cinfo`
- Admin commands: `cacct`, `cacctmgr`, `ceff`, `ccontrol`, `ccancel`
- Container commands: `ccon`
- Exit codes: reference
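A typical session with these commands might look as follows. This is an illustrative sketch only: option spellings assume the Slurm-style flags implied by the compatibility layer, and the job ID is a placeholder.

```shell
# Illustrative CraneSched session (flags assume Slurm-style conventions).
cinfo              # check partition and node status
cbatch job.sh      # submit a batch job script
cqueue -u $USER    # list your pending and running jobs
ccancel 1234       # cancel a job by ID (1234 is a placeholder)
```

The same cycle maps one-to-one onto `sinfo`/`sbatch`/`squeue`/`scancel` for users arriving from Slurm.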
Links¶
- Demo cluster: https://hpc.pku.edu.cn/demo/cranesched
- Backend repository: https://github.com/PKUHPC/CraneSched
- Frontend repository: https://github.com/PKUHPC/CraneSched-FrontEnd
- SCOW computing platform: https://github.com/PKUHPC/OPENSCOW
License¶
CraneSched is dual-licensed under AGPLv3 and a commercial license. See LICENSE or contact mayinping@pku.edu.cn for commercial licensing.