CraneSched¶
A next-generation open-source compute scheduler for unified HPC and AI workloads.
CraneSched is a distributed job scheduling system jointly developed by Peking University, the Changsha Institute of Computing and Digital Economy of Peking University, and Changsha Jianshan Tatu Technology Co., Ltd. Targeting HPC and AI workloads, it unifies supercomputing and AI computing resources, breaks down traditional compute barriers, and delivers efficient, stable, and reliable computing services across education, industry, meteorology, defense, and more.
Why CraneSched?¶
- **Industry-Leading Performance**
  Scheduling throughput 5–20x that of Slurm: real-time scheduling of over 10,000 jobs per second, 2,000,000+ concurrent jobs, job dispatch latency under 10 ms, and support for clusters of 100,000+ nodes.
- **Deep HPC+AI Convergence**
  One cluster handles both HPC and AI workloads, covering HTC, HPC, and AI computing scenarios. Compute, storage, and data resources are pooled for efficient sharing, eliminating resource silos.
- **Unified Heterogeneous Hardware Management**
  Supports x86, ARM, and RISC-V architectures; adapts to Intel, AMD, Phytium, and Kunpeng CPUs; supports Nvidia and AMD GPUs as well as Huawei Ascend, Cambricon, and Kunlunxin accelerators.
- **Intelligent Algorithms for Efficiency and Energy Savings**
  The in-house ORA job runtime prediction algorithm (published at ICS, a CCF-B conference) improves prediction accuracy by 41%; the in-house TSMF algorithm significantly improves resource utilization; the in-house EcoSched algorithm reduces cluster energy consumption by 78.64% under low load.
- **Full Slurm/LSF Command Compatibility**
  The in-house Slurm and LSF wrapper enables zero-cost, transparent migration: users need not modify any scripts or workflows.
- **CraneSched + SCOW Integrated Solution**
  Deeply integrated with the SCOW computing platform, providing closed-loop lifecycle management spanning resource management, job scheduling, monitoring, and billing, for an all-in-one "HPC·AI·Quantum·Cloud" computing service.
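The transparent migration described above means an existing Slurm batch script should submit without modification. As an illustrative sketch (the script name, resource sizes, and application binary are hypothetical, not taken from any real deployment):

```shell
#!/bin/bash
# An unmodified Slurm batch script (hypothetical example job).
# With the Slurm wrapper installed, this submits to CraneSched as-is.
#SBATCH --job-name=wrf-demo
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00

srun ./wrf.exe
```

With the wrapper on the user's PATH, `sbatch` accepts this script exactly as Slurm would; no `#SBATCH` directives need to be rewritten.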
Solutions¶
- **Storage·Compute·Usage Convergence for HPC+AI**
  Traditional approaches deploy HPC and AI clusters independently, making compute, storage, and data sharing difficult and creating resource silos. CraneSched's HPC+AI convergence solution delivers:
    - Compute convergence: one cluster handles both HPC and AI workloads
    - Storage convergence: unified storage with pooled data resources
    - Usage convergence: a unified platform that simplifies operations, with unified user authentication and resource management
- **CraneSched + SCOW Integrated Computing Solution**
  CraneSched is deeply integrated with SCOW (Super Computing On Web) to form a complete computing-center solution:
    - Operations management: billing, user management, account management, identity authentication, permission management
    - Resource usage: online job submission, shell platform, visual desktop, cross-cluster file transfer
    - Resource management: resource virtualization, resource authorization, resource configuration
Feature Completeness¶
CraneSched matches Slurm feature-for-feature in scheduling capability, and surpasses it in several key areas.
| Feature | Description |
|---|---|
| Backfill Scheduling | Run short jobs in idle time windows to improve utilization |
| Fair-Share Scheduling | Fair scheduling policy based on historical usage |
| Preemption | High-priority jobs preempt resources from lower-priority ones |
| Reservation | Reserve resource time windows for specific users or jobs |
| Power-Saving Scheduling | Automatically shut down idle nodes under low load |
| TRES Fine-Grained Tracking | Trackable resource types (CPU, memory, GPU, etc.) |
| Job Dependencies | Control dependency relationships between jobs |
| Job Arrays | Batch submission of parameterized jobs |
| QOS Management | Differentiated service-level control |
| Native Container Orchestration (CRI/CNI) | Native container orchestration based on the CRI/CNI standards |
| Multi-Tenant Container Network Isolation | CNI-based multi-tenant network isolation (Calico Underlay) |
| Container RDMA Network Support | Supports SR-IOV shared RNICs and direct passthrough |
| Extended Hardware Compatibility | Supports diverse CPU architectures (x86, ARM, RISC-V) and accelerators from multiple vendors including Nvidia, AMD, Huawei Ascend, and more |
| HPC+AI Converged Scheduling | One cluster handles both HPC and AI workloads |
| AI Job Runtime Prediction | LLM-based job runtime prediction with 41% accuracy improvement |
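The job-array and dependency features above can be sketched from the command line. This is a hypothetical example: the flag spellings assume the Slurm-style interface implied by the compatibility layer, and the exact output format of `cbatch` (and hence the job-ID capture) is an assumption, as are the script names.

```shell
# Submit a 10-task parameter sweep as a job array, then a collection job
# that runs only if every array task succeeds.
# NOTE: flag names follow Slurm conventions; the job-ID parsing below
# assumes cbatch prints the new job's numeric ID, which is not guaranteed.
jobid=$(cbatch --array=1-10 sweep.sh | grep -o '[0-9]\+')
cbatch --dependency=afterok:${jobid} collect.sh
```

The `afterok` dependency keeps `collect.sh` pending until all array tasks exit successfully, so the post-processing step never sees partial results.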
Use Cases¶
CraneSched is suitable for a wide variety of computing scenarios:
- **Traditional HPC**
  Aerodynamics simulation, atmospheric modeling, high-energy physics research, and more. Supports mainstream applications including WRF, OpenFOAM, CMAQ, CESM, ABAQUS, and GROMACS.
- **AI Computing**
  Efficient training and inference for large models including DeepSeek, Qwen, Llama, CPMBee, and ChatGLM. Supports multiple container environments, including Docker and Singularity.
- **Chip Design**
  Supports EDA chip design and other high-throughput computing workloads with extremely demanding scheduling requirements. Deeply adapted for mainstream EDA tools from Cadence, Synopsys, and others.
- **Scientific Research**
  Big data analytics, biopharmaceutical design, battery-materials research, medical large models, and other research scenarios.
Deployment Cases¶
CraneSched is deployed in production at more than 10 computing centers across 8 provinces and municipalities nationwide.
- **Peking University Weiming Teaching Cluster No.2**
  Launched in June 2024. Supports real-time online teaching and research for faculty and students, over 300 credit-hours of online instruction, and hundreds of user software packages. Transparently migrated from Slurm to CraneSched with no user disruption, and has run stably ever since.
- **Peking University Weiming Excellence No.1 Cluster**
  Launched in November 2024. Built entirely on domestic Huawei Ascend and Kunpeng hardware, it is the first university-level cluster in China to adopt an all-domestic HPC+AI converged solution. Runs large-model training and inference for DeepSeek, Qwen, Llama, and more.
Additional deployments: Institute of Software, Chinese Academy of Sciences; Tianjin University; Beijing Union University; Nanjing University of Aeronautics and Astronautics; Guizhou University of Finance and Economics; Ocean University of China; and others.
Awards: Selected for MIIT "Typical Application Cases" and "Key Recommended Application Cases" in 2024; selected for the "2024 Education Information Technology Application Innovation Outstanding Case Collection."
Technical Highlights¶
- **High Performance**
  Over 100,000 scheduling decisions per second with fast job–resource matching.
- **Scalability**
  Proven design for million-core clusters and large-scale deployments.
- **Usability**
  Clean, consistent CLI for users and admins (`cbatch`, `cqueue`, `crun`, `calloc`, `cinfo`, etc.).
- **Security**
  Built-in RBAC and encrypted communication; fully open source and self-controlled; compliant with domestic technology security standards.
- **Resilience**
  Automatic job recovery, no single point of failure, fast state restoration. Distributed fault-tolerant design for stable and reliable operation.
- **Open Source**
  Licensed under AGPLv3; community-driven and extensible with a pluggable architecture.
CLI Reference¶
- User commands: `cbatch`, `cqueue`, `crun`, `calloc`, `cinfo`
- Admin commands: `cacct`, `cacctmgr`, `ceff`, `ccontrol`, `ccancel`
- Container commands: `ccon`
- Exit codes: reference
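A typical session with these commands might look as follows. This is an illustrative sketch only: option spellings assume the Slurm-style flags implied by the compatibility layer, and the job ID is a placeholder.

```shell
# Illustrative CraneSched session (flags assume Slurm-style conventions).
cinfo              # check partition and node status
cbatch job.sh      # submit a batch job script
cqueue -u $USER    # list your pending and running jobs
ccancel 1234       # cancel a job by ID (1234 is a placeholder)
```

The same cycle maps one-to-one onto `sinfo`/`sbatch`/`squeue`/`scancel` for users arriving from Slurm.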
Links¶
- Demo cluster: https://hpc.pku.edu.cn/demo/cranesched
- Backend repository: https://github.com/PKUHPC/CraneSched
- Frontend repository: https://github.com/PKUHPC/CraneSched-FrontEnd
- SCOW computing platform: https://github.com/PKUHPC/OPENSCOW
License¶
CraneSched is dual-licensed under AGPLv3 and a commercial license. See LICENSE or contact mayinping@pku.edu.cn for commercial licensing.