
CraneSched

A next-generation open-source compute scheduler for unified HPC and AI workloads.

CraneSched is a distributed job scheduling system jointly developed by Peking University, the Changsha Institute of Computing and Digital Economy of Peking University, and Changsha Jianshan Tatu Technology Co., Ltd. Targeting HPC and AI workloads, it unifies supercomputing and AI computing resources, breaks down traditional compute barriers, and delivers efficient, stable, and reliable computing services across education, industry, meteorology, defense, and more.



Why CraneSched?

  • Industry-Leading Performance


    Scheduling throughput 5–20x that of Slurm: real-time scheduling of over 10,000 jobs per second, 2,000,000+ concurrent jobs, and job dispatch latency under 10 ms. Scales to clusters of 100,000+ nodes.

    View performance comparison

  • Deep HPC+AI Convergence


    One cluster handles both HPC and AI workloads, covering all HTC+HPC+AI computing scenarios. Compute, storage, and data resources are pooled for efficient sharing — no more resource silos.

    Learn about the convergence solution

  • Unified Heterogeneous Hardware Management


    Supports X86, ARM, and RISC-V architectures; adapts to Intel, AMD, Phytium, and Kunpeng CPUs; supports Nvidia and AMD GPUs as well as Huawei Ascend, Cambricon, and Kunlunxin accelerators.

    View compatibility list

  • Intelligent Algorithms for Efficiency and Energy Savings


    The in-house ORA job-runtime prediction algorithm (published at ICS, a CCF-B conference) improves prediction accuracy by 41%; the in-house TSMF algorithm significantly improves resource utilization; and the in-house EcoSched algorithm reduces cluster energy consumption by 78.64% under low load.

    Learn about scheduling algorithms

  • Full Slurm/LSF Command Compatibility


    In-house Slurm & LSF Wrapper enables zero-cost, transparent migration — users need not modify any scripts or workflows.

    View compatibility

  • CraneSched + SCOW Integrated Solution


    Deeply integrated with the SCOW computing platform, providing closed-loop lifecycle management spanning resource management, job scheduling, monitoring, and billing — an all-in-one "HPC·AI·Quantum·Cloud" computing service.

    Learn about the integrated solution


Solutions

  • Storage·Compute·Usage Convergence for HPC+AI


    Traditional approaches deploy HPC and AI clusters independently, making compute, storage, and data sharing difficult and creating resource silos. CraneSched's HPC+AI convergence solution delivers:

    • Compute convergence: one cluster handles both HPC and AI workloads
    • Storage convergence: unified storage with pooled data resources
    • Usage convergence: unified platform simplifies operations, unified user authentication and resource management

    Learn more

  • CraneSched + SCOW Integrated Computing Solution


    CraneSched is deeply integrated with SCOW (Super Computing On Web), forming a complete computing center solution:

    • Operations management: billing, user management, account management, identity authentication, permission management
    • Resource usage: online job submission, shell platform, visual desktop, cross-cluster file transfer
    • Resource management: resource virtualization, resource authorization, resource configuration

    Learn more


Feature Completeness

CraneSched matches Slurm feature-for-feature in core scheduling capabilities, and surpasses it in several key areas.

Feature comparison (CraneSched vs. Slurm vs. LSF):

  • Backfill Scheduling: Run short jobs in idle time windows to improve utilization
  • Fair-Share Scheduling: Fair scheduling policy based on historical usage
  • Preemption: High-priority jobs preempt resources from lower-priority ones
  • Reservation: Reserve resource time windows for specific users or jobs
  • Power Saving Scheduling: Automatically shut down idle nodes under low load
  • TRES Fine-Grained Tracking: Trackable resource types (CPU, memory, GPU, etc.)
  • Job Dependencies: Control dependency relationships between jobs
  • Job Arrays: Batch submission of parameterized jobs
  • QOS Management: Differentiated service level control
  • Native Container Orchestration (CRI/CNI): Native container orchestration based on CRI/CNI standards
  • Multi-Tenant Container Network Isolation: CNI-based multi-tenant network isolation (Calico Underlay)
  • Container RDMA Network Support: Supports SR-IOV shared RNIC and direct passthrough
  • Extended Hardware Compatibility: Supports diverse CPU architectures (x86, ARM, RISC-V) and accelerators from multiple vendors including Nvidia, AMD, Huawei Ascend, and more
  • HPC+AI Converged Scheduling: One cluster handles both HPC and AI workloads
  • AI Job Runtime Prediction: LLM-based job runtime prediction with 41% accuracy improvement

View full feature comparison
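Backfill, the first feature in the comparison above, is easy to state: a waiting job may jump the queue only if it fits on currently free resources and would finish before the blocked head-of-queue job is reserved to start. The sketch below is a minimal illustration of that rule, not CraneSched's actual implementation; all names and numbers are invented for the example.

```python
# Minimal EASY-backfill sketch: jobs jump the queue only if they would
# finish before the reserved start time of the blocked head job.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # estimated runtime, arbitrary time units

def backfill(queue, free_cores, head_start_time, now=0):
    """Return names of jobs that may start now without delaying the
    head job, which holds a reservation beginning at head_start_time."""
    started = []
    for job in queue:
        fits = job.cores <= free_cores
        ends_in_time = now + job.runtime <= head_start_time
        if fits and ends_in_time:
            started.append(job.name)
            free_cores -= job.cores  # resources are consumed immediately
    return started

# Head job needs more cores than are free; its reservation starts at t=10.
waiting = [Job("short-a", 2, 5), Job("long-b", 2, 50), Job("short-c", 4, 8)]
print(backfill(waiting, free_cores=6, head_start_time=10))  # ['short-a', 'short-c']
```

Here "long-b" fits on the free cores but is skipped because its 50-unit runtime would overrun the head job's reservation; the two short jobs slide into the idle window instead.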


Use Cases

CraneSched is suitable for a wide variety of computing scenarios:

  • Traditional HPC


    Aerodynamics simulation, atmospheric modeling, high-energy physics research, and more. Supports mainstream applications including WRF, OpenFOAM, CMAQ, CESM, ABAQUS, and GROMACS.

  • AI Computing


    Efficient training and inference for large models including DeepSeek, Qwen, Llama, CPMBee, and ChatGLM. Supports multiple container environments including Docker and Singularity.

  • Chip Design


    Supports EDA chip design and other high-throughput computing workloads with extremely demanding scheduling requirements. Deep adaptation for mainstream EDA tools from Cadence, Synopsys, and others.

  • Scientific Research


    Big data analytics, biopharmaceutical design, battery material research, medical large models, and other research scenarios.


Deployment Cases

CraneSched is deployed in production at 10+ computing centers across 8 provinces and municipalities nationwide.

  • Peking University Weiming Teaching Cluster No.2


    Launched in June 2024. Supports real-time online teaching and research for faculty and students, over 300 credit-hours of online teaching, and compatibility with hundreds of user software packages. Transparently migrated from Slurm to CraneSched with no user disruption; stable operation ever since.

  • Peking University Weiming Excellence No.1 Cluster


    Launched in November 2024. Fully domestic Huawei Ascend and Kunpeng architecture — the first university-level cluster in China to adopt an all-domestic HPC+AI converged solution. Runs large model training and inference tasks for DeepSeek, Qwen, Llama, and more.

Additional deployments: Institute of Software, Chinese Academy of Sciences; Tianjin University; Beijing Union University; Nanjing University of Aeronautics and Astronautics; Guizhou University of Finance and Economics; Ocean University of China; and others.

Awards: Selected for MIIT "Typical Application Cases" and "Key Recommended Application Cases" in 2024; selected for the "2024 Education Information Technology Application Innovation Outstanding Case Collection."


Technical Highlights

  • High Performance


    Over 100,000 scheduling decisions per second with fast job–resource matching.

  • Scalability


    Proven design for million-core clusters and large-scale deployments.

  • Usability


    Clean, consistent CLI for users and admins (cbatch, cqueue, crun, calloc, cinfo, etc.).

  • Security


    Built-in RBAC and encrypted communication; fully open source and independently controllable; compliant with domestic technology security standards.

  • Resilience


    Automatic job recovery, no single point of failure, fast state restoration. Distributed fault-tolerant design for stable and reliable operation.

  • Open Source


    Licensed under AGPLv3; community-driven and extensible with a pluggable architecture.


CLI Reference
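Given the Slurm-style commands listed above (cbatch, cqueue, crun, calloc, cinfo), a typical submission looks like the batch script below. This is an illustrative sketch only: the `#CBATCH` directive names and flags are assumed to mirror Slurm's `#SBATCH` conventions per CraneSched's compatibility claims, and may differ by version.

```shell
#!/bin/bash
# Illustrative CraneSched batch script. Directive names are assumed
# Slurm-style; consult the CLI documentation for your installed version.
#CBATCH --nodes=1
#CBATCH --cpus-per-task=4
#CBATCH --time=0:10:0
#CBATCH --output=hello.out

echo "Hello from $(hostname)"
```

The script would be submitted with `cbatch job.sh`; `cqueue` then lists pending and running jobs and `cinfo` shows node state, mirroring Slurm's squeue and sinfo.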



License

CraneSched is dual-licensed under AGPLv3 and a commercial license. See LICENSE or contact mayinping@pku.edu.cn for commercial licensing.