Course Outline

Introduction to EXO and Local AI Clustering

  • Overview of the EXO framework and the exo-explore ecosystem
  • Comparing centralized cloud inference vs distributed local inference
  • Architecture: libp2p device discovery, MLX backend, dashboard, and API layers
  • Hardware requirements: Apple Silicon (M3 Ultra, M4 Pro/Max), Thunderbolt 5, shared storage

Installing EXO on macOS

  • Setting up Xcode, the Metal Toolchain, and macOS prerequisites
  • Installing uv, Node.js, Rust nightly toolchain
  • Installing the pinned macmon fork for Apple Silicon monitoring
  • Cloning the repository and building the dashboard with npm
  • Running EXO from source and verifying the localhost:52415 dashboard
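
The verification step above can be scripted. The following is a minimal sketch that only checks whether something accepts TCP connections on the dashboard port named in the outline (localhost:52415). It does not confirm that the listener is actually EXO, and the function name `dashboard_reachable` is a hypothetical helper, not part of EXO.

```python
import socket


def dashboard_reachable(host: str = "localhost", port: int = 52415,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    print("dashboard reachable:", dashboard_reachable())
```

A successful result only means the port is open; opening the dashboard in a browser remains the real check.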

Installing EXO on Linux

  • Installing dependencies via apt or Homebrew on Linux
  • Configuring uv, Node.js 18+, and Rust nightly
  • Building the dashboard and running EXO in CPU-only mode
  • Directory layout: XDG Base Directory paths for config, data, cache, and logs
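
To make the directory-layout topic concrete, the sketch below resolves the standard XDG Base Directory locations for an application. The environment variables and their defaults come from the XDG specification. The subdirectory name `exo` and the idea that logs sit under the state or cache directory are assumptions for illustration, since the XDG specification defines no dedicated log directory.

```python
import os
from pathlib import Path

# XDG variable name and its spec-defined default, relative to the home directory
_XDG_DEFAULTS = {
    "config": ("XDG_CONFIG_HOME", ".config"),
    "data": ("XDG_DATA_HOME", ".local/share"),
    "cache": ("XDG_CACHE_HOME", ".cache"),
    "state": ("XDG_STATE_HOME", ".local/state"),
}


def xdg_dirs(app: str = "exo", env=None, home=None) -> dict:
    """Resolve per-application XDG directories (app name is an assumption)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    dirs = {}
    for kind, (var, default) in _XDG_DEFAULTS.items():
        value = env.get(var, "")
        # Per the spec, unset, empty, or relative values fall back to the default
        if value and Path(value).is_absolute():
            base = Path(value)
        else:
            base = home / default
        dirs[kind] = base / app
    return dirs


if __name__ == "__main__":
    for kind, path in xdg_dirs().items():
        print(f"{kind:>6}: {path}")
```

Comparing this output with the directories EXO actually creates on a Linux node is a quick way to see which paths it uses.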

Automatic Device Discovery and Cluster Formation

  • Understanding libp2p-based auto-discovery across local networks
  • Configuring custom namespaces with EXO_LIBP2P_NAMESPACE for cluster isolation
  • Verifying node membership in the dashboard cluster view
  • Handling discovery failures and network segmentation issues
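
Cluster isolation with `EXO_LIBP2P_NAMESPACE` comes down to starting every node of one cluster with the same value. The sketch below builds such an environment without launching anything. The helper name `env_with_namespace` and the example namespace `lab-a` are hypothetical, and the launch command in the comment is an assumption about a from-source setup.

```python
import os


def env_with_namespace(namespace: str, base_env=None) -> dict:
    """Return a copy of the environment with EXO_LIBP2P_NAMESPACE set."""
    if not namespace:
        raise ValueError("namespace must be a non-empty string")
    # Copy so the caller's environment is never mutated
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = namespace
    return env


if __name__ == "__main__":
    env = env_with_namespace("lab-a")
    print(env["EXO_LIBP2P_NAMESPACE"])
    # Assumed launch command, adjust to how EXO is started on your nodes:
    # subprocess.run(["uv", "run", "exo"], env=env, check=True)
```

Nodes started with different namespace values should not appear in each other's dashboard cluster view, which makes this easy to verify in a lab.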

Enabling RDMA over Thunderbolt 5

  • RDMA architecture and the 99 percent latency reduction claim
  • Enabling RDMA in macOS Recovery mode with rdma_ctl
  • Cable requirements and port topology constraints on Mac Studio
  • Matching macOS versions across all cluster nodes
  • Troubleshooting RDMA discovery and DHCP configuration

Deploying Frontier Models

  • Using the dashboard to load and shard DeepSeek V3.1, Qwen3-235B, and Llama family models
  • Previewing instance placements with the /instance/previews API endpoint
  • Creating model instances with pipeline or tensor-parallel sharding
  • Configuring custom model cards from the Hugging Face Hub
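
As an illustration of the idea behind pipeline sharding, the sketch below assigns contiguous blocks of transformer layers to nodes in proportion to each node's memory. This is a generic teaching example and not EXO's actual placement algorithm. The function name, node names, and memory figures are all hypothetical.

```python
def split_layers(num_layers: int, memory_gb: dict[str, float]) -> dict[str, range]:
    """Assign contiguous layer ranges to nodes, proportional to memory."""
    total = sum(memory_gb.values())
    if num_layers < 0 or total <= 0:
        raise ValueError("need num_layers >= 0 and positive total memory")
    nodes = list(memory_gb.items())
    assignments = {}
    start = 0
    cumulative = 0.0
    for index, (node, mem) in enumerate(nodes):
        cumulative += mem
        if index == len(nodes) - 1:
            # The last node always takes whatever remains
            end = num_layers
        else:
            # Round the cumulative share so every layer lands on exactly one node
            end = round(num_layers * cumulative / total)
        assignments[node] = range(start, end)
        start = end
    return assignments


if __name__ == "__main__":
    plan = split_layers(61, {"studio-a": 512, "studio-b": 256})
    for node, layers in plan.items():
        print(node, f"layers {layers.start}..{layers.stop - 1}", f"({len(layers)} layers)")
```

In pipeline sharding each node holds whole layers and passes activations to the next node. Tensor-parallel sharding instead splits the weights inside each layer across nodes, which is why it depends more heavily on a low-latency interconnect.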

Monitoring and Troubleshooting

  • Reading EXO logs and understanding distributed tracing
  • Interpreting cluster health in the dashboard cluster view
  • Diagnosing worker node failures and reconnection behavior
  • Using EXO_TRACING_ENABLED for performance bottleneck analysis

Cluster Maintenance and Updates

  • Updating EXO binaries and dashboard rebuild procedures
  • Migrating model caches and managing pre-downloaded models over NFS
  • Gracefully removing nodes and rebalancing workloads

Prerequisites

  • An understanding of networking fundamentals (IP, subnetting, firewalls)
  • Experience with macOS or Linux command-line administration
  • Familiarity with Python package management (pip/uv) and Node.js tooling

Audience

  • System administrators
  • DevOps engineers
  • AI infrastructure architects responsible for on-premise LLM deployment

Duration

  • 21 hours
