Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.0.0 - 2026-03-20
Initial stable release. Both the out-of-order and in-order backends boot Linux 6.6 through OpenSBI to a BusyBox shell. All 134 riscv-tests pass on both backends.
Pipeline - Out-of-Order Backend
- 10-stage superscalar pipeline: Fetch1, Fetch2/Decode, Rename, Issue, Execute, Mem1, Mem2, Writeback, Commit
- Physical register file (PRF) with free list, speculative and committed rename maps
- CAM-style issue queue with wakeup/select, oldest-first priority, configurable width
- Per-type functional unit port limits (load ports, store ports)
- Configurable FU pool: IntALU, IntMul, FpAdd, FpMul, FpFma, FpDiv, Branch, Mem — counts and latencies per type
- Reorder buffer (circular buffer with HashMap tag index for O(1) lookup)
- Precise exceptions via ROB in-order commit
- Load queue for memory ordering violation detection and replay
- Store buffer with store-to-load forwarding (full and partial overlap detection)
- Write-combining buffer for store coalescing
- Branch misprediction recovery: GHR repair from snapshot, RAS restore, rename map rebuild from committed state + surviving ROB entries
- Speculative load wakeup on MSHR availability
- Partial pipeline flush (flush after mispredicting instruction's ROB tag, keep older in-flight work)
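The full versus partial overlap distinction in the store-to-load forwarding entry above can be sketched as a byte-range comparison. This is a minimal illustration, not the simulator's code: the names `Overlap`, `classify_overlap`, and `forward` are hypothetical, and little-endian data layout is assumed. The changelog does not say what the backend does on a partial overlap; stalling or replaying the load is a common policy.

```python
from enum import Enum


class Overlap(Enum):
    NONE = "none"        # byte ranges are disjoint
    FULL = "full"        # the store covers every byte the load reads
    PARTIAL = "partial"  # only some of the load's bytes come from the store


def classify_overlap(store_addr: int, store_size: int,
                     load_addr: int, load_size: int) -> Overlap:
    """Classify how a buffered store overlaps a younger load (hypothetical helper)."""
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    if store_end <= load_addr or load_end <= store_addr:
        return Overlap.NONE
    if store_addr <= load_addr and load_end <= store_end:
        return Overlap.FULL
    return Overlap.PARTIAL


def forward(store_addr: int, store_size: int, store_data: int,
            load_addr: int, load_size: int) -> int:
    """Extract the load's bytes from a fully covering store (little-endian)."""
    kind = classify_overlap(store_addr, store_size, load_addr, load_size)
    if kind is not Overlap.FULL:
        raise ValueError("forwarding requires a full overlap")
    shift = 8 * (load_addr - store_addr)
    mask = (1 << (8 * load_size)) - 1
    return (store_data >> shift) & mask
```

For example, an 8-byte store of `0x1122334455667788` at `0x100` fully covers a 4-byte load at `0x104`, and `forward` returns `0x11223344`.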
Pipeline - In-Order Backend
- Scalar pipeline sharing the same frontend and commit/memory/writeback stages as O3
- Scoreboard-based operand tracking with tag bypass from completed ROB entries
- FIFO issue queue with head-of-queue blocking
- Backpressure gating via execute-to-memory1 latch occupancy
- Issue-time serialization checks matching O3 behavior:
  - System/CSR instructions wait for all older instructions to complete
  - FENCE waits for older operations matching predecessor bits
  - Loads/stores blocked by older in-flight FENCE with matching successor bits
  - Loads wait for all older stores to resolve addresses
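The predecessor/successor matching in the FENCE checks above can be sketched from the instruction encoding, where `pred` occupies bits 27:24 and `succ` bits 23:20, each ordered I, O, R, W. The helper names (`fence_fields`, `op_mask`, `fence_waits_for`, `fence_blocks`) are hypothetical and only illustrate the matching rule; they are not taken from the simulator.

```python
# Bit values within the 4-bit pred/succ fields of a FENCE instruction.
FENCE_I, FENCE_O, FENCE_R, FENCE_W = 0b1000, 0b0100, 0b0010, 0b0001


def fence_fields(insn: int) -> tuple[int, int]:
    """Return (pred, succ) from a 32-bit FENCE encoding."""
    pred = (insn >> 24) & 0xF
    succ = (insn >> 20) & 0xF
    return pred, succ


def op_mask(is_load: bool, is_store: bool, is_io: bool = False) -> int:
    """Hypothetical helper: describe a memory operation as an I/O/R/W mask."""
    mask = 0
    if is_load:
        mask |= FENCE_I if is_io else FENCE_R
    if is_store:
        mask |= FENCE_O if is_io else FENCE_W
    return mask


def fence_waits_for(insn: int, older_op: int) -> bool:
    """True if the FENCE must wait for an older operation with this mask."""
    pred, _ = fence_fields(insn)
    return bool(pred & older_op)


def fence_blocks(insn: int, younger_op: int) -> bool:
    """True if a younger operation with this mask must wait behind the FENCE."""
    _, succ = fence_fields(insn)
    return bool(succ & younger_op)
```

With `fence w,r` (encoding `0x0120000F`), an older store is waited for and a younger load is blocked, while an older load or a younger store is unaffected.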
Pipeline - Shared Stages
- Commit stage: CSR write serialization, FENCE store-drain semantics, SFENCE.VMA deferred TLB flush (waits for store buffer drain), FENCE.I deferred I-cache invalidation, SATP write with store-drain and redirect, MRET/SRET privilege return, LR/SC reservation check at commit time with SC failure recovery
- Memory1: D-TLB translation, L1D tag probe, MSHR allocation for misses (loads parked, stores proceed), fault propagation
- Memory2: L1D data access, store-to-load forwarding from store buffer, NaN-boxing for FP loads, LR/SC/AMO handling, store buffer address resolution
- Writeback: result selection (load data vs jump link vs ALU), ROB completion, FP flags and PTE update propagation
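The NaN-boxing step mentioned for FP loads in Memory2 follows the RISC-V rule that a 32-bit value held in a 64-bit FP register has its upper 32 bits set to all ones, and that a value that is not properly boxed is read as the canonical single-precision NaN. A minimal sketch on raw bit patterns, with hypothetical function names:

```python
BOX_UPPER = 0xFFFF_FFFF_0000_0000   # upper 32 bits all ones
CANONICAL_NAN_F32 = 0x7FC0_0000     # canonical quiet NaN, single precision


def nan_box(bits32: int) -> int:
    """Box a 32-bit FP bit pattern into a 64-bit register value (e.g. after FLW)."""
    return BOX_UPPER | (bits32 & 0xFFFF_FFFF)


def unbox_f32(reg64: int) -> int:
    """Read a single-precision operand; improperly boxed values become canonical NaN."""
    if (reg64 & BOX_UPPER) == BOX_UPPER:
        return reg64 & 0xFFFF_FFFF
    return CANONICAL_NAN_F32
```

Boxing `0x3F800000` (1.0f) gives `0xFFFFFFFF3F800000`; reading the unboxed pattern `0x000000003F800000` as a single-precision operand yields `0x7FC00000`.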
Memory Hierarchy
- SV39 virtual memory with separate iTLB and dTLB (fully-associative, configurable size)
- Shared L2 TLB (set-associative, configurable size/ways/latency)
- Full hardware page table walker with accessed/dirty bit management
- L1-I, L1-D, L2, L3 caches — independently configurable size, associativity, latency
- Cache replacement policies: LRU, Pseudo-LRU (tree-based), FIFO, Random, MRU
- Non-blocking L1D via Miss Status Holding Registers (MSHRs) with request coalescing
- Hardware prefetchers per cache level: next-line, stride (PC-indexed), stream (sequential detection), tagged (prefetch-on-prefetch)
- Prefetch deduplication filter to avoid redundant requests
- Cache inclusion policies: non-inclusive (default), inclusive (back-invalidation on eviction), exclusive (L1-L2 swap on eviction)
- Write-combining buffer for store coalescing before L1D drain
- DRAM controller with row-buffer aware timing: tCAS, tRAS, tPRE, row-miss latency, tRRD (bank-to-bank), tREFI/tRFC (refresh), configurable bank count and row size
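For readers unfamiliar with SV39, the virtual address layout that the TLBs and page table walker above operate on can be sketched as follows: a 12-bit page offset, three 9-bit VPN fields, and the requirement that bits 63..39 equal bit 38. The function names are hypothetical and the address is assumed to be an unsigned 64-bit value; this is not the simulator's walker.

```python
PAGE_SHIFT = 12   # 4 KiB pages
VPN_BITS = 9      # 512 entries per page table level
LEVELS = 3        # SV39 uses three levels


def is_canonical_sv39(va: int) -> bool:
    """Bits 63..39 must all equal bit 38 (va is an unsigned 64-bit value)."""
    upper = va >> 38                      # bits 63..38, 26 bits wide
    return upper == 0 or upper == (1 << 26) - 1


def split_sv39(va: int) -> tuple[list[int], int]:
    """Return ([VPN[0], VPN[1], VPN[2]], page_offset) for an SV39 address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = [
        (va >> (PAGE_SHIFT + VPN_BITS * level)) & ((1 << VPN_BITS) - 1)
        for level in range(LEVELS)
    ]
    return vpn, offset
```

For `0x80201ABC` this gives VPN fields `[1, 1, 2]` and offset `0xABC`. A walker would index the root table with `VPN[2]` first and work down to `VPN[0]`.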
Branch Prediction
- Five pluggable predictors:
  - Static (always not-taken)
  - GShare (global history XOR PC, 2-bit saturating counters)
  - Tournament (local + global two-level adaptive with meta-predictor)
  - Perceptron (neural predictor with configurable history length and table size)
  - TAGE (Tagged Geometric History Length with 8 tagged tables, loop predictor, configurable history lengths and tag widths)
- Branch Target Buffer (BTB): set-associative, configurable entries and ways
- Return Address Stack (RAS): circular buffer with snapshot pointer for speculative recovery
- Global History Register (GHR): arbitrary-length bit vector with speculative update and repair from per-instruction snapshots
- RAS link register detection per RISC-V spec Table 2.1: both x1 (ra) and x5 (t0) recognized as link registers, with coroutine swap detection (pop-then-push when rd and rs1 are different link registers)
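The link-register rule in the last entry above corresponds to the return-address-stack hint table in the RISC-V unprivileged specification. A compact sketch of that table, using a hypothetical `ras_actions` helper that returns the stack operations in order:

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0)


def ras_actions(is_jalr: bool, rd: int, rs1: int = 0) -> list[str]:
    """Return the RAS operations hinted by a JAL or JALR, in execution order.

    For JAL only rd matters; rs1 is ignored.
    """
    rd_link = rd in LINK_REGS
    rs1_link = is_jalr and rs1 in LINK_REGS
    if not rd_link:
        # Plain return (pop) or an ordinary jump (no RAS activity).
        return ["pop"] if rs1_link else []
    if rs1_link and rd != rs1:
        # Coroutine swap: both are link registers and they differ.
        return ["pop", "push"]
    # Call: rd is a link register, rs1 is not one or is the same register.
    return ["push"]
```

So `jalr x0, 0(x1)` (a return) pops, `jal x1, target` pushes, and `jalr x1, 0(x5)` pops and then pushes.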
ISA
- RV64I: full base integer instruction set including W-variants (32-bit operations with sign extension)
- M extension: MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, REMU (+ W variants)
- A extension: LR/SC with forward progress guarantee, AMO operations (SWAP, ADD, AND, OR, XOR, MIN, MAX, MINU, MAXU) for word and doubleword
- F extension: single-precision IEEE 754 — arithmetic, FMA, comparisons, conversions (int-to-float, float-to-int), classification, sign injection, moves, NaN-boxing validation per spec section 12.2
- D extension: double-precision IEEE 754 with full parity to F
- C extension: compressed (16-bit) instruction encoding, expanded to 32-bit equivalents at decode time
- Privileged architecture: M/S/U privilege modes, full CSR set, trap delegation (medeleg/mideleg), MRET/SRET, WFI, ECALL/EBREAK
- SFENCE.VMA with ASID-aware and address-specific TLB invalidation (deferred to commit after store drain)
- FENCE with predecessor/successor ordering bits, FENCE.I with deferred I-cache invalidation
- Physical Memory Protection (PMP): 16 regions with TOR/NAPOT/NA4 address matching
- Counter CSRs: CYCLE, TIME, INSTRET with mcounteren/scounteren access control
- FP CSRs: frm (rounding mode), fflags (exception flags), fcsr; flags accumulated from in-flight pipeline entries for CSR reads
- mstatus privilege control bits: TSR (trap SRET), TW (timeout WFI), TVM (trap virtual memory), FS (FP state)
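As an illustration of the NAPOT address matching listed under PMP above: a `pmpaddr` value holds address bits 2 and up, and the number of trailing one bits selects the region size, with zero trailing ones meaning 8 bytes. The sketch below encodes and decodes that format; `encode_napot` and `decode_napot` are hypothetical names, not the simulator's API.

```python
def decode_napot(pmpaddr: int) -> tuple[int, int]:
    """Return (base, size) in bytes for a NAPOT-mode pmpaddr value."""
    trailing_ones = 0
    while (pmpaddr >> trailing_ones) & 1:
        trailing_ones += 1
    size = 1 << (trailing_ones + 3)                 # t trailing ones -> 2^(t+3) bytes
    base = (pmpaddr & ~((1 << (trailing_ones + 1)) - 1)) << 2
    return base, size


def encode_napot(base: int, size: int) -> int:
    """Build a NAPOT pmpaddr value for a naturally aligned power-of-two region."""
    if size < 8 or (size & (size - 1)) != 0:
        raise ValueError("size must be a power of two and at least 8 bytes")
    if base % size != 0:
        raise ValueError("base must be aligned to size")
    return (base >> 2) | ((size >> 3) - 1)
```

A 4 KiB region at `0x80000000` encodes to `0x200001FF`, and decoding that value returns `(0x80000000, 0x1000)`.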
SoC
- CLINT (Core Local Interruptor): mtime/mtimecmp timer interrupt with configurable clock divider
- PLIC (Platform-Level Interrupt Controller): 53 interrupt sources, 2 contexts (M-mode and S-mode), priority-based arbitration, claim/complete protocol
- UART: 16550A-compatible with interrupt support, configurable output (stdout, stderr, quiet)
- VirtIO MMIO block device: virtqueue-based DMA with interrupt notification, filesystem-backed storage
- Goldfish RTC: real-time clock for wall-clock time (used by Linux for system time initialization)
- SYSCON: system controller for poweroff and reboot signals
- HTIF (Host-Target Interface): tohost/fromhost protocol for riscv-tests pass/fail detection and syscall proxying
- Auto-generated Flattened Device Tree (FDT/DTB) synthesized from active configuration
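The pass/fail detection mentioned in the HTIF entry above relies on the riscv-tests convention: a passing test writes 1 to `tohost`, and a failing test writes `(test_number << 1) | 1`. A minimal decoder under that convention, with a hypothetical function name; values with the low bit clear (used for syscall proxying) are not modeled here.

```python
from typing import Optional, Tuple


def decode_tohost(value: int) -> Optional[Tuple[str, int]]:
    """Interpret a tohost write under the riscv-tests exit convention.

    Returns None when the low bit is clear, meaning the write is not an
    exit request and is outside the scope of this sketch.
    """
    if (value & 1) == 0:
        return None
    code = value >> 1
    # Code 0 means every test case passed; otherwise it is the failing test number.
    return ("pass", 0) if code == 0 else ("fail", code)
```

For example, `decode_tohost(1)` reports a pass, and `decode_tohost(15)` reports that test case 7 failed.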
Python API
- PyO3-based native extension module (rvsim._core)
- Config: composable configuration with Backend, BranchPredictor, Cache, Prefetcher, MemoryController, Fu builders
- Environment: high-level run-to-completion with stats collection
- Simulator: low-level tick-by-tick control with run_until(pc=..., privilege=...)
- Sweep: parallel multi-configuration benchmarking across CPU cores
- Register and CSR access by name (reg.A0, csr.MSTATUS)
- Memory access views (cpu.mem8, cpu.mem16, cpu.mem32, cpu.mem64)
- Pipeline snapshot visualization
- Checkpoint save/restore
- Stats querying with regex filtering and tabulation
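The idea behind regex-based stats filtering and tabulation can be shown with plain Python. This is a generic sketch, not the rvsim API: it assumes statistics are available as a flat name-to-value mapping, and the function names and the stat names in the usage note are made up for illustration.

```python
import re


def filter_stats(stats: dict[str, float], pattern: str) -> dict[str, float]:
    """Keep only the entries whose name matches the regular expression."""
    rx = re.compile(pattern)
    return {name: value for name, value in stats.items() if rx.search(name)}


def tabulate(stats: dict[str, float]) -> str:
    """Render entries as aligned 'name  value' rows, sorted by name."""
    if not stats:
        return ""
    width = max(len(name) for name in stats)
    return "\n".join(
        f"{name:<{width}}  {value}" for name, value in sorted(stats.items())
    )
```

Given a hypothetical mapping such as `{"l1d.hits": 90, "l1d.misses": 10, "l2.hits": 7}`, the call `filter_stats(sample, r"^l1d\.")` keeps only the two L1D entries, which `tabulate` then prints one per row.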
Statistics
- Cycle accounting: retiring, ROB-empty, ROB-stall, WFI, per-cycle retirement histogram (0/1/2/3+)
- Privilege mode breakdown: U/S/M mode cycles
- Pipeline stalls: memory, control, data hazard, FU structural, backpressure, MSHR-full, dispatch
- Branch prediction: committed and speculative accuracy and misprediction counts
- Pipeline flushes: total, branch-caused, system-caused, squashed instruction count, memory ordering violations
- Cache hierarchy: per-level access counts, hits, miss rates
- Memory subsystem: MSHR allocations/coalesces/full-stalls, load replays
- Inclusion tracking: back-invalidations, exclusive L1-to-L2 swaps
- Write-combining buffer: coalesces and drains
- Prefetch filter: dedup counts per cache level (L1/L2/L3)
- Instruction mix: ALU, load, store, branch, system, FP (broken down into load/store/arith/FMA/div-sqrt)
- FU utilization: per-unit-type busy cycle counts
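The per-cycle retirement histogram (0/1/2/3+) in the cycle accounting entry above amounts to bucketing the number of instructions retired each cycle. A small sketch, assuming a hypothetical trace with one retired-instruction count per cycle; the function names are illustrative only.

```python
def retirement_histogram(retired_per_cycle: list[int]) -> dict[str, int]:
    """Count cycles by instructions retired, using buckets 0, 1, 2 and 3+."""
    hist = {"0": 0, "1": 0, "2": 0, "3+": 0}
    for count in retired_per_cycle:
        bucket = "3+" if count >= 3 else str(count)
        hist[bucket] += 1
    return hist


def ipc(retired_per_cycle: list[int]) -> float:
    """Instructions per cycle over the trace; 0.0 for an empty trace."""
    cycles = len(retired_per_cycle)
    return sum(retired_per_cycle) / cycles if cycles else 0.0
```

For the trace `[0, 2, 4, 1, 0, 3]` the histogram has two cycles in the `0` bucket, one each in `1` and `2`, and two in `3+`; ten instructions retire over six cycles.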
Analysis Scripts
- width_scaling.py: IPC vs superscalar width across workloads
- branch_predict.py: predictor accuracy comparison
- cache_sweep.py: L1D size vs miss rate
- inst_mix.py: instruction class breakdown
- stall_breakdown.py: stall cycle attribution
- top_down.py: top-down microarchitecture analysis
- o3_inorder.py: O3 vs in-order IPC comparison
- design_space.py: multi-dimensional design-space sweep
Build and Test
- Rust 2024 edition with strict linting: clippy pedantic + nursery + cargo, #[deny(unwrap_used, expect_used, todo, unimplemented)]
- 1581 tests (1415 unit + 166 integration) + 8 doctests
- Zero clippy warnings across all targets
- Dev profile at opt-level 1 (usable simulation speed during development), release with fat LTO and single codegen unit
- Maturin-based Python packaging with PyO3 stable ABI (abi3-py310)