Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

1.0.0 - 2026-03-20

Initial stable release. Both the out-of-order and in-order backends boot Linux 6.6 through OpenSBI to a BusyBox shell. All 134 riscv-tests pass on both backends.

Pipeline - Out-of-Order Backend

10-stage superscalar pipeline: Fetch1, Fetch2, Decode, Rename, Issue, Execute, Mem1, Mem2, Writeback, Commit
Physical register file (PRF) with free list, speculative and committed rename maps
CAM-style issue queue with wakeup/select, oldest-first priority, configurable width
Per-type functional unit port limits (load ports, store ports)
Configurable FU pool: IntALU, IntMul, FpAdd, FpMul, FpFma, FpDiv, Branch, Mem — counts and latencies per type
Reorder buffer (circular buffer with HashMap tag index for O(1) lookup)
Precise exceptions via ROB in-order commit
Load queue for memory ordering violation detection and replay
Store buffer with store-to-load forwarding (full and partial overlap detection)
Write-combining buffer for store coalescing
Branch misprediction recovery: GHR repair from snapshot, RAS restore, rename map rebuild from committed state + surviving ROB entries
Speculative load wakeup on MSHR availability
Partial pipeline flush (flush after mispredicting instruction's ROB tag, keep older in-flight work)
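
The full-versus-partial overlap check behind store-to-load forwarding can be illustrated with a minimal sketch. This is not the simulator's code: `StoreEntry`, `classify_overlap`, and `forward` are hypothetical names, and the policy shown (forward only when one store covers every byte of the load, stall on any partial overlap) is an assumption about one common design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    """Hypothetical store-buffer entry (not the simulator's actual type)."""
    addr: int  # byte address of the store
    size: int  # access width in bytes
    data: int  # store value, little-endian byte order


def classify_overlap(store: StoreEntry, load_addr: int, load_size: int) -> str:
    """Return 'none', 'full' (store covers every load byte), or 'partial'."""
    s_lo, s_hi = store.addr, store.addr + store.size
    l_lo, l_hi = load_addr, load_addr + load_size
    if s_hi <= l_lo or l_hi <= s_lo:
        return "none"
    if s_lo <= l_lo and l_hi <= s_hi:
        return "full"
    return "partial"


def forward(stores: list[StoreEntry], load_addr: int,
            load_size: int) -> tuple[str, Optional[int]]:
    """Scan youngest-to-oldest; `stores` is ordered oldest -> youngest.

    Returns ('hit', value) on a full cover, ('stall', None) on a partial
    overlap, and ('miss', None) when no buffered store touches the load.
    """
    for store in reversed(stores):
        kind = classify_overlap(store, load_addr, load_size)
        if kind == "full":
            shift = 8 * (load_addr - store.addr)
            mask = (1 << (8 * load_size)) - 1
            return "hit", (store.data >> shift) & mask
        if kind == "partial":
            return "stall", None
    return "miss", None
```

With a single buffered 8-byte store of `0x1122334455667788` at `0x1000`, a 4-byte load at `0x1004` forwards `0x11223344`, while a 4-byte load at `0x1006` straddles the end of the store and is reported as a stall.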

Pipeline - In-Order Backend

Scalar pipeline sharing the same frontend and commit/memory/writeback stages as O3
Scoreboard-based operand tracking with tag bypass from completed ROB entries
FIFO issue queue with head-of-queue blocking
Backpressure gating via execute-to-memory1 latch occupancy
Issue-time serialization checks matching O3 behavior:
  System/CSR instructions wait for all older instructions to complete
  FENCE waits for older operations matching predecessor bits
  Loads/stores blocked by older in-flight FENCE with matching successor bits
  Loads wait for all older stores to resolve addresses
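
The predecessor/successor matching used by the FENCE checks can be sketched as follows. The field positions (pred in bits 27:24, succ in bits 23:20, each ordered I, O, R, W from high bit to low) come from the RISC-V unprivileged spec; the function names are hypothetical, and treating loads and stores as contributing only the R and W bits is a simplification that ignores device I/O.

```python
# Bit values inside each 4-bit FENCE field: I = device input, O = device
# output, R = memory read, W = memory write.
FENCE_I, FENCE_O, FENCE_R, FENCE_W = 8, 4, 2, 1


def fence_fields(insn: int) -> tuple[int, int]:
    """Extract (pred, succ) from a 32-bit FENCE encoding."""
    pred = (insn >> 24) & 0xF
    succ = (insn >> 20) & 0xF
    return pred, succ


def op_class(is_load: bool, is_store: bool) -> int:
    """Map a memory op to the R/W bits it contributes (an AMO sets both)."""
    return (FENCE_R if is_load else 0) | (FENCE_W if is_store else 0)


def fence_must_wait(pred: int, older_incomplete_ops: list[int]) -> bool:
    """A FENCE waits while any older incomplete op is in its predecessor set."""
    return any(op & pred for op in older_incomplete_ops)


def blocked_by_fence(succ: int, op: int) -> bool:
    """A younger load/store is held while an older in-flight FENCE names it
    in its successor set."""
    return bool(op & succ)
```

For example, `fence w,r` (pred = W, succ = R) has to wait for older stores but not older loads, and it holds back younger loads but not younger stores.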

Pipeline - Shared Stages

Commit stage: CSR write serialization, FENCE store-drain semantics, SFENCE.VMA deferred TLB flush (waits for store buffer drain), FENCE.I deferred I-cache invalidation, SATP write with store-drain and redirect, MRET/SRET privilege return, LR/SC reservation check at commit time with SC failure recovery
Memory1: D-TLB translation, L1D tag probe, MSHR allocation for misses (loads parked, stores proceed), fault propagation
Memory2: L1D data access, store-to-load forwarding from store buffer, NaN-boxing for FP loads, LR/SC/AMO handling, store buffer address resolution
Writeback: result selection (load data vs jump link vs ALU), ROB completion, FP flags and PTE update propagation

Memory Hierarchy

SV39 virtual memory with separate iTLB and dTLB (fully-associative, configurable size)
Shared L2 TLB (set-associative, configurable size/ways/latency)
Full hardware page table walker with accessed/dirty bit management
L1-I, L1-D, L2, L3 caches — independently configurable size, associativity, latency
Cache replacement policies: LRU, Pseudo-LRU (tree-based), FIFO, Random, MRU
Non-blocking L1D via Miss Status Holding Registers (MSHRs) with request coalescing
Hardware prefetchers per cache level: next-line, stride (PC-indexed), stream (sequential detection), tagged (prefetch-on-prefetch)
Prefetch deduplication filter to avoid redundant requests
Cache inclusion policies: non-inclusive (default), inclusive (back-invalidation on eviction), exclusive (L1-L2 swap on eviction)
Write-combining buffer for store coalescing before L1D drain
DRAM controller with row-buffer aware timing: tCAS, tRAS, tPRE, row-miss latency, tRRD (bank-to-bank), tREFI/tRFC (refresh), configurable bank count and row size
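
As a reminder of what the SV39 walker and TLBs index on, here is a minimal sketch of how a 64-bit virtual address splits into three 9-bit VPN fields and a 12-bit page offset, plus the canonical-address check that bits 63..39 equal bit 38. The field layout follows the RISC-V privileged spec; the function names are illustrative and not taken from the simulator.

```python
PAGE_SHIFT = 12   # 4 KiB pages
VPN_MASK = 0x1FF  # 9 bits per page-table level
VA_BITS = 39


def is_canonical_sv39(va: int) -> bool:
    """True when bits 63..39 all equal bit 38 (va is an unsigned 64-bit int)."""
    upper = va >> (VA_BITS - 1)                 # bits 63..38, 26 bits wide
    all_ones = (1 << (64 - VA_BITS + 1)) - 1
    return upper == 0 or upper == all_ones


def split_sv39(va: int) -> tuple[int, int, int, int]:
    """Return (vpn2, vpn1, vpn0, page_offset)."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn0 = (va >> 12) & VPN_MASK
    vpn1 = (va >> 21) & VPN_MASK
    vpn2 = (va >> 30) & VPN_MASK
    return vpn2, vpn1, vpn0, offset
```

For the address `0xFFFFFFFF80000000`, `split_sv39` returns `(510, 0, 0, 0)` and the address is canonical, whereas `0x0000004000000000` (bit 38 set, upper bits clear) is not.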

Branch Prediction

Five pluggable predictors:
  Static (always not-taken)
  GShare (global history XOR PC, 2-bit saturating counters)
  Tournament (local + global two-level adaptive with meta-predictor)
  Perceptron (neural predictor with configurable history length and table size)
  TAGE (Tagged Geometric History Length with 8 tagged tables, loop predictor, configurable history lengths and tag widths)
Branch Target Buffer (BTB): set-associative, configurable entries and ways
Return Address Stack (RAS): circular buffer with snapshot pointer for speculative recovery
Global History Register (GHR): arbitrary-length bit vector with speculative update and repair from per-instruction snapshots
RAS link register detection per RISC-V spec Table 2.1: both x1 (ra) and x5 (t0) recognized as link registers, with coroutine swap detection (pop-then-push when rd and rs1 are different link registers)
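
The link-register rules can be written out as a small decision function. This is a sketch of the return-address stack hints in the RISC-V unprivileged spec, not the simulator's implementation; `ras_actions` and its string results are hypothetical.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0)


def is_link(reg: int) -> bool:
    return reg in LINK_REGS


def ras_actions(is_jalr: bool, rd: int, rs1: int = 0) -> list[str]:
    """Return the ordered RAS operations hinted by a JAL or JALR.

    JAL pushes only when rd is a link register; rs1 is ignored for JAL.
    """
    if not is_jalr:
        return ["push"] if is_link(rd) else []
    rd_link, rs1_link = is_link(rd), is_link(rs1)
    if not rd_link and not rs1_link:
        return []                 # plain indirect jump
    if not rd_link:
        return ["pop"]            # return, e.g. jalr x0, 0(ra)
    if not rs1_link:
        return ["push"]           # indirect call
    if rd != rs1:
        return ["pop", "push"]    # coroutine swap between ra and t0
    return ["push"]               # rd == rs1, both link registers
```

Here `ras_actions(True, 0, 1)` models `ret` and yields `["pop"]`, while `ras_actions(True, 1, 5)` yields `["pop", "push"]` for the coroutine swap case.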

ISA

RV64I: full base integer instruction set including W-variants (32-bit operations with sign extension)
M extension: MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, REMU (+ W variants)
A extension: LR/SC with forward progress guarantee, AMO operations (SWAP, ADD, AND, OR, XOR, MIN, MAX, MINU, MAXU) for word and doubleword
F extension: single-precision IEEE 754 — arithmetic, FMA, comparisons, conversions (int-to-float, float-to-int), classification, sign injection, moves, NaN-boxing validation per spec section 12.2
D extension: double-precision IEEE 754 with full parity to F
C extension: compressed (16-bit) instruction encoding, expanded to 32-bit equivalents at decode time
Privileged architecture: M/S/U privilege modes, full CSR set, trap delegation (medeleg/mideleg), MRET/SRET, WFI, ECALL/EBREAK
SFENCE.VMA with ASID-aware and address-specific TLB invalidation (deferred to commit after store drain)
FENCE with predecessor/successor ordering bits, FENCE.I with deferred I-cache invalidation
Physical Memory Protection (PMP): 16 regions with TOR/NAPOT/NA4 address matching
Counter CSRs: CYCLE, TIME, INSTRET with mcounteren/scounteren access control
FP CSRs: frm (rounding mode), fflags (exception flags), fcsr; flags accumulated from in-flight pipeline entries for CSR reads
mstatus privilege control bits: TSR (trap SRET), TW (timeout wait, traps WFI), TVM (trap virtual memory), FS (FP state)
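
NaN-boxing is easy to get subtly wrong, so a short sketch may help. In a 64-bit FP register, a single-precision value is valid only when the upper 32 bits are all ones; any other pattern is read as the canonical single-precision NaN. The helper names below are illustrative and operate on raw bit patterns rather than on the simulator's register file.

```python
import struct

CANONICAL_NAN_F32 = 0x7FC00000
BOX_MASK = 0xFFFFFFFF_00000000


def nan_box(bits32: int) -> int:
    """Place a 32-bit float pattern into a 64-bit register image."""
    return BOX_MASK | (bits32 & 0xFFFFFFFF)


def unbox_f32(reg64: int) -> int:
    """Read a single-precision operand: an improperly boxed value is
    treated as the canonical NaN."""
    if (reg64 & BOX_MASK) == BOX_MASK:
        return reg64 & 0xFFFFFFFF
    return CANONICAL_NAN_F32


def f32_bits(x: float) -> int:
    """IEEE 754 single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]
```

Round-tripping `1.5` through `nan_box` and `unbox_f32` returns `0x3FC00000`, whereas a register image with zeroed upper bits reads back as `0x7FC00000`.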

SoC

CLINT (Core Local Interruptor): mtime/mtimecmp timer interrupt with configurable clock divider
PLIC (Platform-Level Interrupt Controller): 53 interrupt sources, 2 contexts (M-mode and S-mode), priority-based arbitration, claim/complete protocol
UART: 16550A-compatible with interrupt support, configurable output (stdout, stderr, quiet)
VirtIO MMIO block device: virtqueue-based DMA with interrupt notification, filesystem-backed storage
Goldfish RTC: real-time clock for wall-clock time (used by Linux for system time initialization)
SYSCON: system controller for poweroff and reboot signals
HTIF (Host-Target Interface): tohost/fromhost protocol for riscv-tests pass/fail detection and syscall proxying
Auto-generated Flattened Device Tree (FDT/DTB) synthesized from active configuration
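
For readers unfamiliar with HTIF, the pass/fail convention used by riscv-tests can be sketched as below. It assumes the usual encoding for device 0, command 0: an odd `tohost` value signals exit with code `value >> 1` (so `1` means pass and `(n << 1) | 1` means test `n` failed), and an even nonzero value is taken to be a pointer to a proxied-syscall block. `decode_tohost` is a hypothetical helper, not part of the simulator's API.

```python
from typing import Optional


def decode_tohost(value: int) -> Optional[tuple[str, int]]:
    """Interpret a 64-bit tohost write under the riscv-tests convention."""
    if value == 0:
        return None                  # nothing written yet
    if value & 1:
        code = value >> 1
        return ("pass", 0) if code == 0 else ("fail", code)
    return ("syscall", value)        # assumed: address of a syscall block
```

A write of `1` decodes to `("pass", 0)`, and a write of `11` decodes to `("fail", 5)`, identifying test case 5 as the failing one.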

Python API

PyO3-based native extension module (rvsim._core)
Config: composable configuration with Backend, BranchPredictor, Cache, Prefetcher, MemoryController, Fu builders
Environment: high-level run-to-completion with stats collection
Simulator: low-level tick-by-tick control with run_until(pc=..., privilege=...)
Sweep: parallel multi-configuration benchmarking across CPU cores
Register and CSR access by name (reg.A0, csr.MSTATUS)
Memory access views (cpu.mem8, cpu.mem16, cpu.mem32, cpu.mem64)
Pipeline snapshot visualization
Checkpoint save/restore
Stats querying with regex filtering and tabulation

Statistics

Cycle accounting: retiring, ROB-empty, ROB-stall, WFI, per-cycle retirement histogram (0/1/2/3+)
Privilege mode breakdown: U/S/M mode cycles
Pipeline stalls: memory, control, data hazard, FU structural, backpressure, MSHR-full, dispatch
Branch prediction: committed and speculative accuracy and misprediction counts
Pipeline flushes: total, branch-caused, system-caused, squashed instruction count, memory ordering violations
Cache hierarchy: per-level access counts, hits, miss rates
Memory subsystem: MSHR allocations/coalesces/full-stalls, load replays
Inclusion tracking: back-invalidations, exclusive L1-to-L2 swaps
Write-combining buffer: coalesces and drains
Prefetch filter: dedup counts per cache level (L1/L2/L3)
Instruction mix: ALU, load, store, branch, system, FP (broken down into load/store/arith/FMA/div-sqrt)
FU utilization: per-unit-type busy cycle counts

Analysis Scripts

width_scaling.py: IPC vs superscalar width across workloads
branch_predict.py: predictor accuracy comparison
cache_sweep.py: L1D size vs miss rate
inst_mix.py: instruction class breakdown
stall_breakdown.py: stall cycle attribution
top_down.py: top-down microarchitecture analysis
o3_inorder.py: O3 vs in-order IPC comparison
design_space.py: multi-dimensional design-space sweep

Build and Test

Rust 2024 edition with strict linting: clippy pedantic + nursery + cargo, #[deny(clippy::unwrap_used, clippy::expect_used, clippy::todo, clippy::unimplemented)]
1581 tests (1415 unit + 166 integration) + 8 doctests
Zero clippy warnings across all targets
Dev profile at opt-level 1 (usable simulation speed during development), release with fat LTO and single codegen unit
Maturin-based Python packaging with PyO3 stable ABI (abi3-py310)