Memory Hierarchy

rvsim models a complete memory hierarchy from TLBs through L3 cache to DRAM, with configurable parameters at every level.

Overview

flowchart TD
    CPU["CPU Pipeline"] --> ITLB["I-TLB\n32 entries"] & DTLB["D-TLB\n32 entries"]
    ITLB --> L1I["L1-I Cache"]
    DTLB --> L1D["L1-D Cache"]
    ITLB & DTLB -->|miss| L2TLB["L2 TLB\n512 entries · 4-way"]
    L2TLB -->|miss| PTW["Hardware PTW\nSV39 page walk"]
    L1I & L1D -->|miss| MSHR["MSHRs\ncoalescing"]
    MSHR --> L2["L2 Cache"]
    L2 -->|miss| L3["L3 Cache"]
    L3 -->|miss| MC["Memory Controller"]
    MC --> DRAM["DRAM\nrow-buffer timing"]
    L1D <--> STB["Store Buffer\nforwarding · WCB"]

Virtual Memory (SV39)

The simulator implements the RISC-V SV39 page translation scheme:

  • 39-bit virtual addresses with three levels of page tables (VPN[2], VPN[1], VPN[0])
  • 4 KiB base pages, 2 MiB megapages, 1 GiB gigapages
  • Separate iTLB and dTLB — fully associative, configurable size (default: 32 entries each)
  • Shared L2 TLB — set-associative (default: 512 entries, 4-way), accessed on iTLB/dTLB miss
  • Hardware page table walker — walks the page table on L2 TLB miss, manages accessed (A) and dirty (D) bits
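
As a rough illustration of the address layout listed above, the sketch below splits an SV39 virtual address into its three 9-bit VPN fields and the 12-bit page offset, following the RISC-V privileged specification. The function names `split_sv39` and `leaf_page_bytes` are hypothetical and are not taken from rvsim; the check that bits 63:39 match bit 38 is left out.

```python
PAGE_OFFSET_BITS = 12   # 4 KiB base page
VPN_FIELD_BITS = 9      # each page-table level indexes 512 entries


def split_sv39(va: int) -> dict:
    """Split a virtual address into SV39 fields (hypothetical helper).

    Only the low 39 bits are used; the canonical-address check
    (bits 63:39 equal to bit 38) is intentionally omitted here.
    """
    va &= (1 << 39) - 1
    mask = (1 << VPN_FIELD_BITS) - 1
    return {
        "vpn2": (va >> 30) & mask,
        "vpn1": (va >> 21) & mask,
        "vpn0": (va >> 12) & mask,
        "offset": va & ((1 << PAGE_OFFSET_BITS) - 1),
    }


def leaf_page_bytes(leaf_level: int) -> int:
    """Page size mapped by a leaf PTE found at the given walk level.

    Level 0 is a base page, level 1 a megapage, level 2 a gigapage.
    """
    return 1 << (PAGE_OFFSET_BITS + VPN_FIELD_BITS * leaf_level)
```

A leaf found at level 1 or 2 ends the walk early, which is why megapages and gigapages cost fewer memory accesses on a TLB miss.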

The TLB hierarchy is bypassed when satp.MODE = Bare (no translation) or when the hart is in M-mode with mstatus.MPRV clear.
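
The bypass condition can be written as a small predicate. This is a minimal sketch based on the RISC-V privileged specification, not rvsim's actual code: `translation_active` and its parameters are hypothetical names, and the `is_fetch` handling reflects the specification's rule that MPRV affects loads and stores only.

```python
from enum import IntEnum


class Priv(IntEnum):
    U = 0
    S = 1
    M = 3


SATP_MODE_BARE = 0  # satp.MODE encoding for no translation
SATP_MODE_SV39 = 8  # satp.MODE encoding for SV39


def translation_active(satp_mode: int, priv: Priv, mprv: bool = False,
                       mpp: Priv = Priv.M, is_fetch: bool = False) -> bool:
    """Return True when an access must go through the TLB hierarchy."""
    effective = priv
    # With MPRV set, M-mode loads and stores are translated as if running
    # at the privilege held in mstatus.MPP. Instruction fetches are not.
    if priv == Priv.M and mprv and not is_fetch:
        effective = mpp
    return satp_mode != SATP_MODE_BARE and effective != Priv.M
```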

Cache Hierarchy

L1 Instruction Cache

Accessed by the Fetch1 stage. Configurable size, associativity, latency, and replacement policy. Supports hardware prefetching (typically next-line).

Invalidated by:

  • FENCE.I instruction (deferred to commit, drains store buffer first)
  • Inclusive L2 eviction back-invalidation (if inclusion policy is Inclusive)

L1 Data Cache

Accessed by the Memory1 stage. The critical path for load-to-use latency.

Non-blocking operation (MSHRs): When mshr_count > 0, L1D misses allocate a Miss Status Holding Register. The load is parked in the MSHR, and the pipeline continues executing other instructions. Multiple misses to the same cache line are coalesced into a single MSHR entry. When the line arrives from L2/L3/DRAM, all waiting loads are woken up.

Blocking operation (MSHRs = 0): When mshr_count = 0, L1D misses stall the pipeline until the line arrives. This is simpler but prevents the O3 backend from exploiting memory-level parallelism.
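
The two modes above can be sketched with a toy MSHR file. This is an illustrative model only: the class name `MshrFile`, the 64-byte line size, and the choice to report a stall when every MSHR is busy are assumptions, not details taken from rvsim.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64  # assumed cache line size


@dataclass
class MshrFile:
    capacity: int                                # plays the role of mshr_count
    entries: dict = field(default_factory=dict)  # line number -> waiting load ids

    def on_miss(self, addr: int, load_id: int) -> str:
        line = addr // LINE_BYTES
        if line in self.entries:
            # A miss to a line that is already in flight joins that entry.
            self.entries[line].append(load_id)
            return "coalesced"
        if len(self.entries) >= self.capacity:
            # No free entry. With capacity 0 this is always the outcome,
            # which corresponds to the blocking mode.
            return "stall"
        self.entries[line] = [load_id]
        return "allocated"

    def on_fill(self, addr: int) -> list:
        """The line arrived: free the entry and return the loads to wake."""
        return self.entries.pop(addr // LINE_BYTES, [])
```

With `capacity=2`, two loads to the same line share one entry, a third load to a different line takes the second entry, and a fourth distinct line has to wait.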

L2 / L3 Caches

Unified caches accessed on L1 miss. Each level has independent size, associativity, latency, replacement policy, and prefetcher configuration.

Inclusion Policies

The relationship between L1 and L2 is configurable:

| Policy | Behavior | Trade-off |
|---|---|---|
| NINE (default) | No inclusion enforcement | Simple, no coherence traffic |
| Inclusive | L2 eviction back-invalidates matching L1 lines | Guarantees L2 is a superset of L1 |
| Exclusive | L1 eviction installs the line into L2 (swap) | Maximizes effective cache capacity |
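
The eviction behavior of the three policies can be sketched by modeling each cache as a plain set of line addresses. The names `Inclusion`, `on_l2_evict`, and `on_l1_evict` are hypothetical, and dirty-line writeback is ignored to keep the sketch short.

```python
from enum import Enum


class Inclusion(Enum):
    NINE = "nine"
    INCLUSIVE = "inclusive"
    EXCLUSIVE = "exclusive"


def on_l2_evict(policy: Inclusion, line: int, l1: set) -> None:
    """Called when L2 evicts `line`; `l1` holds the lines resident in L1."""
    if policy is Inclusion.INCLUSIVE:
        # Back-invalidate so that L2 stays a superset of L1.
        l1.discard(line)
    # NINE and Exclusive leave L1 untouched.


def on_l1_evict(policy: Inclusion, line: int, l2: set) -> None:
    """Called when L1 evicts `line`; `l2` holds the lines resident in L2."""
    if policy is Inclusion.EXCLUSIVE:
        # The victim moves down into L2 instead of being dropped.
        l2.add(line)
    # NINE and Inclusive do nothing extra in this simplified model.
```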

Store Buffer

The store buffer sits between the pipeline and L1D, holding stores that have executed but not yet committed.

  • Store-to-load forwarding — when a load address matches a pending store in the buffer, the data is forwarded directly without accessing L1D. Supports full and partial overlap detection.
  • Speculative draining — stores can begin draining to L1D before commit, improving throughput
  • Write-combining buffer (WCB) — optional buffer that coalesces multiple stores to the same cache line before draining, reducing L1D write port pressure
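
Store-to-load forwarding with overlap detection might look like the sketch below. It only classifies the match as full, partial, or none; what the pipeline does on a partial overlap (for example, making the load wait) is not specified here and is left to the caller. The names `Store` and `forward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Store:
    addr: int
    size: int
    data: bytes  # store data in memory order, len(data) == size


def forward(stores: list, load_addr: int, load_size: int) -> tuple:
    """Search pending stores, youngest first, for bytes the load needs.

    `stores` is ordered oldest to youngest. Returns ("full", data) when
    the youngest overlapping store covers the whole load, ("partial", None)
    when it covers only part of it, and ("none", None) otherwise.
    """
    load_end = load_addr + load_size
    for st in reversed(stores):
        st_end = st.addr + st.size
        if st_end <= load_addr or load_end <= st.addr:
            continue  # byte ranges are disjoint
        if st.addr <= load_addr and load_end <= st_end:
            off = load_addr - st.addr
            return "full", st.data[off:off + load_size]
        return "partial", None
    return "none", None
```

Scanning youngest first matters: the load must see the most recent store to each byte, so an older store that fully covers the load cannot be used if a younger store overlaps it.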

Hardware Prefetching

Each cache level can have an independent hardware prefetcher:

| Prefetcher | How it works |
|---|---|
| NextLine | On any access, prefetch the next N cache lines, where N is the configured degree |
| Stride | PC-indexed table detects constant-stride access patterns |
| Stream | Detects sequential access streams and prefetches ahead |
| Tagged | Prefetch-on-prefetch: a prefetched line triggers further prefetches |

A shared prefetch deduplication filter prevents redundant requests across levels.
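
To make the first two prefetchers concrete, here is a minimal sketch of a next-line generator and a PC-indexed stride detector. The 64-byte line size, the rule that a stride must repeat once before prefetches are issued, and all names (`next_line`, `StridePrefetcher`) are assumptions for illustration and may differ from rvsim's implementation.

```python
LINE_BYTES = 64  # assumed cache line size


def next_line(addr: int, degree: int) -> list:
    """Addresses of the next `degree` cache lines after the one holding addr."""
    base = addr // LINE_BYTES * LINE_BYTES
    return [base + i * LINE_BYTES for i in range(1, degree + 1)]


class StridePrefetcher:
    def __init__(self, degree: int = 1):
        self.degree = degree
        self.table = {}  # pc -> (last address, last observed stride)

    def access(self, pc: int, addr: int) -> list:
        """Record an access and return the addresses to prefetch, if any."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Issue prefetches only when the same non-zero stride is seen twice.
        if stride == 0 or stride != last_stride:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]
```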

DRAM Controller

When all cache levels miss, the request reaches the memory controller, which can be one of two models:

Simple controller — fixed latency for all accesses.

DRAM controller — models row-buffer aware timing:

  • Row hit: access costs t_cas cycles (column access to an already-open row)
  • Row miss: access costs row_miss_latency cycles (precharge + row activate + column access)
  • Bank interleaving: addresses are distributed across banks; accesses to different banks can overlap
  • Refresh: periodic refresh cycles (t_refi / t_rfc) temporarily block accesses

The DRAM controller maintains per-bank row buffer state, so the actual latency of an access depends on whether the target row is already open.
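
The row-buffer behavior can be sketched as follows. The parameter names `t_cas` and `row_miss_latency` come from the text above; the default values, the bank count, the row size, and the address-to-bank mapping are illustrative assumptions, and refresh is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class DramModel:
    t_cas: int = 14              # assumed value, in cycles
    row_miss_latency: int = 40   # assumed value, in cycles
    num_banks: int = 8           # assumed
    row_bytes: int = 8192        # assumed row size
    open_rows: dict = field(default_factory=dict)  # bank -> currently open row

    def access(self, addr: int) -> int:
        """Return the latency of one access and update row-buffer state."""
        row_index = addr // self.row_bytes
        # Assumed mapping: consecutive rows are interleaved across banks.
        bank = row_index % self.num_banks
        row = row_index // self.num_banks
        hit = self.open_rows.get(bank) == row
        self.open_rows[bank] = row
        return self.t_cas if hit else self.row_miss_latency
```

Under this mapping, a sequential scan pays the row-miss cost once per row and the row-hit cost for every other access to it, while two streams that map to the same bank but different rows keep closing each other's row.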