Memory Hierarchy¶
rvsim models a complete memory hierarchy from TLBs through L3 cache to DRAM, with configurable parameters at every level.
Overview¶
```mermaid
flowchart TD
CPU["CPU Pipeline"] --> ITLB["I-TLB\n32 entries"] & DTLB["D-TLB\n32 entries"]
ITLB --> L1I["L1-I Cache"]
DTLB --> L1D["L1-D Cache"]
ITLB & DTLB -->|miss| L2TLB["L2 TLB\n512 entries · 4-way"]
L2TLB -->|miss| PTW["Hardware PTW\nSV39 page walk"]
L1I & L1D -->|miss| MSHR["MSHRs\ncoalescing"]
MSHR --> L2["L2 Cache"]
L2 -->|miss| L3["L3 Cache"]
L3 -->|miss| MC["Memory Controller"]
MC --> DRAM["DRAM\nrow-buffer timing"]
L1D <--> STB["Store Buffer\nforwarding · WCB"]
```
Virtual Memory (SV39)¶
The simulator implements the RISC-V SV39 page translation scheme:
- 39-bit virtual addresses with three levels of page tables (VPN[2], VPN[1], VPN[0])
- 4KB base pages, 2MB megapages, 1GB gigapages
- Separate iTLB and dTLB — fully associative, configurable size (default: 32 entries each)
- Shared L2 TLB — set-associative (default: 512 entries, 4-way), accessed on iTLB/dTLB miss
- Hardware page table walker — walks the page table on L2 TLB miss, manages accessed (A) and dirty (D) bits
The TLB hierarchy is bypassed when `satp.MODE` = Bare (no translation) or in M-mode without `mstatus.MPRV` set.
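The SV39 address split described above can be sketched in a few lines; the function name below is illustrative, not part of rvsim's API:

```python
def split_sv39(va: int):
    """Decompose a 39-bit SV39 virtual address.

    Layout: VPN[2] (9 bits) | VPN[1] (9 bits) | VPN[0] (9 bits) | offset (12 bits).
    Each VPN field indexes one level of the three-level page table.
    """
    offset = va & 0xFFF                                 # bits [11:0]
    vpn = [(va >> (12 + 9 * i)) & 0x1FF for i in range(3)]
    return vpn[2], vpn[1], vpn[0], offset
```

A megapage translation stops the walk after VPN[1], so the remaining VPN[0] bits become part of a 2 MB offset; gigapages stop after VPN[2].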
Cache Hierarchy¶
L1 Instruction Cache¶
Accessed by the Fetch1 stage. Configurable size, associativity, latency, and replacement policy. Supports hardware prefetching (typically next-line).
Invalidated by:
- `FENCE.I` instruction (deferred to commit, drains store buffer first)
- Inclusive L2 eviction back-invalidation (if the inclusion policy is Inclusive)
L1 Data Cache¶
Accessed by the Memory1 stage. The critical path for load-to-use latency.
Non-blocking operation (MSHRs): When `mshr_count` > 0, L1D misses allocate a Miss Status Holding Register. The load is parked in the MSHR, and the pipeline continues executing other instructions. Multiple misses to the same cache line are coalesced into a single MSHR entry. When the line arrives from L2/L3/DRAM, all waiting loads are woken up.
Blocking operation (MSHRs = 0): When `mshr_count` = 0, L1D misses stall the pipeline until the line arrives. This is simpler but prevents the O3 backend from exploiting memory-level parallelism.
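The allocate/coalesce decision can be sketched as follows; the class and method names are hypothetical, not rvsim's internals:

```python
class MSHRFile:
    """Minimal sketch of MSHR allocation with same-line coalescing."""

    def __init__(self, mshr_count: int, line_bytes: int = 64):
        self.mshr_count = mshr_count
        self.line_bytes = line_bytes
        self.entries = {}  # line address -> list of waiting load ids

    def on_miss(self, addr: int, load_id: int) -> str:
        if self.mshr_count == 0:
            return "stall"                    # blocking mode: no MSHRs
        line = addr & ~(self.line_bytes - 1)  # align to cache-line boundary
        if line in self.entries:
            self.entries[line].append(load_id)
            return "coalesced"                # secondary miss: reuse the entry
        if len(self.entries) < self.mshr_count:
            self.entries[line] = [load_id]
            return "allocated"                # primary miss: new entry
        return "stall"                        # all MSHRs in use

    def on_fill(self, line: int):
        return self.entries.pop(line, [])     # wake every load waiting on the line
```

Note that a full MSHR file degrades to the blocking behavior: further misses stall until a fill frees an entry.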
L2 / L3 Caches¶
Unified caches accessed on L1 miss. Each level has independent size, associativity, latency, replacement policy, and prefetcher configuration.
Inclusion Policies¶
The relationship between L1 and L2 is configurable:
| Policy | Behavior | Trade-off |
|---|---|---|
| NINE (default) | No inclusion enforcement | Simple, no coherence traffic |
| Inclusive | L2 eviction back-invalidates matching L1 lines | Guarantees L2 is a superset of L1 |
| Exclusive | L1 eviction installs the line into L2 (swap) | Maximizes effective cache capacity |
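The eviction-time behavior that distinguishes the three policies can be sketched like this (function names are illustrative; rvsim's actual hooks may differ):

```python
def on_l2_evict(policy: str, line: int, l1_lines: set):
    """React to an L2 eviction according to the inclusion policy."""
    if policy == "Inclusive" and line in l1_lines:
        l1_lines.discard(line)  # back-invalidate the matching L1 line
    # NINE: no action; Exclusive: L2 never duplicates L1, nothing to do

def on_l1_evict(policy: str, line: int, l2_lines: set):
    """React to an L1 eviction according to the inclusion policy."""
    if policy == "Exclusive":
        l2_lines.add(line)      # victim line is installed (swapped) into L2
    # NINE / Inclusive: L2 already holds (or may hold) the line
```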
Store Buffer¶
The store buffer sits between the pipeline and L1D, holding stores that have executed but not yet committed.
- Store-to-load forwarding — when a load address matches a pending store in the buffer, the data is forwarded directly without accessing L1D. Supports full and partial overlap detection.
- Speculative draining — stores can begin draining to L1D before commit, improving throughput
- Write-combining buffer (WCB) — optional buffer that coalesces multiple stores to the same cache line before draining, reducing L1D write port pressure
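The full/partial overlap check behind store-to-load forwarding reduces to interval comparison; this is a sketch with an illustrative name, not rvsim's code:

```python
def classify_overlap(store_addr: int, store_size: int,
                     load_addr: int, load_size: int) -> str:
    """Classify how a pending store overlaps an incoming load."""
    s_end = store_addr + store_size
    l_end = load_addr + load_size
    if s_end <= load_addr or l_end <= store_addr:
        return "none"     # disjoint: the load reads L1D normally
    if store_addr <= load_addr and l_end <= s_end:
        return "full"     # store covers the load: forward the data directly
    return "partial"      # merge store bytes with L1D data, or replay
```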
Hardware Prefetching¶
Each cache level can have an independent hardware prefetcher:
| Prefetcher | How it works |
|---|---|
| NextLine | On any access, prefetch the next `degree` cache lines |
| Stride | PC-indexed table detects constant-stride access patterns |
| Stream | Detects sequential access streams and prefetches ahead |
| Tagged | Prefetch-on-prefetch: a prefetched line triggers further prefetches |
A shared prefetch deduplication filter prevents redundant requests across levels.
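A PC-indexed stride detector, the core of the Stride prefetcher above, can be sketched as follows; the table layout and confidence threshold are assumptions, not rvsim's configuration:

```python
class StridePrefetcher:
    """Minimal PC-indexed stride prefetcher sketch."""

    def __init__(self, degree: int = 2, threshold: int = 2):
        self.table = {}           # pc -> (last_addr, stride, confidence)
        self.degree = degree      # lines to prefetch once confident
        self.threshold = threshold

    def access(self, pc: int, addr: int):
        last, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            conf += 1             # same non-zero stride seen again
        else:
            conf = 0              # stride changed: retrain
        self.table[pc] = (addr, new_stride, conf)
        if conf >= self.threshold:
            return [addr + new_stride * i for i in range(1, self.degree + 1)]
        return []                 # not confident yet: issue nothing
```

A load walking an array with a constant stride trains the entry in a few accesses, after which each access issues `degree` prefetches ahead of the stream.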
DRAM Controller¶
When all cache levels miss, the request reaches the memory controller:
Simple controller — fixed latency for all accesses.
DRAM controller — models row-buffer aware timing:
- Row hit: access costs `t_cas` cycles (column access to an already-open row)
- Row miss: access costs `row_miss_latency` cycles (precharge + row activate + column access)
- Bank interleaving: addresses are distributed across banks; accesses to different banks can overlap
- Refresh: periodic refresh cycles (`t_refi` / `t_rfc`) temporarily block accesses
The DRAM controller maintains per-bank row buffer state, so the actual latency of an access depends on whether the target row is already open.
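The per-bank row-buffer state machine can be sketched like this. `t_cas` and `row_miss_latency` come from the text; the default values, bank mapping, and row granularity are assumptions for illustration:

```python
class DramController:
    """Sketch of row-buffer-aware DRAM access latency."""

    def __init__(self, t_cas: int = 14, row_miss_latency: int = 38,
                 n_banks: int = 8):
        self.t_cas = t_cas
        self.row_miss_latency = row_miss_latency
        self.n_banks = n_banks
        self.open_row = [None] * n_banks   # per-bank open-row state

    def access(self, addr: int) -> int:
        bank = (addr >> 6) % self.n_banks  # cache-line-interleaved mapping (assumed)
        row = addr >> 16                   # assumed row granularity: 64 KiB
        if self.open_row[bank] == row:
            return self.t_cas              # row hit: column access only
        self.open_row[bank] = row          # precharge old row, activate new one
        return self.row_miss_latency       # row miss: full activate path
```

Consecutive accesses that stay within an open row pay only `t_cas`, which is why streaming workloads see much lower average DRAM latency than pointer-chasing ones.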