RAM and Cache in Computing
Overview
In today's computing landscape, the performance of a system is heavily influenced by the efficiency of its Random Access Memory (RAM) and CPU cache. Both of these components play critical roles in determining how quickly and effectively a computer can process tasks. This article explores the significance of RAM and cache, their impact on performance, and key concepts like memory bandwidth, memory latency, cache lines, and memory sharing.
RAM (Random Access Memory):
RAM is the primary memory in a computer, used to store data that the CPU needs to access quickly. It is volatile, meaning that it loses its data when the computer is turned off. RAM is essential for providing the CPU with the data and instructions it needs to perform tasks.
CPU Cache:
The CPU cache is a smaller, faster type of volatile memory located directly on or near the CPU. It is designed to temporarily hold data and instructions that the CPU is likely to reuse. The cache reduces the time it takes for the CPU to access frequently used data, thereby speeding up processing.
Memory Bandwidth
Memory bandwidth refers to the rate at which data can be read from or written to RAM by the CPU. Higher bandwidth means that more data can be transferred in a given amount of time, which directly impacts the system's performance.
A system with high memory bandwidth can handle more data-intensive tasks efficiently.
For example, in applications like video editing or gaming, where large amounts of data need to be processed quickly, high memory bandwidth is crucial for smooth performance.
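As a rough illustration of what "bandwidth" means in absolute numbers (these figures are illustrative, not measurements from the experiments below): a single channel of DDR4-3200 memory performs 3.2 billion transfers per second over an 8-byte (64-bit) bus, giving a theoretical peak of about 3.2 × 8 ≈ 25.6 GB/s; a dual-channel configuration doubles that, and the bandwidth achievable in practice is somewhat lower.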
Experiment Analysis
On the data path between the CPU registers and the RAM, there is a hierarchy of caches that exist to speed up access to frequently used data: the layers closer to the processor are faster but also smaller in size. The word “faster” here applies to two closely related but separate timings:
The delay between the moment when a read or a write is initiated and when the data arrives (latency).
The number of memory operations that can be processed per unit of time (bandwidth).
For many algorithms, memory bandwidth is the most important characteristic of the cache system. It is also the easiest to measure, so we will start with it.
For our experiment, we create an array and iterate over it K times, incrementing its values:
int a[N];

for (int t = 0; t < K; t++)
    for (int i = 0; i < N; i++)
        a[i]++;
Changing N and adjusting K so that the total number of array cells accessed remains roughly constant and expressing the total time in “operations per second,” we get a graph like this:
Dotted vertical lines are cache layer sizes
When the whole array fits into the lowest layer of cache, the program is bottlenecked by the CPU rather than the L1 cache bandwidth. As the array becomes larger, the overhead associated with the first iterations of the loop becomes smaller, and the performance gets closer to its theoretical maximum of 16 GFLOPS.
But then the performance drops: first to 12-13 GFLOPS when it exceeds the L1 cache, and then gradually to about 2 GFLOPS when it can no longer fit in the L3 cache.
This situation is typical for many lightweight loops.
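For reference, here is a minimal sketch of how such a measurement could be set up. It is an assumption about the harness rather than the exact code behind the graph above: the maximum array size, the fixed total of 2^28 increments, and the use of clock() are all illustrative choices.

#include <stdio.h>
#include <time.h>

#define MAXN (1 << 26)

int a[MAXN];

// run the increment loop over the first n cells, keeping n * k roughly
// constant, and return the achieved rate in increments per second
double bench(int n) {
    long long total = 1ll << 28;   // assumed fixed total number of increments
    int k = total / n;
    clock_t start = clock();
    for (int t = 0; t < k; t++)
        for (int i = 0; i < n; i++)
            a[i]++;
    double secs = (double) (clock() - start) / CLOCKS_PER_SEC;
    return (double) k * n / secs;
}

int main(void) {
    for (int n = 1 << 10; n <= MAXN; n *= 2)
        printf("N = %9d: %.2f billion increments/s\n", n, bench(n) / 1e9);
    return 0;
}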
Memory Latency
Memory latency is the delay between a request for data by the CPU and the moment it starts receiving that data. Lower latency means quicker access to data, which is essential for real-time processing tasks.
Impact on Performance:
While high bandwidth allows more data to be transferred, low latency ensures that the data can be accessed quickly. In scenarios where quick response times are critical, such as in real-time applications or gaming, low latency is just as important as high bandwidth.
Component | Latency (nanoseconds) | Relative Latency
L1 Cache  | 1-3                   | Very low
L2 Cache  | 3-10                  | Low
L3 Cache  | 10-20                 | Moderate
RAM       | 50-100                | High
This table shows the relative latencies of different memory types, highlighting the importance of cache in reducing access times.
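One way to observe these latencies directly is pointer chasing: we walk through a random permutation so that each load depends on the result of the previous one, which prevents the hardware from overlapping or prefetching the accesses, and the time per iteration approximates the access latency rather than the bandwidth. The sketch below is an illustration rather than a rigorous benchmark; the array size, the use of Sattolo's algorithm to build a single cycle, and the clock()-based timing are all assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 1 << 22;                 // assumed: 16 MB of ints, larger than most L3 caches
    int *p = (int*) malloc(n * sizeof(int));

    // Sattolo's algorithm: build a random permutation consisting of a single
    // cycle, so that repeatedly following p[k] eventually visits every cell
    for (int i = 0; i < n; i++)
        p[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % i;
        int t = p[i]; p[i] = p[j]; p[j] = t;
    }

    int k = 0;
    clock_t start = clock();
    for (int t = 0; t < n; t++)
        k = p[k];                          // serial dependency chain: measures latency, not bandwidth
    double ns = (double) (clock() - start) / CLOCKS_PER_SEC * 1e9 / n;

    printf("~%.1f ns per dependent access (final index: %d)\n", ns, k);
    free(p);
    return 0;
}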
Cache Lines
A cache line is the smallest unit of data that can be transferred between the CPU cache and RAM. Typically, a cache line is 64 bytes in size.
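A simple way to see this granularity in action is to compare a dense loop with a strided one. In the sketch below (an illustrative setup, not one of the experiments above), incrementing every 16th int touches roughly the same set of 64-byte cache lines as incrementing every int, so for arrays that do not fit in the cache the two loops take about the same time, even though the second performs 16 times fewer increments:

const int N = 1 << 24;              // assumed: 64 MB of ints, well beyond the L3 cache
int a[N];

for (int i = 0; i < N; i++)         // dense: 16 increments per 64-byte cache line
    a[i]++;

for (int i = 0; i < N; i += 16)     // strided: 1 increment per cache line, same lines touched
    a[i]++;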
Frequency Scaling
All CPU cache layers are placed on the same microchip as the processor, so the bandwidth, latency, and all their other characteristics scale with the clock frequency. The RAM, on the other hand, lives on its own fixed clock, and its characteristics remain constant. We can observe this by re-running the same benchmark with turbo boost enabled.
This detail comes into play when comparing algorithm implementations: when the working dataset fits in the cache, the relative performance of two implementations may change with the CPU clock rate, because everything on the chip scales with it while the RAM does not.
For this reason, it is recommended to keep the clock rate fixed; since turbo boost is not stable enough, we run most of the benchmarks in this article at a flat 2 GHz.
Directional Access
This incrementing loop needs to perform both reads and writes during its execution: on each iteration, we fetch a value, increment it, and then write it back. In many applications, we only need to do one of them, so let’s try to measure unidirectional bandwidth.
Calculating the sum of an array only requires memory reads:
for (int i = 0; i < N; i++)
    s += a[i];
And zeroing an array (or filling it with any other constant value) only requires memory writes:
for (int i = 0; i < N; i++)
    a[i] = 0;
Both loops are trivially vectorized by the compiler, and the second one is actually replaced with a memset, so the CPU is also not the bottleneck here (except when the array fits into the L1 cache).
Unidirectional and bidirectional memory accesses perform differently because they share the cache and memory buses and other CPU resources. In the case of RAM, this causes a twofold difference in performance between the pure-read and the simultaneous read-and-write scenarios: the memory controller has to switch between modes on the one-way memory bus, halving the bandwidth. The performance drop is less severe for the L2 cache: the bottleneck there is not the cache bus, so the incrementing loop loses only ~15%.
There is one interesting anomaly on the graph: the write-only loop performs the same as the read-and-write one once the array spills into the L3 cache and the RAM. This is because the CPU pulls the accessed cache line into the cache hierarchy on each access, whether it is a read or a write, which is usually a good optimization, as in many use cases we will need the data again soon. When reading, this is not a problem, since the data travels through the cache hierarchy anyway, but when writing, it causes another implicit read to be dispatched right after the write, thus requiring twice the bus bandwidth.
Bypassing the Cache
We can prevent the CPU from reading back the data that we have just written by using non-temporal memory accesses. To do this, we first re-implement the zeroing loop explicitly, without relying on compiler vectorization:
const __m256i zeros = _mm256_set1_epi32(0);

for (int i = 0; i + 7 < N; i += 8)
    _mm256_store_si256((__m256i*) &a[i], zeros);
We can replace the usual vector store intrinsic with a non-temporal one:
const __m256i zeros = _mm256_set1_epi32(0);

for (int i = 0; i + 7 < N; i += 8)
    _mm256_stream_si256((__m256i*) &a[i], zeros);
Non-temporal memory reads or writes are a way to tell the CPU that we won’t be needing the data that we have just accessed in the future, so there is no need to read the data back after a write.
On the one hand, if the array is small enough to fit into the cache and we actually access it again shortly after, this has a negative effect, because we have to read it entirely from the RAM (or, in this case, write it into the RAM instead of using a locally cached version). On the other hand, it prevents read-backs and lets us use the memory bus more efficiently.
In fact, in the RAM case the performance increase is more than 2x: the non-temporal version is even faster than the read-only benchmark. This happens because:
the memory controller doesn’t have to switch the bus between read and write modes this way;
the instruction sequence becomes simpler, allowing for more pending memory instructions;
and, most importantly, the memory controller can simply “fire and forget” non-temporal write requests — while for reads, it needs to remember what to do with the data once it arrives (similar to connection handles in networking software).
Theoretically, both requests should use the same bandwidth: a read request sends an address and gets data, and a non-temporal write request sends an address with data and gets nothing. Not accounting for the direction, we transmit the same data, but the read cycle will be longer because it needs to wait for the data to be fetched. Since there is a practical limit on how many concurrent requests the memory system can handle, this difference in read/write cycle latency also results in the difference in their bandwidth.
Also, for these reasons, a single CPU core usually can’t fully saturate the memory bandwidth.
The same technique generalizes to memcpy: it also just moves 32-byte blocks with SIMD load/store instructions, and it can be similarly made non-temporal, increasing the throughput twofold for large arrays. There is also a non-temporal load instruction (_mm256_stream_load_si256) for when you want to read without polluting cache (e.g., when you don’t need the original array after a memcpy, but will need some data that you had accessed before calling it).
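As a rough sketch of what such a non-temporal copy could look like (memcpy_nt is a hypothetical helper written for this illustration, not a standard library function; it assumes the destination pointer is 32-byte aligned and that n is a multiple of 8):

#include <immintrin.h>
#include <stddef.h>

// copy n ints from src to dst using streaming stores, so that the written
// blocks bypass the cache instead of evicting data that may still be useful
void memcpy_nt(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i + 7 < n; i += 8) {
        __m256i x = _mm256_loadu_si256((const __m256i*) &src[i]);
        _mm256_stream_si256((__m256i*) &dst[i], x);
    }
    _mm_sfence(); // make the streaming stores globally visible before returning
}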