Different caches may feature different numbers of ports. As a matter of fact, how many ports to implement depends on the processor's functional units and their pipelines. A regular single-ported cache memory can handle one read or write request per cycle. When it comes to the D-cache, a single-ported implementation may be good enough for a simple processor design with only one or two execution pipelines. More pipelines require more data to operate on, and it is definitely undesirable to have a pipeline stalled because the D-cache has failed to deliver data in a timely manner. At the same time, adding a cache port may be a serious task requiring a complete redesign of the particular cache subsystem.
Let's continue with an example. Consider a superscalar RISC processor with 4 integer execution pipelines (2 of them capable of calculating virtual addresses) and 2 floating-point execution pipelines: the DEC Alpha 21264. Most instructions may be dispatched onto the pipelines every cycle, which means maximal throughput as well as maximal load imposed on the I-cache and D-cache. Alpha is a RISC architecture with fixed-length instructions of 4 bytes each, so parallel and look-ahead instruction decoding may be implemented without trouble, unlike in CISC architectures such as x86. In general, load/store instructions account for 20% to 50% of the instruction stream, with about 2 load operations per 1 store. So, a commercially viable configuration must include no fewer than 2 D-cache read ports, and it is highly desirable to allow for parallel execution of at least 1 write and 1 read request.
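As a rough sanity check of these figures, here is a minimal back-of-the-envelope sketch in C; the dispatch width and load/store ratios are taken from the paragraph above, and everything else is illustrative:

    #include <stdio.h>

    int main(void)
    {
        double dispatch = 4.0;                /* instructions dispatched per cycle */
        double memfracs[] = { 0.2, 0.5 };     /* 20% and 50% load/store share */
        for (int i = 0; i < 2; i++) {
            double mem    = dispatch * memfracs[i];
            double loads  = mem * 2.0 / 3.0;  /* about 2 loads per 1 store */
            double stores = mem / 3.0;
            printf("mem ops %2.0f%%: %.2f loads + %.2f stores per cycle\n",
                   memfracs[i] * 100.0, loads, stores);
        }
        return 0;
    }

At the heavy end of the range the D-cache faces about 2 requests per cycle, roughly two thirds of them loads, which is consistent with the port requirements above.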
The question is how to achieve this. There are several ways to organise multiported caches:
- ideal multiporting;
- time division multiplexing;
- mirroring (cloning);
- interleaving.
Ideal multiporting implies a non-standard cache cell structure. While a regular single-ported static memory cell is built of 6 field-effect transistors, a dual-ported static memory cell needs 2 additional transistors. Such dual-ported cells are more expensive and not as fast as single-ported ones. In addition, they consume more power.
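To put the overhead in perspective, a trivial sketch computing the per-cell transistor count, assuming the 6-transistor baseline above and 2 extra transistors per additional port; this is a counting model only, not a layout:

    #include <stdio.h>

    int main(void)
    {
        /* 6 transistors for a single-ported cell, plus 2 per extra port */
        for (int ports = 1; ports <= 4; ports++) {
            int t = 6 + 2 * (ports - 1);
            printf("%d port(s): %2d transistors per cell (+%d%% vs 6T)\n",
                   ports, t, (t - 6) * 100 / 6);
        }
        return 0;
    }

Even a second port costs about a third more transistors per cell, before accounting for the larger cell area and extra wiring.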
Time division multiplexing is a technique which may also be referred to as cache overclocking. In other words, if cache memory is made to run at 2x (3x, 4x) the processor clock speed, it is able to process two (three, four) requests per processor cycle. Although such a cache memory is physically single-ported, it actually behaves as if it were ideally dual (triple, quad) ported. For instance, the DEC Alpha 21264 accommodates a 64Kb D-cache which runs at 2x the processor clock speed. This approach doesn't increase manufacturing costs, but it requires greater engineering effort because of more complicated electrical and thermal management. It is common knowledge that power consumption scales linearly with clock speed at constant voltage.
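A minimal sketch of the idea, assuming a hypothetical queue of pending requests and a 2x clock ratio (the names here are illustrative, not from any real design):

    #include <stdio.h>

    #define CLOCK_RATIO 2                    /* cache clock = 2x CPU clock */

    typedef struct { unsigned tag; } Request;

    /* One CPU cycle: the single physical port is time-sliced, so up to
       CLOCK_RATIO pending requests complete before the next CPU cycle. */
    static int serve_cpu_cycle(const Request *queue, int pending)
    {
        int served = pending < CLOCK_RATIO ? pending : CLOCK_RATIO;
        for (int i = 0; i < served; i++)
            printf("cache sub-cycle %d: request tag=0x%x\n", i, queue[i].tag);
        return served;
    }

    int main(void)
    {
        Request queue[3] = { {0x10}, {0x20}, {0x30} };
        int done = serve_cpu_cycle(queue, 3);
        printf("%d of 3 requests served this CPU cycle\n", done);
        return 0;
    }

The cache port remains physically single, yet two requests retire per CPU cycle, which is exactly what makes the cache look dual-ported from the pipelines' point of view.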
Mirroring (cloning) is a technique of implementing two (three, four) identical caches. It means they are of the same size and organisation, and contain the same information. This approach allows for two (three, four) read ports because every cache copy may be queried independently. However, there is only one write port, because all cache copies must be written concurrently. Another drawback is that a single write transaction blocks all read transactions, since every cache copy features a single physical port; in theory, though, it is possible to supply every cache copy with a delayed write buffer to work around this issue to some extent. For example, the DEC Alpha 21164 accommodates an 8Kb D-cache consisting physically of two identical 8Kb parts. Cache mirroring (cloning) presents almost no engineering trouble, but it is the least desirable approach in terms of manufacturing costs. As a matter of fact, only small caches may be mirrored (cloned).
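A minimal sketch of mirroring with two copies, assuming illustrative sizes and names:

    #include <stdio.h>

    #define LINES 8

    static unsigned copy0[LINES], copy1[LINES];   /* identical cache copies */

    /* Two reads proceed in parallel, one per copy. */
    static void dual_read(unsigned a, unsigned b)
    {
        printf("read port 0 -> %u, read port 1 -> %u\n",
               copy0[a % LINES], copy1[b % LINES]);
    }

    /* A write claims the single physical port of every copy at once. */
    static void write_both(unsigned addr, unsigned value)
    {
        copy0[addr % LINES] = value;
        copy1[addr % LINES] = value;              /* keep copies coherent */
    }

    int main(void)
    {
        write_both(3, 42);                        /* blocks reads this cycle */
        dual_read(3, 3);                          /* next cycle: parallel reads */
        return 0;
    }

Note that write_both occupies the single physical port of both copies simultaneously, which models why a write blocks all reads in that cycle.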
Interleaving is based on logical segmentation of cache memory. If several cache segments are available, it is possible to query them in parallel. For instance, the MIPS R10000 accommodates a 32Kb D-cache with 2-way set associativity which is capable of 2-way interleaving. This solution has almost no influence on manufacturing costs, and it is not very complicated from an engineering point of view. However, this technique suffers greatly from segment conflicts: if two or more requests hit a single cache segment, they have to be processed sequentially because every segment is single-ported.
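A minimal sketch of 2-way interleaving with an illustrative bank-selection bit (the line size and bit position are assumptions, not taken from any real design):

    #include <stdio.h>

    #define BANKS 2
    #define LINE_BITS 5                 /* assume 32-byte cache lines */

    static unsigned bank_of(unsigned addr)
    {
        return (addr >> LINE_BITS) % BANKS;  /* low line-index bit picks the bank */
    }

    /* Returns the number of cycles needed to serve two read requests. */
    static int cycles_for_pair(unsigned a, unsigned b)
    {
        if (bank_of(a) != bank_of(b))
            return 1;                   /* different banks: served in parallel */
        return 2;                       /* bank conflict: serialised */
    }

    int main(void)
    {
        printf("0x0000 and 0x0020: %d cycle(s)\n", cycles_for_pair(0x0000, 0x0020));
        printf("0x0000 and 0x0040: %d cycle(s)\n", cycles_for_pair(0x0000, 0x0040));
        return 0;
    }

Requests 0x0000 and 0x0020 land in different banks and proceed in parallel, while 0x0000 and 0x0040 collide in bank 0 and are serialised.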
Of course, the techniques explained above may be combined in actual hardware designs. The reality, though, is that industry-standard dual-ported direct-mapped SRAM chips, compared to single-ported ones of the same technological process, are up to 50% slower, and/or consume up to 120% more power, and/or are up to 150% larger in terms of silicon die size.