Alasir Enterprises
Functional Principles of Cache Memory

Paul V. Bolotoff
Release date: 20th of April 2007
Last modified: 5th of May 2007



Different caches may feature different numbers of ports. As a matter of fact, how many ports to implement depends on the processor's functional units and their pipelines. Regular single-ported cache memory can handle one read or write request per cycle. When it comes to D-cache, a single-ported implementation may be good enough for a simple processor design with only one or two execution pipelines. More pipelines require more data to operate on, and it is definitely not good to have a pipeline stalled because the D-cache has failed to deliver data in a timely manner. At the same time, adding a cache port may be a serious task requiring a complete redesign of the cache subsystem involved.
Let's continue with an example. Consider the DEC Alpha 21264, a superscalar RISC processor with 4 integer execution pipelines (2 of them capable of calculating virtual addresses) and 2 floating-point execution pipelines. Most instructions may be dispatched onto the pipelines every cycle, which means maximal throughput as well as maximal load imposed on the I-cache and D-cache. Alpha is a RISC architecture with fixed-length instructions of 4 bytes each, so parallel and look-ahead instruction decoding may be implemented with no trouble, unlike for CISC architectures such as x86. In general, load/store instructions make up from 20% to 50% of the instruction stream, and there are about 2 load operations per 1 store. So, a commercially viable configuration must include no less than 2 read ports on the D-cache, and it is highly desirable to allow for parallel execution of at least 1 write and 1 read request. The question is how to achieve this.
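As a rough illustration of why at least two D-cache ports are wanted, here is a back-of-the-envelope sketch. The particular numbers are assumptions picked from the ranges above (a 35% load/store share and the 2:1 load-to-store ratio), not measurements of the 21264:

```python
# Hypothetical estimate: D-cache requests generated per cycle by a wide
# superscalar core. The 35% share is an assumed midpoint of the 20..50% range.
dispatch_width = 4        # integer instructions dispatched per cycle
memop_share = 0.35        # assumed fraction of loads/stores in the stream
load_to_store = 2 / 3     # about 2 loads per 1 store

memops_per_cycle = dispatch_width * memop_share    # 1.4 memory ops per cycle
loads = memops_per_cycle * load_to_store           # ~0.93 reads per cycle
stores = memops_per_cycle - loads                  # ~0.47 writes per cycle
print(f"{memops_per_cycle:.2f} memory ops/cycle: "
      f"{loads:.2f} loads, {stores:.2f} stores")
```

Even at this assumed midpoint the core averages well over one memory operation per cycle, and bursts will exceed the average, which is why a single D-cache port would stall the pipelines regularly.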
There are several ways to organise multiported caches:
  • ideal multiporting;
  • time division multiplexing;
  • mirroring (cloning);
  • interleaving.

Ideal multiporting implies a non-standard cache cell structure. While a regular single-ported static memory cell is built of 6 field-effect transistors, a dual-ported static memory cell needs 2 additional transistors. Such dual-ported cells are more expensive and not as fast as single-ported ones. In addition, they consume more power.
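The per-cell cost of the second port follows directly from the transistor counts given above:

```python
# Transistor budget per SRAM cell: 6T single-ported vs. 8T dual-ported,
# as stated in the text (two extra access transistors for the second port).
single_ported = 6
dual_ported = single_ported + 2

overhead = (dual_ported - single_ported) / single_ported
print(f"{overhead:.0%} more transistors per cell")  # 33% more transistors per cell
```

And that is before accounting for the extra word and bit lines, which is why real dual-ported arrays grow by considerably more than a third.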
Time division multiplexing is a technique which may also be referred to as cache overclocking. In other words, if cache memory runs at 2x (3x, 4x) processor clock speed, it is able to process two (three, four) requests per processor cycle. Although such a cache memory is physically single-ported, it actually behaves as if it were an ideally dual-ported (triple-ported, quad-ported) one. For instance, the DEC Alpha 21264 accommodates a 64Kb D-cache which runs at 2x processor clock speed. This approach doesn't increase manufacturing costs, but requires stronger engineering efforts because of more complicated electrical and thermal management. It is common knowledge that power consumption scales linearly with clock speed at constant voltage.
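The throughput of such a time-multiplexed port can be sketched with a minimal model (a hypothetical abstraction, not the 21264's actual arbitration logic):

```python
import math

def cpu_cycles_needed(requests: int, clock_multiplier: int) -> int:
    """Time division multiplexing model: a physically single-ported cache
    clocked at `clock_multiplier` times the processor clock retires up to
    that many requests per processor cycle."""
    return math.ceil(requests / clock_multiplier)

print(cpu_cycles_needed(2, 2))  # 2 requests at 2x clock: 1 CPU cycle
print(cpu_cycles_needed(3, 2))  # 3 requests at 2x clock: 2 CPU cycles
```

From the pipelines' point of view, a 2x-clocked single port is indistinguishable from two ideal ports as long as no more than two requests arrive per cycle.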
Mirroring (cloning) is a technique which implements two (three, four) identical caches. They are of the same size and organisation, and they contain the same information. This approach allows for two (three, four) read ports because every cache copy may be queried independently. However, there is only one write port because all copies must be written concurrently. Another bad thing is that a single write transaction blocks all read transactions because every copy features a single physical port, although in theory every copy could be supplied with a delayed write buffer to work around this issue to some extent. To give an example, the DEC Alpha 21164 accommodates an 8Kb D-cache consisting physically of two identical 8Kb parts. Cache mirroring (cloning) involves almost no engineering trouble, but it is the least desirable approach in terms of manufacturing costs. As a matter of fact, only small caches may be mirrored (cloned).
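The read/write asymmetry of a mirrored cache can be captured in a few lines (again a hypothetical model, not the 21164's actual timing):

```python
import math

def mirrored_cycles(reads: int, writes: int, copies: int) -> int:
    """Mirrored cache model: `copies` identical parts give as many read
    ports, but a write must update every copy at once and so blocks all
    reads for that cycle (each copy is physically single-ported)."""
    return writes + math.ceil(reads / copies)

print(mirrored_cycles(2, 0, copies=2))  # two reads in parallel: 1 cycle
print(mirrored_cycles(2, 1, copies=2))  # a write blocks both copies: 2 cycles
```

The model makes the trade-off plain: doubling the silicon buys extra read bandwidth only, while write bandwidth stays at one request per cycle.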
Interleaving is based on logical segmentation of cache memory. If there are several cache segments available, they can be queried in parallel. For instance, the MIPS R10000 accommodates a 32Kb D-cache with 2-way set associativity which is capable of 2-way interleaving. This solution has almost no influence on manufacturing costs, and it isn't very complicated from an engineering point of view. However, this technique suffers greatly from segment conflicts: if two or more requests hit a single cache segment, they have to be processed sequentially because every segment is physically single-ported.
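Segment conflicts are easy to demonstrate with a toy model (the bank-selection scheme and the 32-byte line size are assumptions for illustration, not the R10000's actual design):

```python
from collections import Counter

def interleaved_cycles(addresses, banks=2, line_size=32):
    """Interleaved cache model: requests to distinct segments (banks)
    proceed in parallel, while requests hitting the same bank serialise
    because each bank is physically single-ported. The bank is chosen
    here by low-order bits of the cache line address."""
    per_bank = Counter((addr // line_size) % banks for addr in addresses)
    return max(per_bank.values(), default=0)

print(interleaved_cycles([0, 32]))  # different banks: 1 cycle
print(interleaved_cycles([0, 64]))  # same bank, a conflict: 2 cycles
```

Unlike mirroring, interleaving adds almost no silicon, but its effective bandwidth depends entirely on how evenly the access stream spreads across the segments.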
Of course, the techniques explained above may be combined in actual hardware designs. Still, the reality is that industry-standard dual-ported SRAM chips, compared to single-ported ones built on the same technological process, are up to 50% slower, up to 120% more power-hungry, and/or up to 150% larger in terms of silicon die area.

Copyright (c) Paul V. Bolotoff, 2007. All rights reserved.
A full or partial reprint without a permission received from the author is prohibited.
Designed and maintained by Alasir Enterprises, 1999-2007