Alasir Enterprises
Functional Principles of Cache Memory

Paul V. Bolotoff
Release date: 20th of April 2007
Last modified: 20th of April 2007


Hierarchical Model

In any computer system, cache memory is divided logically and physically into so-called levels. For example, consider an abstract machine with 32Kb of internal cache memory (built into the processor core) and 1Mb of external cache memory (located either on the processor module or on the mainboard): the first may be called cache memory of the 1st level and the second cache memory of the 2nd level. Modern computer systems may accommodate up to four cache levels, though the two-level organisation remains the most popular. Cache memory of the 1st level is usually subdivided into an instruction cache (I-cache) and a data cache (D-cache) — the so-called Harvard architecture. Although I-cache and D-cache are usually of the same size, this isn't mandatory. At the same time, they are almost always integrated into processor cores for performance reasons. There are not many examples to the contrary, and most of them are representatives of the PA-RISC architecture. The Hewlett-Packard PA-7000 utilises an external I-cache and D-cache of 256Kb each, the PA-7100 and PA-7150 of 1Mb and 2Mb respectively, the PA-8000 of 1Mb each, and the PA-8200 of 2Mb each. However, none of these processors ever featured high clock speeds, so it was no trouble to run their I-cache and D-cache at full core clock speed with tolerable access latencies. Nevertheless, these processors delivered very good performance in their day. Taking the SPEC95 benchmarks into consideration, an HP Visualize C200 workstation with a 200MHz PA-8000 inside could be considered equivalent to a DEC Personal WorkStation 600au armed with a 600MHz Alpha 21164A in terms of floating-point performance (21.4 versus 21.3), though it lagged behind by about 20% in terms of integer performance (14.3 versus 18.4).
This does not mean that one architecture is better or worse than another; it is a reminder that core clock speed and cache organisation are just two factors among the many which define the performance of any hardware implementation.
There are several reasons why split caches prevail over unified ones (U-cache) at the 1st level. Different functional units fetch information from I-cache and D-cache: the decoder and scheduler (I-box) operate with I-cache, while the integer execution unit (E-box), also referred to as the arithmetic and logic unit, and the floating-point unit (F-box) communicate with D-cache. There are also the load/store unit (A-box) and the cache and system bus controller (C-box), which are involved directly in operations with the caches. By the way, every functional unit usually consists of one or several execution pipelines. I-cache and D-cache operate with very low access latencies because any increase would cause a serious performance loss on most tasks. In order to keep latencies low, cache size is sacrificed and usually falls in the range of 8Kb to 64Kb. It is not an easy task to place a large cache on a silicon die and to assure its proper synchronisation. If a particular cache gets larger in size while keeping its internal organisation intact, it inevitably takes more time to search through the cache for some information and to transfer what has been found to the output. Apart from that, the more pipelines of functional units utilise a particular cache, the more access ports are required to satisfy them, and adding new ports is a pretty stiff job (cache ports will be discussed in depth later). Additionally, U-cache suffers from other drawbacks. For example, all data must be evicted when flushing instructions from the cache. This situation usually occurs on various system exceptions, which are intended to flush the processor pipelines and restart them at a new address. It is also regular practice to flush an I-cache with virtual tagging on every task switch. On the other hand, U-cache allows for more effective utilisation of itself, i.e. the proportion between instructions and data contained varies with the task being executed.
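The flush drawback mentioned above can be sketched with a toy model contrasting the two organisations. Everything here — class names, the dictionary-based store, the flush method — is invented purely for illustration; no real processor exposes its caches at this level of abstraction, and a real unified cache could avoid full invalidation only at the cost of extra per-line bookkeeping.

```python
class UnifiedCache:
    """Toy U-cache: one store holds both instructions and data."""
    def __init__(self):
        self.lines = {}          # address -> 'instruction' or 'data'

    def fill(self, address, kind):
        self.lines[address] = kind

    def flush_instructions(self):
        # Without per-line type bookkeeping, instructions cannot be flushed
        # selectively: the whole cache is invalidated, and cached data is
        # evicted as collateral damage.
        self.lines.clear()


class SplitCache:
    """Toy Harvard-style level 1: separate instruction and data stores."""
    def __init__(self):
        self.icache = {}
        self.dcache = {}

    def fill(self, address, kind):
        target = self.icache if kind == 'instruction' else self.dcache
        target[address] = kind

    def flush_instructions(self):
        self.icache.clear()      # D-cache contents survive untouched


u, s = UnifiedCache(), SplitCache()
for cache in (u, s):
    cache.fill(0x1000, 'instruction')
    cache.fill(0x2000, 'data')
    cache.flush_instructions()

print(len(u.lines))    # 0 -- data evicted along with the instructions
print(len(s.dcache))   # 1 -- data preserved
```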
Cache memory of the 2nd level is almost always unified, though there are several well-known exceptions to name. For instance, HARP-1 (Hitachi Advanced RISC Processor) of the PA-RISC v1.1 architecture contained an 8Kb I-cache and a 16Kb D-cache which were backed up by external instruction and data caches of 512Kb each. Further, the design of SPARC64 V by HAL Computer (not the one by Fujitsu which went into production under the same name) featured a 32Kb I-cache (expanded with a 1024-entry trace cache) and an 8Kb D-cache which were supported by integrated instruction and data caches of 256Kb and 512Kb respectively; an external unified cache of up to 64Mb was also employed. Among the most recent examples is the Intel Itanium 2 (the one code-named Montecito). Each of its two cores manages a dedicated 16Kb I-cache and 16Kb D-cache as well as integrated instruction and data caches of the 2nd level sized at 1Mb and 256Kb respectively, plus an integrated unified cache of the 3rd level sized at 12Mb. The primary reason for going unified is that cache memory of the 2nd level doesn't need to be as fast as that of the 1st level. In real life, I-cache and D-cache usually satisfy about 80-90% of all memory requests, which makes room for a trade-off: caches of the lower levels may feature higher access and delivery latencies in exchange for larger sizes and more associativity ways to improve hit rates. If cache memory of the 2nd level is integrated into the processor core, it may be called S-cache (secondary cache). If a particular processor employs an integrated cache of the 3rd level, it may be called T-cache (ternary cache). If there is an external cache, most likely consisting of regular discrete static memory chips, it may be referenced as B-cache (back-up cache). As a matter of fact, B-cache happens to be the last level of the cache hierarchy. Unlike integrated caches, B-cache may be driven either by the processor's C-box, by the system logic, or even by both of them.
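The 80-90% hit-rate figure quoted above makes the trade-off easy to quantify with the standard average memory access time (AMAT) formula. The cycle counts below are invented round numbers chosen only to show the shape of the calculation, not measurements of any real processor.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory):
    """Average memory access time for a two-level cache hierarchy.

    AMAT = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory)
    All times are in processor cycles; miss rates are fractions.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory)

# With the 80-90% level-1 hit rate quoted above (i.e. a 10-20% miss rate)
# and hypothetical latencies of 2, 10 and 100 cycles:
best = amat(l1_hit=2, l1_miss_rate=0.10, l2_hit=10, l2_miss_rate=0.20, memory=100)
worst = amat(l1_hit=2, l1_miss_rate=0.20, l2_hit=10, l2_miss_rate=0.20, memory=100)
print(best)    # 5.0 cycles on average
print(worst)   # 8.0 cycles on average
```

Even with a slow 2nd level, the average stays close to the 1st-level latency, which is exactly why larger, slower lower levels pay off.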
In general, every next cache level is larger but slower than the previous one.
The hierarchy of cache levels is best observed during a processor's run time. If a datum needed by some instruction is not available in a register, a request is generated and dispatched to the nearest cache level, i.e. to D-cache. If a miss returns, the request is redirected to the next cache level, and so forth. In the worst case, the datum will be delivered directly from the operating memory. Arrival could be delayed even further if the datum has been pushed out previously by the virtual memory subsystem to a swap file (or swap partition for that matter), i.e. to a hard disk drive. It takes from tens to hundreds of processor cycles to receive a necessary register-wide quantum of information from operating memory, but when it comes to hard disk drives the count may run to hundreds of thousands or even millions of cycles.
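The lookup cascade just described can be sketched as a walk down the hierarchy, accumulating latency at each miss. The level names follow the article; the cycle counts are invented round figures meant only to convey the orders of magnitude, not data for any particular machine.

```python
# Each entry: (level name, hypothetical access latency in processor cycles).
HIERARCHY = [
    ('D-cache', 3),          # 1st level data cache
    ('S-cache', 15),         # integrated 2nd level cache
    ('B-cache', 40),         # external back-up cache
    ('memory', 200),         # operating memory
    ('swap', 5_000_000),     # hard disk drive: millions of cycles
]

def load(address, present_in):
    """Walk the hierarchy until the datum is found.

    Returns (level the datum was found at, total cycles spent).
    Every miss costs the full latency of the level that was probed.
    """
    total = 0
    for level, latency in HIERARCHY:
        total += latency
        if level == present_in:
            return level, total
    raise ValueError('datum not found at any level')

print(load(0x1234, 'D-cache'))   # ('D-cache', 3)
print(load(0x1234, 'memory'))    # ('memory', 258)
```

Note how the cost of a single swapped-out datum dwarfs everything above it, which is why the virtual memory subsystem is the last place a request should ever have to go.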

Copyright (c) Paul V. Bolotoff, 2007. All rights reserved.
A full or partial reprint without permission from the author is prohibited.
Designed and maintained by Alasir Enterprises, 1999-2007