DEC disclosed the first details of the second-generation Alpha processor at
the Hot Chips conference in Palo Alto (California, USA), which opened on the
14th of August 1994. The 21164 (EV5) was officially presented only later, in a
DEC press release dated the 7th of September 1994. The processor was based on
the EV45 core and was an evolution of it rather than a revolutionary new
design. Compared with EV4 or EV45,
the number of pipelines was doubled, for both integer and floating-point
execution. In addition, the floating-point pipelines were shortened from 10
stages to 9. The two integer pipelines were not identical, however: both
handled basic arithmetic and logical operations, but only the first could
multiply and shift, and only the second could process conditional and
unconditional branches. Both could calculate virtual addresses for load
instructions, but only the first could do so for stores. The floating-point
pipelines differed as well: the first executed every floating-point instruction
except multiplies, which were the only instructions the second could process.
To keep the execution units busy, I-box could fetch and decode up to 4
instructions per cycle. EV5 was
manufactured using the same proprietary 4-layer 0.5µ CMOS5 process as EV45 and
therefore required the same 3.3V power supply. It consisted of 9.3 million
transistors (7.8 million of them in the integrated caches) and had a die size
of 299mm², close to the theoretical limit of the process. Core frequencies of
21164 ranged from 266MHz to 333MHz (TDP from 46W to 56W). Form-factor: IPGA-499
(Interstitial Pin Grid Array).
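The division of labour between the four pipelines can be sketched in Python as
a small lookup table. This is only an illustration of the restrictions
described above: the pipeline labels (int0, int1, fp0, fp1) and the instruction
class names are made up for the example, and the real issue logic of EV5
obeyed further rules that are not modelled here.

```python
from itertools import permutations

# Hypothetical labels: which instruction classes each pipeline accepts,
# following the restrictions listed in the text.
PIPELINES = {
    "int0": {"alu", "multiply", "shift", "load", "store"},
    "int1": {"alu", "branch", "load"},
    "fp0": {"fp_other"},       # every floating-point class except multiply
    "fp1": {"fp_multiply"},    # floating-point multiply only
}


def assign(classes):
    """Return one {pipeline: class} mapping that lets all the given
    instruction classes start in the same cycle, or None if none exists."""
    if len(classes) > len(PIPELINES):
        return None
    # Brute force is fine here: there are only 4 pipelines.
    for pipes in permutations(PIPELINES, len(classes)):
        if all(c in PIPELINES[p] for p, c in zip(pipes, classes)):
            return dict(zip(pipes, classes))
    return None
```

Under these assumptions, a load and a store can be paired (the load goes to the
second integer pipeline), whereas two stores or two branches cannot.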
I-cache and D-cache were sized and organised just as in EV4, i.e. 8Kb each,
write-through. D-cache, however, was made dual-ported, so it could deliver data
for 2 load instructions per cycle. Trading transistors for performance, the
designers built D-cache physically from 2 identical copies of 8Kb each, so data
could be read from either copy but had to be written to both. The processor
also carried 96Kb of integrated L2 cache (S-cache, secondary cache), write-back
and 3-way set associative, which C-box accessed through a dedicated 128-bit
data bus. B-cache was still supported but became optional; it consisted of
external cache SRAMs and could be as large as 64Mb, though 1Mb to 4Mb was
typical. The 128-bit data bus to B-cache was still multiplexed with the system
data bus. EV5 thus supported 3 cache levels and was the first processor to
feature such a hierarchy.
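The duplicated D-cache can be pictured with a minimal Python sketch. The class
and method names are hypothetical, and the model ignores tags, lines and
timing; it only shows why two copies give two read ports at the price of
writing everything twice.

```python
class DuplicatedCache:
    """Two identical copies of one data array: each copy serves one load
    per cycle, and every store must update both copies."""

    def __init__(self, size_bytes=8 * 1024):
        self.copies = [bytearray(size_bytes), bytearray(size_bytes)]

    def write(self, offset, data):
        # A store goes to both copies so that they never diverge.
        # Offsets are assumed to stay within the array.
        for copy in self.copies:
            copy[offset:offset + len(data)] = data

    def read_pair(self, offset_a, offset_b, width=8):
        # Two loads in the same cycle, one from each copy.
        a = bytes(self.copies[0][offset_a:offset_a + width])
        b = bytes(self.copies[1][offset_b:offset_b + width])
        return a, b
```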
S-cache was accessed through a 4-stage pipeline: 2 cycles for tag look-up
and bank activation plus 2 cycles for data access and delivery (16 bytes per
cycle), though an extra cycle was required for data to propagate across the
processor from C-box to D-cache and either E-box or F-box. The engineers who
designed EV5 considered performing tag look-up and data access in parallel, so
that all 3 banks would deliver data to be selected on arrival. This approach
would have shortened the pipeline by 1 stage but would have raised processor
power consumption considerably (an estimated +40%). D-cache nevertheless
operated in exactly this way, since it had only 1 bank of 8Kb rather than 3
banks of 32Kb each. Moreover, the read latency of D-cache was reduced from 3
cycles to 2. Every
line of S-cache was 64 bytes wide with one tag per line, though each line could
be addressed as two 32-byte sublines because I-cache and D-cache operated with
32-byte lines. S-cache was inclusive of D-cache. In turn, B-cache was inclusive
of S-cache, regardless of the latter's write-back policy and the difference in
associativity. I-TLB held 48 entries (for pages sized from 8Kb to 4Mb); D-TLB
held 64 entries (for pages of the same sizes) and was dual-ported for load
operations in the same manner as D-cache. The system data bus had a fixed width
of 128 bits plus 16 bits for ECC protection. It was still multiplexed with the
data path to B-cache but was more efficient thanks to a new split-transaction
protocol. The system address bus was 40 bits wide and the system control bus 10
bits wide.
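The S-cache figures quoted above (96Kb, 3 ways, 64-byte lines, 32-byte
sublines, 16 bytes delivered per cycle) imply a certain geometry, which the
following Python sketch works out. The address split is the textbook scheme
for a set-associative cache and is an assumption here, not a description of
the actual EV5 indexing logic; all names are illustrative.

```python
S_CACHE_BYTES = 96 * 1024
WAYS = 3
LINE_BYTES = 64
SUBLINE_BYTES = 32
BYTES_PER_CYCLE = 16

# One set holds one line from each of the 3 ways.
SETS = S_CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = SETS.bit_length() - 1

# Cycles needed to stream a whole line or a single subline.
LINE_TRANSFER_CYCLES = LINE_BYTES // BYTES_PER_CYCLE
SUBLINE_TRANSFER_CYCLES = SUBLINE_BYTES // BYTES_PER_CYCLE


def split_address(addr):
    """Split an address into (tag, set index, subline, byte offset),
    assuming conventional low-order-bit indexing."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    subline = offset // SUBLINE_BYTES
    return tag, index, subline, offset
```

With these numbers each way of 32Kb holds 512 lines, so the cache has 512 sets,
a 6-bit byte offset and a 9-bit set index.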
21164A (EV56) was introduced at the Microprocessor Forum in October of 1995. It
was a revision of EV5 redesigned for a proprietary 4-layer 0.35µ CMOS6 process.
It was manufactured at the same semiconductor factory in Hudson (Massachusetts,
USA), which DEC had modernised beforehand at a cost of about 450 million USD.
The most important architectural difference was BWX (Byte-Word Extension), a
set of 6 additional instructions for loading, storing and sign-extending data
in 8- or 16-bit quanta (LDBU, LDWU, STB, STW, SEXTB, SEXTW). From the very
start, the Alpha architecture could load and store data only in 32- or 64-bit
quanta, which caused certain difficulties when porting or emulating code
written for other processor architectures such as i386 or MIPS. A request to
implement BWX in hardware was submitted by Richard Sites in June of 1994 and
approved in June of 1995. To perform BWX transfers, however, the system logic
had to support them as well. Core frequencies of
21164A (EV56) ranged from 366MHz to 666MHz (TDP from 31W to 55W), and
manufacturing started in the summer of 1996. The processor was also produced by
Samsung under a licence agreement signed in June of 1996 (the 666MHz version
was available from Samsung only). It consisted of 9.66 million transistors, had
a die size of 209mm² and required a dual-voltage power supply (2.5V for the
core and 3.3V for input/output circuits).
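Why the lack of byte stores was a nuisance can be shown with a short Python
sketch. Without BWX, writing a single byte meant loading the enclosing
quadword, replacing one byte inside it and storing the whole quadword back (on
Alpha this was done with instructions such as LDQ_U, MSKBL, INSBL and STQ_U).
The function names are hypothetical, memory is modelled as a little-endian
bytearray, and the sketch makes no attempt to reproduce exact instruction
semantics.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def store_byte_without_bwx(memory, addr, value):
    """Store one byte using only aligned 64-bit loads and stores."""
    base = addr & ~7                  # address of the enclosing quadword
    shift = (addr & 7) * 8            # bit position of the target byte
    quad = int.from_bytes(memory[base:base + 8], "little")  # load quadword
    quad &= ~(0xFF << shift) & MASK64                        # clear old byte
    quad |= (value & 0xFF) << shift                          # insert new byte
    memory[base:base + 8] = quad.to_bytes(8, "little")       # store quadword


def store_byte_with_bwx(memory, addr, value):
    """With a byte store (STB in BWX) the same effect is a single access."""
    memory[addr] = value & 0xFF
```

Both functions leave memory in the same state; the difference is one access
against a load, two logical operations and a store.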
21164PC (PCA56) was introduced on the 17th of March 1997. It was a low-cost
version of EV56 designed jointly by DEC and Mitsubishi. S-cache and its
accompanying logic were removed, but I-cache was enlarged from 8Kb to 16Kb. It
was manufactured with the same CMOS6 process as EV56 and required a 2.5V/3.3V
power supply. It consisted of 3.5 million transistors and had a die size of
141mm². Core frequencies of 21164PC (PCA56) ranged from 400MHz to 533MHz (TDP
from 26W to 35W). The form-factor was changed to IPGA-413. There was also a
0.28µ 21164PC (PCA57) manufactured by Samsung. Its I-cache and D-cache were
doubled in size, and its I-cache was made 2-way set associative. The transistor
count rose to 5.7 million, yet the die size shrank to 101mm². It required lower
voltages than PCA56 (2.0V for the core and 2.5V for input/output circuits).
Core frequencies of 21164PC (PCA57) ranged from 533MHz to 666MHz (TDP from 18W
to 23W).
In addition to BWX, which was inherited from EV56, PCA56 and PCA57 supported
a new set of 13 SIMD (Single Instruction, Multiple Data) instructions called
MVI (Motion Video Instructions): packing (PKWB, PKLB), unpacking (UNPKBW,
UNPKBL), choosing the minimum (MINUB8, MINSB8, MINUW4, MINSW4) or maximum
(MAXUB8, MAXSB8, MAXUW4, MAXSW4), and motion estimation (PERR). The most
interesting was the last one, Pixel ERRor, which processed 8 pixels at once.
Unlike the MMX instruction set for i386 processors, which used floating-point
registers to store data, MVI used integer registers for that purpose.
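What PERR and the byte-wise minimum compute can be sketched in plain Python.
The functions below treat a 64-bit integer as 8 packed unsigned bytes, as an
integer register would hold them; they illustrate the arithmetic only and are
not a cycle-accurate or encoding-accurate model of the instructions.

```python
def _bytes_of(value):
    """Yield the 8 unsigned bytes of a 64-bit value, lowest byte first."""
    for i in range(8):
        yield (value >> (8 * i)) & 0xFF


def perr(a, b):
    """Pixel error: sum of absolute differences of 8 packed bytes."""
    return sum(abs(x - y) for x, y in zip(_bytes_of(a), _bytes_of(b)))


def minub8(a, b):
    """Byte-wise unsigned minimum of two packed 64-bit values."""
    result = 0
    for i, (x, y) in enumerate(zip(_bytes_of(a), _bytes_of(b))):
        result |= min(x, y) << (8 * i)
    return result
```

In motion estimation, a sum of absolute differences like this is evaluated for
many candidate block positions, which is why handling 8 pixels per instruction
mattered.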
The first standard system logic set developed for EV5 was DEC Alcor (21171). It
supported a 33MHz system bus, up to 16Mb of B-cache, up to 8Gb of FPM or EDO
DRAM with ECC through a 256-bit data path, and a single 64-bit PCI bus at
33MHz. Support for either the ISA or EISA bus could be added with a standard
bridge such as i82378IB (ISA) or i82375EB (EISA). There was no built-in IDE
controller, although one could be added separately using third-party hardware.
DEC Alcor consisted physically of 5 chips: 1 universal controller with PCI bus
support (Control, I/O and Address, CIA) and 4 data switches (DSW). A new
release of this system logic set for
21164A, i.e. with BWX support added, was called DEC Alcor 2 (21172). It was
soon followed by DEC Pyxis (21174), a single-chip solution supporting a 66MHz
system bus and up to 1.5Gb of 66MHz SDRAM with ECC or parity through a 128-bit
data path, and a single 64-bit PCI bus at 33MHz. There was also VLSI Polaris,
developed for systems based on 21164PC (PCA57).