Although the 21264 (EV6) processor was developed by DEC and was first
mentioned at the Microprocessor Forum in October 1996, the final silicon
implementation was completed in February 1998, when DEC was already in the
process of liquidation. The processor was a significant step forward compared
to EV5, not a tuned-up old design. One of the most important innovations was
out-of-order execution, which required a fundamental core redesign and reduced
the functional units' dependence on cache and main memory bandwidth. EV6 could
reorder up to 80 instructions on the fly, more than competing products. For
instance, Intel's P6 architecture could execute up to 40 micro-operations out
of order, HP PA-8x00 up to 56 instructions, MIPS R12000 up to 48, IBM POWER3
up to 32, Motorola PowerPC G4 up to 5, and Sun UltraSPARC II did not support
instruction reordering at all. Register renaming was also implemented: EV6 had
80 integer and 72 floating-point physical registers, while the number of
architectural (logical) registers remained unchanged at 32 integer and 32
floating-point.
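
To illustrate the idea of register renaming described above, here is a minimal sketch in Python. It is not a model of the actual EV6 hardware: the class `RenameMap` and its methods are hypothetical names, only the register counts (32 architectural, 80 physical integer registers) come from the text, and details such as the hardwired zero register R31 are ignored.

```python
from collections import deque

ARCH_REGS = 32   # architectural integer registers (figure from the text)
PHYS_REGS = 80   # physical integer registers (figure from the text)


class RenameMap:
    """Hypothetical rename table mapping architectural to physical registers."""

    def __init__(self):
        # Assumption: initially architectural register i lives in physical register i.
        self.table = list(range(ARCH_REGS))
        # The remaining physical registers are free to hold new results.
        self.free = deque(range(ARCH_REGS, PHYS_REGS))

    def rename(self, dest, srcs):
        """Rename one instruction; return (new_phys_dest, phys_srcs, old_phys_dest)."""
        # Sources are looked up before the destination mapping is changed.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: the front end would stall")
        old = self.table[dest]
        new = self.free.popleft()
        self.table[dest] = new
        return new, phys_srcs, old

    def retire(self, old_phys):
        """Recycle the previous mapping once the renamed instruction retires."""
        self.free.append(old_phys)


rm = RenameMap()
# r1 = r2 + r3, then r1 = r1 + r4: each write to r1 gets its own physical register,
# so the second write does not have to wait for readers of the first value.
d1, s1, o1 = rm.rename(1, [2, 3])
d2, s2, o2 = rm.rename(1, [1, 4])
```

The second instruction reads the physical register allocated by the first one, which preserves the true data dependence while removing the false dependence on the register name.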
There were 4 integer pipelines, twice as many as in EV5. They were organised
in 2 clusters, each with 2 pipelines and an 80-entry integer register file. The
2 register files held identical (synchronised) contents. The pipelines differed
functionally, however: the 2nd pipeline of the 1st cluster could shift (1-cycle
latency) and multiply (7-cycle latency), while the 2nd pipeline of the 2nd
cluster could shift (1-cycle latency) and execute MVI instructions (3-cycle
latency). The 1st pipeline of each cluster assisted the A-box by calculating
virtual addresses for load/store operations. Apart from that, all 4 integer
pipelines could perform basic arithmetic and logical operations (1-cycle
latency). The A-box itself worked with the I-TLB and D-TLB (128 entries each),
the load and store queues (32 instructions each), and 8 64-byte buffers (the
miss address file) for transactions involving the B-cache and main memory. The
floating-point pipelines differed functionally as well. The 1st pipeline could
add (4-cycle latency), divide (12-cycle latency for single-precision operands
and 15-cycle for double-precision) and calculate square roots (15-cycle and
30-cycle respectively), whereas the 2nd could only multiply (4-cycle latency).
As in EV5, the I-box could decode up to 4 instructions per cycle and dispatch
them into 2 queues: the E-queue for the E-box (20 instructions) and the F-queue
for the F-box (15 instructions).
The C-box was redesigned significantly and supported only 2 cache levels. The
integrated L1 cache consisted of a 64KB I-cache and a 64KB D-cache, both 2-way
set associative with 64-byte lines. The D-cache was write-back, as was the
B-cache, so there was no S-cache at all. The B-cache was inclusive of the
D-cache, though. Because of its larger size, D-cache read/write latency
increased from 2 cycles to 3 cycles (to/from an integer register) and 4 cycles
(to/from a floating-point register). The D-cache remained dual-ported, but
instead of 2 identical write-synchronised parts as in EV5, it consisted of a
single part clocked at double the core frequency. The external B-cache of 1MB
to 16MB, direct-mapped and write-back, was accessed through an independent
bidirectional 128-bit data bus with a 16-bit channel for ECC protection, plus a
unidirectional 20-bit address bus. The B-cache was built of LW (late write)
SSRAM chips, later of DDR (double data rate) SSRAM ones. B-cache speed was
programmable, ranging from 2/3 to 1/8 of the EV6 core frequency. Unlike in
previous generations of Alpha processors, the B-cache was not optional. The
system data bus was only 64 bits wide with an additional 8 bits of ECC
protection, but it could transfer data on both the rising and falling edges of
the clock signal, i.e. it was DDR capable. The system address bus was 44 bits
wide, implemented physically as two 15-bit unidirectional paths, and the system
control bus was 15 bits wide. The basic operating principle of the system bus
changed as well: the bus became dedicated instead of shared, so every processor
had its own path to the system logic set.
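
The cache figures above can be checked with a little arithmetic. The sketch below derives the number of sets and the offset/index bit counts from size, associativity and line size. The helper `cache_geometry` is a hypothetical name, and the 64-byte B-cache line size is an assumption (the text gives the line size only for the L1 caches). The last line shows one reading that is consistent with the 20-bit B-cache address bus: a 16MB cache addressed in 16-byte (128-bit) units.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    All sizes are assumed to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return sets, offset_bits, index_bits


KB = 1024
MB = 1024 * KB

# L1 I-cache or D-cache: 64KB, 2-way, 64-byte lines (figures from the text).
l1 = cache_geometry(64 * KB, ways=2, line_bytes=64)

# External B-cache: direct-mapped (1 way); 64-byte lines are an assumption.
bcache_min = cache_geometry(1 * MB, ways=1, line_bytes=64)
bcache_max = cache_geometry(16 * MB, ways=1, line_bytes=64)

# Number of 16-byte (128-bit) data-bus words in the largest B-cache.
bus_words_max = (16 * MB) // 16
```

Under these assumptions each L1 cache has 512 sets (9 index bits, 6 offset bits), and the largest B-cache holds 2^20 bus-width words.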
The branch prediction logic was redesigned completely. It followed a 2-level
scheme with a local history table of 1024 entries of 10 bits each and a local
predictor of 1024 entries of 3 bits each, coupled with a global predictor of
4096 entries of 3 bits each and a 12-bit history path. The local and global
algorithms worked independently: the local one tracked each individual branch,
while the global one tracked sequences of branches. The chooser analysed the
results of both algorithms and recorded its conclusions in a separate choice
predictor of 4096 entries of 2 bits each, which supplied the preferred decision
when the two predictions differed. This cooperative approach achieved better
results than either algorithm used alone.
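
The scheme described above is commonly called a tournament predictor. The following Python sketch shows how such a predictor could work, using the table sizes and counter widths quoted in the text. It is an illustration under stated assumptions, not the EV6 circuit: the class and helper names are hypothetical, and the indexing by PC bits, the initial counter values and the update order are all assumptions.

```python
def _sat(ctr, taken, top):
    """Saturating counter update: count up on taken, down on not taken."""
    return min(top, ctr + 1) if taken else max(0, ctr - 1)


class TournamentPredictor:
    """Hypothetical local/global predictor pair with a chooser."""

    def __init__(self):
        self.local_hist = [0] * 1024   # 10-bit per-branch histories, indexed by PC bits
        self.local_ctr = [3] * 1024    # 3-bit counters, indexed by a local history
        self.global_ctr = [3] * 4096   # 3-bit counters, indexed by the global history
        self.choice = [1] * 4096       # 2-bit counters: 2 or more prefers the global side
        self.ghist = 0                 # 12-bit history of recent branch outcomes

    def predict(self, pc):
        """Return (final, local, global) predictions as booleans."""
        lh = self.local_hist[(pc >> 2) & 1023]
        local = self.local_ctr[lh] >= 4
        glob = self.global_ctr[self.ghist] >= 4
        use_global = self.choice[self.ghist] >= 2
        return (glob if use_global else local), local, glob

    def update(self, pc, taken):
        """Train all tables with the actual outcome (taken is 0 or 1)."""
        _, local, glob = self.predict(pc)
        idx = (pc >> 2) & 1023
        lh = self.local_hist[idx]
        # The chooser learns only when the two predictors disagree.
        if local != glob:
            self.choice[self.ghist] = _sat(self.choice[self.ghist], glob == taken, 3)
        self.local_ctr[lh] = _sat(self.local_ctr[lh], taken, 7)
        self.global_ctr[self.ghist] = _sat(self.global_ctr[self.ghist], taken, 7)
        # Shift the outcome into both histories, keeping 10 and 12 bits.
        self.local_hist[idx] = ((lh << 1) | taken) & 1023
        self.ghist = ((self.ghist << 1) | taken) & 4095


p = TournamentPredictor()
branch_pc = 0x1000            # arbitrary example address
for _ in range(20):           # a loop branch that is always taken
    p.update(branch_pc, 1)
final, local, glob = p.predict(branch_pc)
```

After the short warm-up both component predictors have learned the always-taken branch, so the final prediction is "taken" regardless of which side the chooser prefers.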
Considering the large number of functional units and other difficulties, the
engineers who developed EV6 decided to redesign the clock subsystem entirely. A
more efficient signal flow allowed the core to reach the frequencies of the
much simpler EV56 core while using almost the same process technology. Overall,
the clock subsystem of EV6 consumed about 32% of the total core power. For
comparison, it was about 25% for EV56, about 37% for EV5 and about 40% for EV4.
EV6 was manufactured using the same process technology as EV56, but with 2
additional metallisation layers. It consisted of 15.2 million transistors
(including about 9 million used for the I-cache, D-cache and branch
predictors), had a die size of 314mm² and required a 2.1V to 2.3V power supply.
21264 (EV6) core frequencies ranged from 466MHz to 600MHz (TDP approx. from 80W
to 110W). Form factor: PGA-587 (Pin Grid Array).
The 21264A (EV67) entered the market at the end of 1999. It was produced by
Samsung using a 0.25µ CMOS process, had a die size of 210mm² and required a
lower supply voltage of 2.0V. There were no significant architectural
differences compared to EV6. 21264A (EV67) core frequencies ranged from 600MHz
to 833MHz (TDP approx. from 70W to 100W), which allowed the Alpha architecture
to regain the lead on integer tasks, lost not long before to the Intel Pentium
III (Coppermine) and AMD Athlon (K7).
The first samples of the 21264B (EV68C) were delivered at the beginning of
2000. This processor was produced by IBM using its own 0.18µ CMOS process with
copper interconnects. Although there were still no architectural differences,
the promising technology allowed core frequencies to rise up to 1250MHz. In
2001, Samsung became able to manufacture the 21264B (EV68A) in quantity using
its own 0.18µ CMOS process, but with aluminium interconnects. Compared to EV67,
the die size was reduced by more than one third (to 125mm²), and the supply
voltage decreased as well (to 1.7V). 21264B (EV68A) core frequencies ranged
from 750MHz to 940MHz (TDP approx. from 60W to 75W). It was announced in
September 1998 that Samsung's EV68 would be implemented in an innovative 0.18µ
FD-SOI (Fully Depleted Silicon-On-Insulator) process with copper interconnects,
so that it would be able to reach 1.5GHz and beyond. Unfortunately, this did
not happen.
Different sources mention the 21264C and 21264D, code-named EV68CB and EV68DC
respectively, manufactured by IBM using the same technology as EV68C and
running within the same frequency range, so they can be considered minor
modifications. The only noticeable difference was a new form factor, the
pinless LGA-675 (Land Grid Array) instead of PGA-587. Apparently, these
processors were installed in Compaq servers only.
Besides BWX and MVI, inherited from the previous generation of Alpha
processors, EV6 implemented a new set of 9 instructions called FIX or FX
(Floating-point eXtension), aimed at square root calculations (SQRTF, SQRTG,
SQRTS, SQRTT), data transfers from integer to floating-point registers (ITOFF,
ITOFS, ITOFT) and from floating-point to integer registers (FTOIS, FTOIT).
Another set of 3 instructions called CIX or CX (Count eXtension) was introduced
in EV67 to facilitate bit counting tasks (CTLZ, CTTZ, CTPOP). Finally, EV6 and
its derivatives featured two prefetching instructions (ECB, WH64) in addition
to FETCH and FETCH_M, which had existed from the beginning of the architecture.
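
To show what the three bit counting instructions compute, here is a small Python sketch of their results on 64-bit values: count leading zeros (CTLZ), count trailing zeros (CTTZ) and population count (CTPOP). The function names simply mirror the mnemonics; returning 64 for a zero input in `ctlz` and `cttz` is an assumption of this sketch.

```python
MASK64 = (1 << 64) - 1   # Alpha integer registers are 64 bits wide


def ctpop(x):
    """Number of 1 bits in the 64-bit value."""
    return bin(x & MASK64).count("1")


def ctlz(x):
    """Number of zero bits above the most significant 1 bit."""
    x &= MASK64
    return 64 - x.bit_length()


def cttz(x):
    """Number of zero bits below the least significant 1 bit."""
    x &= MASK64
    if x == 0:
        return 64                       # assumption: all 64 bits count as zeros
    return (x & -x).bit_length() - 1    # x & -x isolates the lowest set bit


examples = (ctpop(0xFF), ctlz(1), cttz(8))
```

Without such instructions the same results need a loop or a sequence of shifts and masks, which is why dedicated hardware support helps code that works with bit sets.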
There were 2 system logic sets designed initially for the 21264 processors:
DEC Tsunami (21272; also known as Typhoon) and AMD Irongate (AMD-751), though
there could have been many more, considering that both the 21264 and the Athlon
used almost the same system bus, licensed by DEC to AMD.
DEC Tsunami was a highly scalable system logic set. It could be used to
design 1-processor as well as 2-processor and 4-processor systems with a memory
data path ranging from 128 to 512 bits (83MHz registered ECC SDRAM) and with
one or two 33MHz 64-bit PCI buses. Such flexibility was possible because the
system logic set consisted of 3 kinds of components: system bus controllers
(C-chips, one per processor), memory bus controllers (D-chips, one per every 64
bits of the bus width) and PCI bus controllers (P-chips, one per bus needed).
So it is no wonder that some systems (for example, the AlphaPC 264DP) were
equipped with system logic sets consisting of 12 chips.
Although AMD Irongate (AMD-751) was developed to serve as a north bridge on
Athlon-based mainboards, accompanied by the AMD Viper (AMD-756) south bridge or
a compatible one, it was also used in some Alpha mainboards (to be precise, in
the UP1000 and UP1100). Being a single-chip solution, it cost much less than
DEC Tsunami and consumed much less energy. However, it was not the best
solution for the 21264 because it lacked support for multiprocessing and had a
narrow memory data bus (64-bit, up to 768MB of unbuffered ECC SDRAM at 100MHz
in 3 DIMMs with 2 RAS lines each). Nevertheless, Irongate was the first system
logic set for Alpha to support the AGP bus.
In 2001, Samsung introduced the UP1500 mainboard, a single-processor solution
based on the AMD Irongate-2 (AMD-761) north bridge. This mainboard was superior
to the UP1000 and UP1100 in terms of performance thanks to support for much
faster main memory: either up to 4GB of registered ECC DDR SDRAM at 133MHz in 4
DIMMs with 2 RAS lines each, or up to 2GB of unbuffered ECC DDR SDRAM at the
same 133MHz in 2 DIMMs with 2 RAS lines each. The memory data bus remained the
same width, though.