From: neideck@kar.dec.com (Burkhard Neidecker-Lutz)
Newsgroups: comp.arch
Subject: Re: Alpha 21264 info?
Date: 24 Oct 1996 17:44:07 GMT


The title of the talk:

	The 21264: A Superscalar Alpha Processor with Out-of-Order Execution


So, well, Alpha has finally joined the brainiac club. The trick is, we
didn't sacrifice the trademark clock speed while doing so. And while we
were at it, we fixed a couple of nuisances that occasionally give us
surprises with the older Alpha implementations.

The marketing highlights:

     Estimated SPECint95 of 30+, SPECfp95 of 50+
     Much better cache and memory system
     500 MHz+ operation in 0.35 um process
     4-way out-of-order execution
     MPEG2 MP@ML *encode* in real time

Now the details.

Physical:

Same 0.35 um CMOS process as used for the 500 MHz 21164, but two additional
metal layers for power distribution (so it's a 6-layer metal process).
Die size approx. 300 square mm, 15.2 million transistors. Speed bins
starting at 500 MHz, power is 60 watts @ 500 MHz. 588 Pin Grid Array Package.

Logical:

	64 KByte 2-way set-associative instruction cache
	64 KByte 2-way set-associative data cache
	4  Integer Units	(2 of which are also load-store units)
	2  Floating Point Units
	7  Stage Integer Pipeline
	10 Stage Floating Point Pipeline

Branching:

	Next line predictor	(allows branches without fetch bubbles)
				(allows dynamic prediction of computed jumps)
	Set predictor		(allows 2-way associativity at high speed)
	Two level branch predictor (run a 2-bit traditional counter predictor
				    and a global pattern detecting branch
				    predictor in parallel and dynamically
				    pick the one that's right more often)
	Branch predictor about twice as good as the one in the 21164
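
The "run two predictors and pick the one that's right more often" idea can be sketched in a few lines of Python. This is only a toy model under assumed parameters, not the 21264's actual tables: the table sizes, the indexing by PC, the 8-bit global history, and the names `TournamentPredictor`, `bump` and `run_branch` are all made up for illustration.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter (0..3) one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Toy chooser between a per-branch 2-bit counter and a global-history table."""

    def __init__(self, table_bits=10, history_bits=8):  # sizes are assumptions
        self.size = 1 << table_bits
        self.hist_mask = (1 << history_bits) - 1
        self.local = [1] * self.size              # 2-bit counters indexed by PC
        self.glob = [1] * (1 << history_bits)     # 2-bit counters indexed by history
        self.choice = [1] * self.size             # >= 2 means "trust the global one"
        self.history = 0                          # last N branch outcomes as bits

    def predict(self, pc):
        i = pc % self.size
        local_pred = self.local[i] >= 2
        global_pred = self.glob[self.history] >= 2
        return global_pred if self.choice[i] >= 2 else local_pred

    def update(self, pc, taken):
        i = pc % self.size
        local_pred = self.local[i] >= 2
        global_pred = self.glob[self.history] >= 2
        # Train the chooser only when the two predictors disagree.
        if local_pred != global_pred:
            self.choice[i] = bump(self.choice[i], global_pred == taken)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.history] = bump(self.glob[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


def run_branch(predictor, outcomes, pc=0x40):
    """Feed one branch's outcomes through the predictor; return a miss flag per step."""
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return misses
```

On a strictly alternating taken/not-taken branch, the 2-bit counter on its own keeps guessing wrong, while the history-indexed table learns the pattern and the chooser drifts over to it after a short warm-up.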

Out-of-Order execution:

	80 physical integer registers
		- 32 architectural
		-  8 PAL-code shadow
		- 40 rename registers
	72 physical floating point registers
		- 32 architectural
		- 40 rename registers

	20 entry integer queue, quad-issue
	15 entry floating point queue, dual-issue

	Out-of-Order mapper is a 500K transistor structure and is one
	of the critical paths in the chip. 80 entry CAM for mapping
	up to 80 instructions in flight. Backing out to any state takes
	1 cycle.
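
To make the integer register numbers above concrete, here is a minimal renaming sketch: 40 visible registers (32 architectural plus 8 PAL-code shadow) mapped onto 80 physical ones, with a snapshot to back out to an earlier state. The real mapper is a CAM; the list-based table, the free-list order, and the names `RenameMap`, `rename`, `checkpoint` and `restore` are assumptions for illustration, and returning registers to the free list at retirement is left out.

```python
class RenameMap:
    """Toy register renamer: 40 visible registers onto 80 physical ones."""

    def __init__(self, n_visible=40, n_phys=80):
        # Visible registers start out mapped one-to-one; the remaining
        # physical registers (40 of them by default) are free for renaming.
        self.table = list(range(n_visible))
        self.free = list(range(n_visible, n_phys))

    def rename(self, dest, srcs):
        """Map sources through the current table, then give dest a fresh register."""
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register, issue must stall")
        new = self.free.pop(0)
        old = self.table[dest]        # would be freed when the instruction retires
        self.table[dest] = new
        return new, phys_srcs, old

    def checkpoint(self):
        """Snapshot the mapping, e.g. at a predicted branch."""
        return list(self.table), list(self.free)

    def restore(self, snap):
        """Back out to a snapshot, e.g. after a mispredict."""
        self.table, self.free = list(snap[0]), list(snap[1])


m = RenameMap()
snap = m.checkpoint()
print(m.rename(dest=1, srcs=[2, 3]))   # r1 gets a fresh physical register
print(m.rename(dest=4, srcs=[1]))      # the read of r1 sees the renamed register
m.restore(snap)                        # mapping is back to the snapshot
```

With no retirement modelled, the 41st rename in a row raises, which matches the 40 rename registers listed above.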

Integer units:

	4 units:

	add/logic/motion-video/shift/branch
	add/logic/multiply/shift/branch
	add/logic/memory
	add/logic/memory

	In order to get that many register ports, this is implemented
	as two identical copies of an 80 register file with two units
	attaching to each copy. The two register files are kept identical
	with a 1-cycle delay between clusters.

Floating point units:

	add/div/square root
	multiply

	4 cycle latency, fully pipelined. Divide is not pipelined, retires
	6 bits/cycle (compared to 2 bits/cycle in the 21164). The new
	SQRT retires 2 bits/cycle (and also isn't pipelined).

Data Cache, load-store reorder buffers:

	2 loads/stores per cycle, any combination
	implemented as a single-ported 1 GHz cache...
	32 entry load reorder buffer
	32 entry store reorder buffer
	Stores check load buffer to enforce ordering
	Fine grain cache control through cache prefetch instructions

Board level cache:

	L1 Dcache 8+ GByte/sec. sustained, 3 cycle load-to-use (like 21064)
	L2 cache  4+ GByte/sec. sustained, 128 bit separate port,
		  12 cycles load-to-use

	Board level cache can be built in 4 ways from 3 types of SRAM:

		1. No board level cache
		2. 133 MHz Klamath-type Burst-RAM, 2.1 GByte/sec. bandwidth
		3. 250 MHz Late-write SSRAM, 4.0 GByte/sec. bandwidth
		4. 333 MHz Dual-data clock forwarding FSRAM, 5.3 GByte/sec. bw

	The board level cache can be 0, 1, 2, 4, 8 or 16 MByte in size.
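
The bandwidth figures in this section can be checked with simple arithmetic. The sketch below assumes one full 128-bit transfer per quoted MHz for the board level cache, 8-byte accesses for the two Dcache loads/stores per cycle, a 500 MHz core clock, and decimal units (1 GByte = 10^9 bytes); the function name is made up for illustration.

```python
PORT_BYTES = 128 // 8   # 128 bit board level cache port = 16 bytes per transfer


def peak_gbytes_per_sec(mhz, bytes_per_transfer):
    """Peak bandwidth, assuming one transfer per clock at the given rate."""
    return mhz * 1e6 * bytes_per_transfer / 1e9


for name, mhz in [("Burst-RAM", 133), ("Late-write SSRAM", 250), ("FSRAM", 333)]:
    print(f"{name}: {peak_gbytes_per_sec(mhz, PORT_BYTES):.1f} GByte/sec")
# Burst-RAM: 2.1 GByte/sec
# Late-write SSRAM: 4.0 GByte/sec
# FSRAM: 5.3 GByte/sec

# L1 Dcache: 2 accesses per cycle, assumed 8 bytes each, at 500 MHz
print(f"L1 Dcache: {peak_gbytes_per_sec(500, 2 * 8):.1f} GByte/sec")
# L1 Dcache: 8.0 GByte/sec
```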

	
Memory System:

	System Interface 2+ GByte/sec. sustained, 64 bit separate port,
		  80 cycles load-to-use (with Tsunami desktop chip set).

	16 outstanding memory references, 64 bytes each:
		- 8 reads
		- 8 writes

	With Tsunami system chip set and SDRAMs, effective McCalpin
	STREAM bandwidth is 1.6 GByte/sec.
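
As a rough sanity check on the outstanding-reference numbers, the data in flight divided by the latency gives an upper bound on read bandwidth (Little's law). The sketch assumes a 500 MHz core clock, that all 8 reads can be kept in flight the whole time, and that each sees the full 80 cycle load-to-use latency; the function name is made up for illustration.

```python
def latency_bound_gbytes_per_sec(outstanding, bytes_each, latency_cycles, clock_mhz):
    """Upper bound on bandwidth from concurrency: bytes in flight / latency."""
    latency_ns = latency_cycles / clock_mhz * 1e3
    return outstanding * bytes_each / latency_ns   # bytes per ns == GByte/sec


# 8 outstanding reads of 64 bytes each, 80 cycles at 500 MHz (= 160 ns)
print(latency_bound_gbytes_per_sec(8, 64, 80, 500))   # 3.2
```

Under these assumptions the bound is 3.2 GByte/sec, above the 2+ GByte/sec quoted for the system interface, so the depth of the read queue would not be what limits it.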

Availability:

	Samples Q1/97
	Volume  H2/97

So, it's vapor right now, but if you want to sell vapor in 1997 you had
better have damn fast vapor...


Burkhard Neidecker-Lutz

EUROMEDIA - Distributed Multimedia Archives for Cooperative TV Production
CEC Karlsruhe , European Applied Research Center, Digital Equip. Corp.
email: neideck@kar.dec.com 
AlphaStation 500/500: SPECint95 15.0, SPECfp95 20.4