A Quick Analysis of the NVIDIA Fermi Architecture

Paul V. Bolotoff
Release date: 16th of February 2010
Last modify date: 30th of March 2010


1. Introduction. NVIDIA G80 and G92.
2. NVIDIA GT200 and GT300. Conclusions.

NVIDIA GT200 and GT300

GT200 was based upon G80, though not without significant improvements. The total number of shader pipelines was increased from 128 to 240, gathered into 10 clusters with 3 subclusters each. Like G92, it featured 8 texture load/store units per cluster and supported PCI Express v2.0. Like G80, it relied upon NVIO for output interfaces. Every register unit of GT200 operated with a register file of double size (64Kb), which improved performance with multiple threads and long shaders. Because there were 3 subclusters per cluster, the 1st level texture cache grew from 16Kb to 24Kb per cluster (240Kb in total). Although the number of texture filtering units per cluster didn't change, their performance was improved significantly as well. Last but not least, there were 8 memory channels of 64 bits each (512 bits in total), which implied 8 raster partitions with 4 ROPs each (32 ROPs in total). Such a wide memory interface had first been introduced in AMD/ATI R600, though the following top performance graphics processors by AMD/ATI featured less complicated 256-bit memory designs.
GT200 was supposed to enter the market in November of 2007, but it actually appeared only in June of 2008. Nevertheless, it was a very impressive design of 1.4 billion transistors with an awesome die size of 576mm² on a 65nm TSMC process. Needless to say, it was very expensive to manufacture, so a 55nm die shrink called GT200b or GT206 appeared in January of 2009 with a die size of 470mm². It was also a bit late, as the original schedule pointed to August of 2008. A 40nm version called GT200c or GT212 was never produced. Neither GT200 nor any of its family members supported the complete DirectX 10.1 feature set (Shader Model 4.1). The company also had issues with other 40nm designs which were not nearly as complicated as GT200. In particular, GT214 was sent back for another development cycle and was eventually released as GT215. GT216 and GT218 were delayed several times. It seems obvious that NVIDIA has real problems with 40nm, but instead of solving them as soon as possible, they have placed a large bet on GT300, another monstrous design. A big mistake? Time will tell. In general, GT200 based cards turned out to be expensive, power hungry devices (GeForce GTX 280 was advertised with a 650USD initial target price and 236W TDP), and they reached the market about half a year late. So far, GT300 seems to follow the bad luck of GT200.
There isn't much to say about GT300 (also advertised as GF100) when it comes to architecture and technology because these things are kept pretty much confidential, but some information has surfaced. There are 512 pipelines gathered into 4 clusters, now called graphics processing clusters, and every such cluster is subdivided into 4 subclusters, still called streaming multiprocessors. So, there are 128 pipelines per cluster and 32 pipelines per subcluster. Every subcluster is equipped with 2 warp schedulers, 2 dispatch units, 4 special function units, 16 load/store units, 4 texture units, a register unit with a large 128Kb register file and so on. There are 64Kb of local memory per subcluster which may be user configured as 16Kb of the 1st level cache memory (hardware managed) and 48Kb of shared memory (software managed) or vice versa. As mentioned before, G80 and GT200 have 16Kb of shared memory per subcluster and no true 1st level cache memory. In general, a single cluster of GT300 is more advanced than one of either G80 or GT200. It has been announced that GT300 will have 768Kb of the 2nd level cache memory; to be precise, there will be 128Kb of such cache per memory channel. G80 and GT200 also have the 2nd level cache memory, 32Kb and 64Kb per memory channel respectively, but keep in mind that their shader pipelines cannot make any use of it. GT300 can grant access to the 2nd level cache to both texture units and shader pipelines in read/write mode. As for the memory interface, the first rumours mentioned that it was going to be 512 bits wide, but now we can be sure that there will be 6 memory channels of 64 bits each (384 bits in total) like in G80. The primary memory type for GT300 will be GDDR5 SDRAM as opposed to G80 and GT200 which relied upon GDDR3 SDRAM.
While the ECC implementation for the register file and caches is going to be regular SEC/DED, NVIDIA have developed a proprietary ECC scheme for memory protection: no additional data lines or memory chips will be installed for this purpose because checksums will be stored in reserved portions of regular video memory. GT300 will support the IEEE 754-2008 standard for floating point calculations instead of the older IEEE 754-1985, though there isn't much difference. GT300 also seems to have hardware tessellation logic as a part of the PolyMorph engine. What's even more interesting, there are expected to be as many tessellation units as clusters. Finally, GT300 will make use of NVIO just like G80 and GT200.
NVIDIA GT300 (GF100) block diagram
NVIDIA GT300 (GF100) cluster block diagram

GT300 is expected to consist of 3 billion transistors with a die size of 530mm² on a 40nm TSMC process. Considering that a single 300mm wafer at TSMC costs 5000USD to 6000USD for this process, and given the die size above, GT300 is going to be much more expensive than a 55nm GT200b. Considering additionally the amount of resources spent on development of GT300 and the Fermi architecture as well as low manufacturing yields and delays to the market, it will be a tough job for NVIDIA to generate any reasonable profit out of this project. As for the release schedule, GT300 was originally planned to be supplied in quantity to OEMs in Q3 2009. This was moved to Q4 2009 after some serious design and manufacturing issues which have been kept strictly confidential. Anyway, GT300 has totally failed to hit the Christmas and New Year sales, and the release has been postponed once again to Q1 2010. The latest rumours say that it's going to happen in March of 2010, so let's see.
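The cost argument can be made more concrete with a rough estimate of how many dies fit on a wafer. The sketch below uses a common textbook approximation for gross dies per wafer (wafer area divided by die area, minus an edge-loss term). The approximation, the function names and the yield values are assumptions chosen for illustration; only the wafer diameter, the die size and the wafer price range come from the text above.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Estimate whole dies per wafer with a common edge-loss approximation."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


def cost_per_good_die(wafer_price_usd: float, gross_dies: int, yield_fraction: float) -> float:
    """Spread the wafer price over the dies that actually work."""
    return wafer_price_usd / (gross_dies * yield_fraction)


# From the article: 300mm wafer, 530mm² die, 5000USD to 6000USD per wafer.
gt300_gross_dies = gross_dies_per_wafer(300, 530)
print(f"Gross dies per wafer: {gt300_gross_dies}")

# ASSUMPTION: the yield values below are placeholders, not reported figures.
for wafer_price in (5000, 6000):
    for assumed_yield in (1.0, 0.5, 0.25):
        cost = cost_per_good_die(wafer_price, gt300_gross_dies, assumed_yield)
        print(f"{wafer_price}USD wafer, {assumed_yield:.0%} yield: about {cost:.0f}USD per good die")
```

Under this approximation a 530mm² die yields on the order of a hundred candidates per 300mm wafer, and every halving of the yield doubles the silicon cost of each working chip, which is why poor yields hurt a design of this size so much.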
NVIDIA GT200 and GT300 (GF100) die photographs


It's difficult to make any statements on the performance of GT300. As a matter of fact, it depends mostly upon the strengths and weaknesses of the Fermi architecture as well as the real clock speed of the GT300 shader domain. The latter is expected to be between 1.5GHz and 2.0GHz, though probably closer to the lower limit than to the higher one. Anyway, let's suppose that actual performance will be between 1600 and 3000 gigaFLOPS in single precision floating point or between 800 and 1500 gigaFLOPS in double precision floating point. That's very impressive because the GT200 based Tesla C1060 with its shader domain at 1.3GHz can do 933 gigaFLOPS single precision or 78 gigaFLOPS double precision. The primary competitor, the 40nm RV870 (Cypress) based Radeon HD5870 by AMD/ATI, delivers 2720 gigaFLOPS single precision or 544 gigaFLOPS double precision at 850MHz. The current AMD/ATI top performance product for scientific calculations, the 55nm RV790 based FireStream 9270, can do 1200 gigaFLOPS single precision or 240 gigaFLOPS double precision at 750MHz. It seems apparent that the next RV870 based FireStream will be at least two times faster than model 9270. It is also apparent that future GT300 based products won't gain any serious advantage over RV870 based products in single precision performance but will prevail significantly in double precision. The primary conclusion is that when it comes to computer gaming, GT300 and RV870 will be pretty much equal in terms of performance, but GT300 will be preferred for scientific calculations. Actual prices, power consumption, support quality and so on may shift the decision here and there, as usual.
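The peak figures quoted above follow from a simple product of pipeline count, operations per clock and shader clock speed. Here is a minimal sketch of that arithmetic. The function name and the operations-per-clock values are assumptions: 2 models a single multiply-add per clock, and 3 models a multiply-add plus an extra multiply, which is how the GT200 figure appears to be counted. The 1600 effective pipelines for Radeon HD5870 are taken from the specification table later in the article.

```python
def peak_gflops(pipelines: int, flops_per_clock: int, shader_clock_ghz: float) -> float:
    """Theoretical peak: every pipeline retires flops_per_clock operations per cycle."""
    return pipelines * flops_per_clock * shader_clock_ghz


# GT300 estimates: 512 pipelines at 1.5GHz to 2.0GHz.
# ASSUMPTION: the low end counts 2 flops per clock, the high end 3.
gt300_low = peak_gflops(512, 2, 1.5)
gt300_high = peak_gflops(512, 3, 2.0)

# Reference points from the article.
tesla_c1060 = peak_gflops(240, 3, 1.3)      # quoted as 933 gigaFLOPS
radeon_hd5870 = peak_gflops(1600, 2, 0.85)  # quoted as 2720 gigaFLOPS

for name, value in [
    ("GT300 (low estimate)", gt300_low),
    ("GT300 (high estimate)", gt300_high),
    ("Tesla C1060", tesla_c1060),
    ("Radeon HD5870", radeon_hd5870),
]:
    print(f"{name}: {value:.0f} gigaFLOPS single precision")
```

With these assumptions the GT300 estimates come out close to the 1600 to 3000 gigaFLOPS range given above. The small gap for Tesla C1060 comes from rounding its clock speed to 1.3GHz.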

(continued; 28-Mar-2010)

So, GT300 (or GF100) officially hit the market on the 26th of March. There are two cards released by NVIDIA through their partners, GeForce GTX470 and GeForce GTX480. The most interesting thing is that both of them are based upon GT300 with some units disabled. It seems NVIDIA has faced really poor manufacturing yields, but it's hardly possible for them to delay their Fermi based products any further, so they have made the decision. Well, it's the first time NVIDIA has released a top performance product with masked units. It's a very unpopular move which may cost NVIDIA some reputation. It seems they simply had no other choice: something is better than nothing at all. In terms of competition, GeForce GTX470 is supposed to be an alternative to Radeon HD5850, and GeForce GTX480 is going to hurt Radeon HD5870 sales. See the table below for the cards' specifications. Note that execution units of NVIDIA and AMD/ATI graphics processors cannot be compared by raw numbers due to very different architectures, so both real and approximate effective numbers are shown for AMD/ATI products.
                               | GeForce GTX470 | GeForce GTX480 | Radeon HD5850        | Radeon HD5870
Graphprocessor                 | GT300 (GF100)  | GT300 (GF100)  | RV870                | RV870
Clock speed (core logic)       | 607MHz         | 700MHz         | 725MHz               | 850MHz
Clock speed (shader pipelines) | 1215MHz        | 1400MHz        | 725MHz               | 850MHz
TMUs                           | 56             | 60             | 18 (72 effective)    | (80 effective)
ROPs                           | 40             | 48             | 8 (32 effective)     | (32 effective)
Shader pipelines               | 448            | 480            | 288 (1440 effective) | (1600 effective)
Clock speed (memory) [1]       | 3350MHz        | 3700MHz        | 4000MHz              | 4800MHz
Memory bus width               | 320-bit        | 384-bit        | 256-bit              | 256-bit
Memory size                    | 1280Mb         | 1536Mb         | 1024Mb / 2048Mb      | 1024Mb / 2048Mb
TDP [2]                        | 215W           | 250W           | 151W                 | 188W
Idle power consumption         | ~50W           | ~50W           | ~30W                 | ~30W
MSRP                           | 350USD         | 500USD         | 300USD               | 400USD

[1] effective data transfer speed of GDDR5 SDRAM;
[2] real world peak power consumption may be higher.
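The bus widths and effective memory clock speeds in the table above determine the peak memory bandwidth of each card. The following sketch shows the calculation. The function name is hypothetical, the effective clock speed is treated as millions of transfers per second on every data line, and a gigabyte is taken as 10^9 bytes, which are assumptions of this example.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, effective_rate_mhz: int) -> float:
    """Peak memory bandwidth in GB/s, taking 1 GB as 10^9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mhz / 1000


# (memory bus width in bits, effective memory clock speed in MHz), from the table.
cards = {
    "GeForce GTX470": (320, 3350),
    "GeForce GTX480": (384, 3700),
    "Radeon HD5850": (256, 4000),
    "Radeon HD5870": (256, 4800),
}

for name, (bus_width, rate) in cards.items():
    print(f"{name}: {memory_bandwidth_gb_s(bus_width, rate):.1f} GB/s")
```

The wider bus of the GeForce cards compensates for their lower memory clock speeds, so GeForce GTX480 still ends up with more peak bandwidth than Radeon HD5870.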
As you may have guessed already, GeForce GTX470 is powered by GT300 with 2 subclusters disabled, so minus 64 shader pipelines and 8 TMUs. The memory bus width is only 320 bits, hence one 64-bit memory controller is disabled together with 8 ROPs and 128Kb of the 2nd level cache memory. Considering the low clock speeds, especially the 1.2GHz shader domain, this video card isn't going to fly sky high. On the other hand, NVIDIA will be able to satisfy market demand for GeForce GTX470 cards even with poor manufacturing yields. GT300 chips for GTX480 come with 1 subcluster disabled. That's not good, but there are other things to worry about. First of all, a shader domain clocked at 1.4GHz isn't what most people, including myself, expected from this top performance product. Although I wasn't too optimistic, I expected it to cross the 1.5GHz boundary at least. Another important issue is power consumption. I'm not sure what to make of the 250W TDP reported by NVIDIA. There is some information that the real world peak power consumption of GTX480 is 300W to 320W, and the card gets very hot even when running at default clock speeds: its core temperatures are consistently well over 90°C. Finally, here comes the money question. Radeon HD5870 is priced at 400USD for a 1Gb version, has been available on the market for 6 months, is less power hungry and is about as fast as GeForce GTX480 (plus or minus 10% here and there doesn't make much difference), while the latter is priced at 500USD. Frankly speaking, it doesn't make any sense.
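The unit counts of both cards can be derived from the full GT300 configuration by subtracting the disabled blocks. Here is a small sketch of that bookkeeping. The class and function names are hypothetical, and the figure of 4 TMUs per subcluster is inferred from the statement that 2 disabled subclusters cost 8 TMUs.

```python
from dataclasses import dataclass

# Full GT300 as described in the article.
SUBCLUSTERS = 16                 # 4 clusters with 4 subclusters each
PIPELINES_PER_SUBCLUSTER = 32
TMUS_PER_SUBCLUSTER = 4          # inferred: 2 disabled subclusters cost 8 TMUs
MEMORY_CHANNELS = 6
BITS_PER_CHANNEL = 64
ROPS_PER_CHANNEL = 8
L2_KB_PER_CHANNEL = 128


@dataclass(frozen=True)
class Configuration:
    pipelines: int
    tmus: int
    rops: int
    bus_width_bits: int
    l2_cache_kb: int


def configure(disabled_subclusters: int = 0, disabled_channels: int = 0) -> Configuration:
    """Return the units left after masking subclusters and memory channels."""
    active_subclusters = SUBCLUSTERS - disabled_subclusters
    active_channels = MEMORY_CHANNELS - disabled_channels
    return Configuration(
        pipelines=active_subclusters * PIPELINES_PER_SUBCLUSTER,
        tmus=active_subclusters * TMUS_PER_SUBCLUSTER,
        rops=active_channels * ROPS_PER_CHANNEL,
        bus_width_bits=active_channels * BITS_PER_CHANNEL,
        l2_cache_kb=active_channels * L2_KB_PER_CHANNEL,
    )


gtx470 = configure(disabled_subclusters=2, disabled_channels=1)
gtx480 = configure(disabled_subclusters=1)
print("GeForce GTX470:", gtx470)
print("GeForce GTX480:", gtx480)
```

Both results agree with the specification table: 448 pipelines, 56 TMUs, 40 ROPs and a 320-bit bus for GTX470, and 480 pipelines, 60 TMUs, 48 ROPs and a 384-bit bus for GTX480.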
And one more thing. The only serious advantage GT300 could have over RV870 is outstanding double precision floating point performance. However, GT300 based GeForce cards would then compete strongly against GT300 based Tesla cards in some markets. Keep in mind that superior double precision performance is of little to no importance when playing computer games, encoding or decoding video streams, etc. So, NVIDIA have reduced the double precision floating point performance of GT300 based GeForce cards by 4 times, probably through a software lock (136 gigaFLOPS for GTX470 and 168 gigaFLOPS for GTX480). It's unclear whether this is a temporary solution or not, but currently GeForce GTX470 and GTX480 are much slower in double precision than Radeon HD5850 and HD5870 respectively. Those who need superior double precision performance are kindly advised by NVIDIA to purchase Tesla C2050 or C2070 cards. The first one comes with 3Gb of 384-bit 3600MHz GDDR5 SDRAM (2.625Gb available with ECC enabled), the second one with 6Gb of 384-bit 4000MHz GDDR5 SDRAM (5.250Gb available with ECC enabled). Both of them are powered by GT300 with 2 subclusters disabled (minus 64 shader pipelines). These cards can do double precision floating point calculations at 560 and 628 gigaFLOPS respectively, and single precision floating point calculations at 1120 and 1256 gigaFLOPS respectively. NVIDIA wants 2500USD for C2050 and 4000USD for C2070.
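Both the reduced double precision figures and the memory sizes with ECC enabled can be reproduced with simple arithmetic. The sketch below is an illustration under stated assumptions: one multiply-add (2 operations) per pipeline per clock, double precision at half the single precision rate as the Tesla figures suggest, and one eighth of video memory reserved for ECC checksums, which is inferred from the quoted capacities. The function names are hypothetical.

```python
def geforce_dp_gflops(pipelines: int, shader_clock_ghz: float,
                      dp_to_sp_ratio: float = 0.5, lock_divisor: int = 4) -> float:
    """Double precision peak of a GeForce card after the 4x reduction."""
    # ASSUMPTION: 2 single precision operations per pipeline per clock.
    sp_gflops = pipelines * 2 * shader_clock_ghz
    return sp_gflops * dp_to_sp_ratio / lock_divisor


def ecc_usable_gb(total_gb: float, reserved_fraction: float = 1 / 8) -> float:
    """Memory left for data when checksums occupy part of regular video memory."""
    # ASSUMPTION: the 1/8 share is inferred from 3Gb -> 2.625Gb and 6Gb -> 5.250Gb.
    return total_gb * (1 - reserved_fraction)


# Pipeline counts and shader clock speeds from the specification table.
gtx470_dp = geforce_dp_gflops(448, 1.215)
gtx480_dp = geforce_dp_gflops(480, 1.4)
print(f"GeForce GTX470: {gtx470_dp:.0f} gigaFLOPS double precision")
print(f"GeForce GTX480: {gtx480_dp:.0f} gigaFLOPS double precision")

for card, total in (("Tesla C2050", 3), ("Tesla C2070", 6)):
    print(f"{card}: {ecc_usable_gb(total):.3f}Gb available with ECC enabled")
```

The results agree with the 136 and 168 gigaFLOPS figures quoted above, and with 2.625Gb and 5.250Gb of usable memory on the Tesla cards.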

Copyright (c) Paul V. Bolotoff, 2010. All rights reserved.
A full or partial reprint without a permission received from the author is prohibited.