MMX is usually read as either MultiMedia eXtension or Matrix Math eXtension,
although Intel itself treats the name as a trademark rather than an
abbreviation. The technology was developed by Intel to accelerate a wide range
of multimedia tasks, i.e. those related to video and audio processing. It
introduces a new instruction set that performs these tasks directly on a
general-purpose x86 processor. Previously, performance improvements in specific
areas, for instance MPEG video compression and decompression, required
expensive third-party DSP-based hardware. The MMX technology achieves the same
or a comparable effect at no additional hardware cost; only some parts of a
multimedia application need to be rewritten. The primary architects of the
technology are Alex Peleg and Uri Weiser of Intel's Israel Design Center in
Haifa.
The technology was first implemented in the Pentium MMX (P55C) processor in
January 1997 and then in the Pentium II (Klamath) in May 1997. It has also been
supported by competing products such as the AMD K6, Cyrix 6x86MX and IDT
WinChip. In fact, virtually all x86 processors designed since 1997 can execute
MMX-optimised code, and the technology has been widely adopted by multimedia
software developers.
MMX isn't the first technology of its kind, however. The pioneers are the
Intel i860 and Motorola 88110 processors, which introduced limited instruction
extensions to accelerate basic 3D graphics functions between 1989 and 1992.
There are also the Hewlett-Packard PA-7100LC (1994), Sun UltraSPARC (1995) and
MIPS R10000 (1996) processors, as well as their derivatives, which feature
support for MAX-1 (Multimedia Acceleration eXtensions - 1), VIS (Visual
Instruction Set) and MDMX (MIPS Digital Media eXtension) respectively. Of
these, MDMX may be considered the closest relative of MMX, because it is also
intended to accelerate multimedia calculations and uses the register mapping
technique explained below.
The MMX technology follows a vectorised execution approach also known as
SIMD (Single Instruction, Multiple Data). There are 8 new 64-bit MMX registers,
MM0 to MM7, and 47 new MMX instructions (57 by Intel's own count) divided into
7 groups (arithmetic, logical, comparison, shift, conversion, data transfer,
state management). Note, however, that the MMX registers are mapped onto the
mantissa field (i.e. the lower 64 bits) of the respective floating-point
registers, ST(0) to ST(7), which are 80 bits wide in accordance with the double
extended format defined in the IEEE 754 standard. The mapping is to the
physical registers, regardless of the current top of the FP stack. The exponent
field and the sign bit (i.e. the upper 16 bits) are set to all ones on every
MMX write to a particular hardware register, so the register reads as NaN (Not
a Number) or infinity if its MMX data happens to be treated as FP data.
[Figure: i387-compatible floating-point register]
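The register aliasing described above can be sketched in plain C. This is only
a model for illustration, not how the hardware is programmed: the type
fp_reg_model and the functions mmx_write and reads_as_nan_or_infinity are
hypothetical names, and the 80-bit register is split into two fields for
convenience.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 80-bit x87 register: a 64-bit mantissa
   plus a 16-bit field holding the sign bit and the 15-bit exponent. */
typedef struct {
    uint64_t mantissa;      /* bits 63..0, aliased by the MMX register */
    uint16_t sign_exponent; /* bits 79..64 */
} fp_reg_model;

/* Model of an MMX write: the data lands in the mantissa field and the
   upper 16 bits are forced to all ones. */
static void mmx_write(fp_reg_model *r, uint64_t value)
{
    assert(r != NULL);
    r->mantissa = value;
    r->sign_exponent = 0xFFFF;
}

/* An all-ones exponent means the register does not hold a finite number
   when it is interpreted as extended-precision FP data. */
static bool reads_as_nan_or_infinity(const fp_reg_model *r)
{
    assert(r != NULL);
    return (r->sign_exponent & 0x7FFF) == 0x7FFF;
}
```

In this model a register that held an ordinary FP value stops being a finite
number as soon as mmx_write touches it, which is why FP code must not rely on
register contents left behind by MMX code.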
The MMX registers can be addressed directly and in any order, unlike the FP
registers, which are accessed as a stack. Nevertheless, since both the MMX and
FP register sets are implemented on the same hardware registers, MMX and FP
code cannot be freely interleaved. That's a significant drawback, though in
practice most subroutines incorporating MMX code don't need to perform any FP
operations. On the other hand, operating systems don't need to be updated to
support the technology, because the MMX registers are saved and restored during
task switches as if they were the FP registers (see the description of the EMMS
instruction for further information on this matter).
The MMX instructions operate on several new data types: packed byte, packed
word, packed doubleword, and quadword. Since the MMX registers are 64 bits
wide, it takes 8 bytes (a packed byte), 4 words (a packed word), 2 doublewords
(a packed doubleword) or 1 quadword to fill a register. Vectorisation means
that all data elements in a register are processed in parallel. In other words,
if there are 4 words of data in a particular register, they can be processed at
once with a single MMX instruction. Traditional integer or floating-point code
would have to process those data words one by one, which can be considerably
slower. The following example illustrates how the PADDW instruction works.
             63                                                              0
src          |     word7     |     word6     |     word5     |     word4     |
dst          |     word3     |     word2     |     word1     |     word0     |
dst (new)    | word3 + word7 | word2 + word6 | word1 + word5 | word0 + word4 |
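The same operation can be written out as a scalar model in C. This is a sketch
for illustration only; word_lane and paddw_model are hypothetical helper names,
and lane 0 is assumed to be the least significant word of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Extract one of the four 16-bit words packed into a 64-bit value.
   Lane 0 is the least significant word. */
static uint16_t word_lane(uint64_t q, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(q >> (lane * 16));
}

/* Scalar model of PADDW: four independent 16-bit additions, each
   wrapping around modulo 2^16, with no carry between lanes. */
static uint64_t paddw_model(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        /* Converting the sum back to 16 bits is the wrap-around. */
        uint16_t sum = (uint16_t)(word_lane(dst, lane) + word_lane(src, lane));
        result |= (uint64_t)sum << (lane * 16);
    }
    return result;
}
```

The loop performs four additions one after another; the point of the MMX
instruction is that the hardware performs all four at once.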
In addition, the technology provides both wrap-around and saturating
instructions. With wrap-around, a result that overflows or underflows is
truncated and only its lower (least significant) bits are returned. With
saturation, such a result is clipped (saturated) to the limit of the data range
of the given type. For instance, unsigned word calculations saturate to the
maximum possible value (0xFFFF) on overflow and to the minimum possible value
(0x0000) on underflow, while signed word calculations saturate to 0x7FFF and
0x8000 respectively. For some tasks it is very important to choose the right
kind of instruction.
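The difference between the two behaviours can be modelled for a single 16-bit
lane in C. This is a minimal sketch: the function names are hypothetical, and
the instructions named in the comments (PADDW, PADDUSW, PSUBUSW, PADDSW) are
the MMX instructions that each function is meant to imitate for one word.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-around (as in PADDW): keep only the low 16 bits of the sum. */
static uint16_t add_wrap(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Unsigned saturation (as in PADDUSW): clip an overflow to 0xFFFF. */
static uint16_t add_sat_unsigned(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    return sum > 0xFFFF ? 0xFFFF : (uint16_t)sum;
}

/* Unsigned saturation (as in PSUBUSW): clip an underflow to 0x0000. */
static uint16_t sub_sat_unsigned(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* Signed saturation (as in PADDSW): clip to the range -32768..32767. */
static int16_t add_sat_signed(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Adding 0x0020 to 0xFFF0 gives 0x0010 with wrap-around but 0xFFFF with unsigned
saturation. For pixel or audio samples the saturated result is usually the
intended one, since a wrapped value turns a very bright pixel into a very dark
one.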
All MMX instructions have a 1-cycle latency except for the PMUL (Packed
MULtiply: PMULHW, PMULLW) and PMADD (Packed Multiply and ADD: PMADDWD)
instructions, which take 3 cycles to complete. They are fully pipelined,
however, and one can be issued to a functional unit every cycle, so n
independent multiplications ideally complete in about n + 2 cycles rather than
3n. Data transfer instructions may be processed in a single cycle, though
actual execution time depends on where the data resides. EMMS (Exit MultiMedia
State) is the only instruction with an undefined (implementation-dependent)
execution time. The information above applies to Intel processors only; other
designs might behave differently.
The simplest way to find out whether a particular processor supports MMX is
to use the CPUID instruction. Standard function 1 returns the feature flags in
EDX, and bit 23 is set if MMX is supported. Note that CPUID overwrites EAX,
EBX, ECX and EDX. The following code illustrates this procedure:
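movl $1, %eax                 # select CPUID standard function 1
cpuid                         # feature flags are returned in EDX
testl $0x00800000, %edx       # test bit 23 (MMX)
jz .mmx_unsupported           # bit clear: MMX is not available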
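For code written in C rather than assembly, the same check might look like the
sketch below. It assumes GCC or Clang on an x86 target, where the <cpuid.h>
header provides __get_cpuid; the names CPUID_EDX_MMX, edx_reports_mmx and
cpu_has_mmx are hypothetical. On any other compiler or architecture the sketch
simply reports that MMX is unavailable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Bit 23 of EDX, as returned by CPUID standard function 1. */
#define CPUID_EDX_MMX (UINT32_C(1) << 23)

/* Pure bit test, usable on an EDX value obtained by any means. */
static bool edx_reports_mmx(uint32_t edx)
{
    return (edx & CPUID_EDX_MMX) != 0;
}

/* Query the running processor. Returns false when CPUID function 1
   is not available or when the build target is not x86. */
static bool cpu_has_mmx(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_reports_mmx(edx);
#else
    return false;
#endif
}
```

Keeping the bit test separate from the CPUID call makes it possible to check
the mask (0x00800000, the same constant as in the assembly above) without
depending on the processor the code happens to run on.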
No operands are labelled as "first" or "second" anywhere in this reference.
It is less ambiguous to call them "source" and "destination", because the AT&T
and Intel styles of assembly language syntax place these operands in opposite
order: adding MM1 to MM0 is written "paddw %mm1, %mm0" in AT&T style and
"paddw mm0, mm1" in Intel style. Which one to prefer is a matter of taste, but
the AT&T style dominates on UNIX-like systems while the Intel style is
widespread where Windows reigns.
Abbreviations used:
src     source operand
dst     destination operand
mm      MMX register
int32   integer register (32-bit)
imm8    immediate value (8-bit)
mem32   adjacent 32 bits in memory
mem64   adjacent 64 bits in memory