History of general purpose CPUs
The history of general purpose CPUs is a continuation of the earlier
history of computing hardware.
1950s: early designs
Each of the computer designs of the early 1950s was a unique design; there were no upward-compatible machines or computer architectures with multiple, differing implementations. Programs written for one machine would not run on another kind, even other kinds from the same company. This was not a major drawback at the time because there was not a large body of software developed to run on computers, so starting programming from scratch was not seen as a large barrier.
The design freedom of the time was very important, for designers were very constrained by the cost of electronics, yet just beginning to explore how a computer could best be organized. Some of the basic features introduced during this period included index registers (on the Ferranti Mark I), a return-address saving instruction (UNIVAC I), immediate operands (IBM 704), and the detection of invalid operations (IBM 650).
By the end of the 1950s commercial builders had developed factory-constructed, truck-deliverable computers. The most widely installed computer was the IBM 650, which used drum memory onto which programs were loaded using either paper tape or punch cards. Some very high-end machines also included core memory, which provided higher speeds. Hard disks were also starting to become popular.
Computers are automatic abaci. The type of number system affects the way they work. In the early 1950s most computers were built for specific numerical processing tasks, and many machines used decimal numbers as their basic number system – that is, the mathematical functions of the machines worked in base-10 instead of base-2 as is common today. These were not merely binary coded decimal. Most machines actually had ten vacuum tubes per digit in each register. Some early Soviet computer designers implemented systems based on ternary logic; that is, a digit (a "trit" rather than a bit) could have three states: +1, 0, or -1, corresponding to positive, zero, or negative voltage.
An early project for the U.S. Air Force, BINAC, attempted to make a lightweight, simple computer by using binary arithmetic. It deeply impressed the industry.
As late as 1970, major computer languages were unable to standardize their numeric behavior because decimal computers had groups of users too large to alienate.
Even when designers used a binary system, they still had many odd ideas. Some used sign-magnitude arithmetic (-1 = 10001), or ones' complement (-1 = 11110), rather than modern two's complement arithmetic (-1 = 11111). Most computers used six-bit character sets, because they adequately encoded Hollerith cards. It was a major revelation to designers of this period to realize that the data word should be a multiple of the character size. They began to design computers with 12, 24 and 36 bit data words (e.g. see the TX-2).
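The three five-bit encodings of -1 quoted above can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration; the code assumes the value fits in the requested width.

```python
def sign_magnitude(value: int, bits: int) -> str:
    """Top bit is the sign; the remaining bits hold the absolute value."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{bits - 1}b")


def ones_complement(value: int, bits: int) -> str:
    """Negative numbers are the bitwise inverse of the positive pattern."""
    mask = (1 << bits) - 1
    if value >= 0:
        return format(value, f"0{bits}b")
    return format(~abs(value) & mask, f"0{bits}b")


def twos_complement(value: int, bits: int) -> str:
    """Negative numbers wrap around modulo 2**bits."""
    mask = (1 << bits) - 1
    return format(value & mask, f"0{bits}b")


if __name__ == "__main__":
    for encode in (sign_magnitude, ones_complement, twos_complement):
        print(encode.__name__, encode(-1, 5))
```

One consequence visible in the sketch is that sign-magnitude and ones' complement each have two patterns for zero, while two's complement has only one.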
In this era, Grosch's law dominated computer design: computer performance increased as the square of its cost.
1960s: the computer revolution and CISC
One major problem with early computers was that a program for one would not work on others. Computer companies found that their customers had little reason to remain loyal to a particular brand, as the next computer they purchased would be incompatible anyway. At that point, price and performance were usually the only concerns.
In 1962, IBM tried a new approach to designing computers. The plan was to make an entire family of computers that could all run the same software, but with different performances, and at different prices. As users' requirements grew they could move up to larger computers, and still keep all of their investment in programs, data and storage media.
In order to do this they designed a single "reference computer" called the System/360 (or S/360). The System/360 was a virtual computer, a reference instruction set and capabilities that all machines in the family would support. In order to provide different classes of machines, each computer in the family would use more or less hardware emulation, and more or less microprogram emulation, to create a machine capable of running the entire System/360 instruction set.
For instance, a low-end machine could include a very simple processor for low cost. However, this would require the use of a larger microcode emulator to provide the rest of the instruction set, which would slow it down. A high-end machine would use a much more complex processor that could directly process more of the System/360 design, thus running a much simpler and faster emulator.
IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. Even though the computer was complex, its "control store" containing the microprogram would stay relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.
As the S/360 was to be a successor to both scientific machines like the 7090 and data processing machines like the 1401, it needed a design that could reasonably support all forms of processing. Hence the instruction set was designed to manipulate not just simple binary numbers, but text, scientific floating-point (similar to the numbers used in a calculator), and the binary coded decimal arithmetic needed by accounting systems.
Almost all following computers included these innovations in some form. This basic set of features is now called a "complex instruction set computer," or CISC (pronounced "sisk"), a term not invented until many years later.
In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten addressing modes rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and Motorola 68000 architecture are examples of nearly orthogonal instruction sets.
The Burroughs Corporation (which later merged with Sperry/Univac to become Unisys) offered an alternative to S/360 with their B5000 series machines. In 1961, the B5000 had virtual memory, symmetric multiprocessing, and a multi-programming operating system (Master Control Program or MCP) written in ALGOL 60; the industry's first recursive-descent compilers followed as early as 1963.
1970s: Large Scale Integration
In the 1960s, the Apollo guidance computer and Minuteman missile made the integrated circuit economical and practical.
Around 1971, the first calculator and clock chips began to show that very small computers might be possible. The first microprocessor was the 4004, designed in 1971 for a calculator company (Busicom), and produced by Intel. In 1972, Intel introduced a microprocessor having a different architecture: the 8008. The 8008 is the direct ancestor of the current Intel Core 2, even now maintaining code compatibility (every instruction of the 8008's instruction set has a direct equivalent in the Intel Core 2's much larger instruction set, although the opcode values are different).
By the mid-1970s, the use of integrated circuits in computers was commonplace. The whole decade consisted of upheavals caused by the shrinking price of transistors.
It became possible to put an entire CPU on a single printed circuit board. The result was that minicomputers, usually with 16-bit words and 4K to 64K of memory, came to be commonplace.
CISCs were believed to be the most powerful types of computers, because their microcode was small and could be stored in very high-speed memory. The CISC architecture also addressed the "semantic gap" as it was perceived at the time. This was a defined distance between the machine language, and the higher level language people used to program a machine. It was felt that compilers could do a better job with a richer instruction set.
Custom CISCs were commonly constructed using "bit slice" computer logic such as the AMD 2900 chips, with custom microcode. A bit slice component is a piece of an ALU, register file or microsequencer. Most bit-slice integrated circuits were 4 bits wide.
By the early 1970s, the PDP-11 was developed, arguably the most advanced small computer of its day. Almost immediately, wider-word CISCs were introduced, the 32-bit VAX and 36-bit PDP-10.
Also, to control a cruise missile, Intel developed a more-capable version of its 8008 microprocessor, the 8080.
IBM continued to make large, fast computers. However, the definition of large and fast now meant more than a megabyte of RAM, clock speeds near one megahertz [http://www.hometoys.com/mentors/caswell/sep00/trends01.htm] [http://research.microsoft.com/users/GBell/Computer_Structures_Principles_and_Examples/csp0727.htm], and tens of megabytes of disk drives.
IBM's System/370 was a version of the 360 tweaked to run virtual computing environments. The virtual computer was developed in order to reduce the possibility of an unrecoverable software failure.
The Burroughs B5000/B6000/B7000 series reached its largest market share. It was a stack computer whose OS was programmed in a dialect of Algol.
All these different developments competed for market share.
Early 1980s: the lessons of RISC
In the early 1980s, researchers at UC Berkeley and IBM both discovered that most computer language compilers and interpreters used only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler and less orthogonal, they could make it faster and less expensive at the same time.
At the same time, CPU calculation became faster in relation to the time needed for memory accesses. Designers also experimented with using large sets of internal registers. The idea was to cache intermediate results in the registers under the control of the compiler. This also reduced the number of addressing modes and orthogonality.
The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.
A common variation on the RISC design employs the Harvard architecture, as opposed to the Von Neumann or Stored Program architecture common to most other designs. In a Harvard Architecture machine, the program and data occupy separate memory devices and can be accessed simultaneously. In Von Neumann machines the data and programs are mixed in a single memory device, requiring sequential accessing which produces the so-called "Von Neumann bottleneck."
One downside to the RISC design has been that the programs that run on them tend to be larger. This is because compilers have to generate longer sequences of the simpler instructions to accomplish the same results. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the RISC design's fast memory handling.
Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than CISCs. Examples of such compression schemes include ARM's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.
Another approach to RISCs was the MISC, "niladic" or "zero-operand" instruction set. This approach realized that the majority of space in an instruction was to identify the operands of the instruction. These machines placed the operands on a push-down (last-in, first out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than most CISC machines (a rare trait in RISC machines). The Burroughs large systems architecture uses this approach. The B5000 was designed in 1961, long before the term "RISC" was invented. The architecture puts six 8-bit instructions in a 48-bit word, and was a precursor to VLIW design (see below: 1990 to Today).
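The zero-operand idea can be illustrated with a toy interpreter. This is a minimal sketch, not a model of any Burroughs or MISC machine: the opcode names (`fetch`, `store`, `add`, `mul`) and the dictionary standing in for memory are assumptions made for the example. Only the fetch and store instructions name an address; the arithmetic instructions find their operands on the push-down stack.

```python
def run(program, memory):
    """Execute a list of instruction tuples on a push-down stack.

    Hypothetical instruction set: ("fetch", addr), ("store", addr),
    ("add",) and ("mul",). Arithmetic instructions carry no operands.
    """
    stack = []
    for op, *arg in program:
        if op == "fetch":          # push a value from memory
            stack.append(memory[arg[0]])
        elif op == "store":        # pop the top of stack into memory
            memory[arg[0]] = stack.pop()
        elif op == "add":          # operands are implicit: top two entries
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


if __name__ == "__main__":
    # c = (a + b) * a
    program = [("fetch", "a"), ("fetch", "b"), ("add",),
               ("fetch", "a"), ("mul",), ("store", "c")]
    print(run(program, {"a": 3, "b": 4})["c"])
```

Because `add` and `mul` need no operand fields, each can be encoded in a handful of bits, which is where the compact code of such machines comes from.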
The Burroughs architecture was one of the inspirations for Charles H. Moore's Forth programming language, which in turn inspired his later MISC chip designs. For example, his f20 cores had 31 5-bit instructions, which were fit four to a 20-bit word.
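Packing several short opcodes into one memory word is plain bit manipulation, sketched below. The slot order (first instruction in the most significant bits) and the function names are assumptions for illustration; the text does not say how the f20 or the B5000 actually ordered their slots.

```python
def pack(opcodes, width=5):
    """Pack a sequence of opcodes into one integer word.

    Assumption: the first opcode occupies the most significant slot.
    """
    word = 0
    for opcode in opcodes:
        if not 0 <= opcode < (1 << width):
            raise ValueError(f"opcode {opcode} does not fit in {width} bits")
        word = (word << width) | opcode
    return word


def unpack(word, count=4, width=5):
    """Recover `count` opcodes of `width` bits each from a packed word."""
    mask = (1 << width) - 1
    return [(word >> (width * slot)) & mask for slot in reversed(range(count))]


if __name__ == "__main__":
    word = pack([1, 2, 3, 4])            # four 5-bit slots in a 20-bit word
    print(format(word, "020b"), unpack(word))
```

With `width=8` and `count=6` the same two functions describe six 8-bit instructions in a 48-bit word.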
RISC chips now dominate the market for 32-bit embedded systems. Smaller RISC chips are even becoming common in the cost-sensitive 8-bit embedded-system market. The main market for RISC CPUs has been systems that require low power or small size.
Even some CISC processors (based on architectures that were created before RISC became dominant) translate instructions internally into a RISC-like instruction set. These CISC chips include newer
These numbers may surprise many, because the "market" is perceived to be desktop computers. With Intel x86 designs dominating the vast majority of all desktop sales, RISC is found in some of the Apple, Sun and SGI desktop computer lines. However, desktop computers are only a tiny fraction of the computers now sold. Most people in industrialised countries own more computers in embedded systems in their car and house than on their desks.
Mid-1980s to today: exploiting instruction level parallelism
In the mid-to-late 1980s, designers began using a technique known as "instruction pipelining", in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages. MISC processors achieve single-cycle execution of instructions without the need for pipelining.
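The benefit of overlapping stages can be estimated with a back-of-the-envelope sketch. It assumes an idealised pipeline in which every stage takes one cycle and nothing ever stalls; real pipelines lose cycles to dependencies and branches. The function names are hypothetical.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """Idealised pipeline: fill it once, then finish one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


if __name__ == "__main__":
    n, k = 100, 5
    print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

For long instruction streams the ratio of the two counts approaches the number of stages, which is why deeper pipelines were attractive.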
A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic logic units (ALUs). Instead of operating on only one instruction at a time, the CPU will look for several similar instructions that are not dependent on each other, and execute them in parallel. This approach is known as superscalar processor design.
Such techniques are limited by the degree of instruction level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs are able to run very well on superscalar processors due to their inherent high ILP, notably graphics. However, more general problems do not have such high ILP, which lowers the speedups achievable with these techniques.
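How dependencies cap the gain from extra ALUs can be shown with a small scheduling sketch. It is a deliberately idealised model, not a description of any real processor: every instruction takes one cycle, `deps[i]` lists the earlier instructions whose results instruction `i` needs, and an instruction may issue as soon as its inputs are ready and an ALU is free. All names are hypothetical.

```python
def cycles_needed(deps, width):
    """Return the cycles needed to run the instructions on `width` ALUs.

    deps[i] is a list of indices (all smaller than i) that instruction i
    depends on. Each instruction has a latency of one cycle.
    """
    issue_cycle = []   # cycle in which each instruction executes
    issued = {}        # cycle -> number of instructions already placed there
    for needed in deps:
        # Earliest cycle after all producers have finished.
        cycle = max((issue_cycle[j] + 1 for j in needed), default=0)
        # Slide later if every ALU is already busy in that cycle.
        while issued.get(cycle, 0) >= width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        issue_cycle.append(cycle)
    return max(issue_cycle) + 1 if issue_cycle else 0


if __name__ == "__main__":
    independent = [[], [], [], []]      # four unrelated instructions
    chain = [[], [0], [1], [2]]         # each needs the previous result
    for name, deps in (("independent", independent), ("chain", chain)):
        print(name, cycles_needed(deps, 1), cycles_needed(deps, 2))
```

In this model the independent instructions finish twice as fast on two ALUs, while the dependent chain takes the same four cycles regardless of how many ALUs are available.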
Branching is one major culprit. For example, the program might add two numbers and branch to a different code segment if the number is bigger than a third number. In this case even if the branch operation is sent to the second ALU for processing, it still must wait for the results from the addition. It thus runs no faster than if there were only one ALU. The most common solution for this type of problem is to use a type of
Operand register dependencies were found to be another factor limiting the efficiency of the multiple functional units available in superscalar designs. To minimize these dependencies, out-of-order execution of instructions was introduced. In such a scheme, the instruction results which complete out of order must be re-ordered in program order by the processor for the program to be restartable after an exception. Out-of-order execution was the main advancement of the computer industry during the 1990s.
A similar concept is speculative execution, where instructions from one direction of a branch (the predicted direction) are executed before the branch direction is known. When the branch direction is known, the predicted direction and the actual direction are compared. If the predicted direction was correct, the speculatively-executed instructions and their results are kept; if it was incorrect, these instructions and their results are thrown out. Speculative execution coupled with an accurate branch predictor gives a large performance gain.
These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute twelve or more instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.
The resulting instruction scheduling logic of these processors is large, complex and difficult to verify. Furthermore, the higher complexity requires more transistors, increasing power consumption and heat. In this respect RISC is superior because the instructions are simpler, have less interdependence and make superscalar implementations easier. However, as Intel has demonstrated, the concepts can be applied to a CISC design, given enough time and money.
Historical note: Some of these techniques (e.g. pipelining) were originally developed in the late 1950s by IBM on their Stretch mainframe computer.
1990 to today: looking forward
VLIW and EPIC
The instruction scheduling logic that makes a superscalar processor is just boolean logic. In the early 1990s, a significant innovation was to realize that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions.
This type of computer is called a very long instruction word (VLIW) computer.
Statically scheduling the instructions in the compiler (as opposed to letting the processor do the scheduling dynamically) can reduce CPU complexity. This can improve performance, reduce heat, and reduce cost.
Unfortunately, the compiler lacks accurate knowledge of runtime scheduling issues. Merely changing the CPU core frequency multiplier will have an effect on scheduling. Actual operation of the program, as determined by input data, will have major effects on scheduling. To overcome these severe problems a VLIW system may be enhanced by adding the normal dynamic scheduling, losing some of the VLIW advantages.
Static scheduling in the compiler also assumes that dynamically generated code will be uncommon. Prior to the creation of Java, this was in fact true. It was reasonable to assume that slow compiles would only affect software developers. Now, with JIT virtual machines for Java and .NET, slow code generation affects users as well.
There were several unsuccessful attempts to commercialize VLIW. The basic problem is that a VLIW computer does not scale to different price and performance points, as a dynamically scheduled computer can. Another issue is that compiler design for VLIW computers is extremely difficult, and the current crop of compilers (as of 2005) don't always produce optimal code for these platforms.
Also, VLIW computers optimise for throughput, not low latency, so they were not attractive to the engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about compatibility with older software.
In January 2000, a company called Transmeta took the interesting step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software reverse-compatibility provided by popular CISC.
Intel released a chip, called the Itanium, based on what they call an Explicitly Parallel Instruction Computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity, by explicitly providing in each "bundle" of instructions information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions are also backward-compatible with current x86 software by means of an on-chip emulation mode. Integer performance was disappointing and despite improvements, sales in volume markets continue to be low.
Current designs work best when the computer is running only a single program; however, nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires expensive context switching. In contrast, multi-threaded CPUs can handle instructions from multiple programs at once.
To do this, such CPUs include several sets of registers. When a context switch occurs, the contents of the "working registers" are simply copied into one of a set of registers for this purpose.
Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in chip space needed to implement them. This chip space might otherwise be used for some other purpose.
Multi-core CPUs are typically multiple CPU cores on the same die, connected to each other via a shared L2 or L3 cache, an on-die bus, or an on-die crossbar switch. All the CPU cores on the die share interconnect components with which to interface to other processors and the rest of the system. These components may include a front side bus interface, a memory controller to interface with DRAM, a cache coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The terms multi-core and MPU (which stands for Micro-Processor Unit) have come into general usage for a single die that contains multiple CPU cores.
One way to work around the von Neumann bottleneck is to mix a processor and DRAM all on one chip.
* The Berkeley Intelligent RAM (IRAM) project [http://iram.cs.berkeley.edu/]
Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less-critical parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.
Open source processors
As the lines between hardware and software increasingly blur due to progress in design methodology and availability of chips such as FPGAs and cheaper production processes, even open source hardware has begun to appear. Loosely-knit communities like OpenCores have recently announced completely open CPU architectures such as the OpenRISC, which can be readily implemented on FPGAs or in custom produced chips, by anyone, without paying license fees, and even established processor manufacturers like Sun Microsystems have released processor designs (e.g. OpenSPARC) under open-source licenses.
Yet another possibility is the "clockless CPU" (asynchronous CPU). Unlike conventional processors, clockless processors have no central clock to coordinate the progress of data through the pipeline. Instead, stages of the CPU are coordinated using logic devices called "pipeline controls" or "FIFO sequencers." Basically, the pipeline controller clocks the next stage of logic when the existing stage is complete. In this way, a central clock is unnecessary.
It might be easier to implement high performance devices in asynchronous logic as opposed to clocked logic:
* Components can run at different speeds in the clockless CPU. In a clocked CPU, no component can run faster than the clock rate.
* In a clocked CPU, the clock can go no faster than the worst-case performance of the slowest stage. In a clockless CPU, when a stage finishes faster than normal, the next stage can immediately take the results rather than waiting for the next clock tick. A stage might finish faster than normal because of the particular data inputs (multiplication can be very fast if it is multiplying by 0 or 1), or because it is running at a higher voltage or lower temperature than normal.
Asynchronous logic proponents believe these capabilities would have these benefits:
* lower power dissipation for a given performance level
* highest possible execution speeds
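The timing argument made by asynchronous logic proponents can be put in numbers with a simple sketch. It follows a single operation through the stages and assumes zero hand-off overhead between clockless stages, which real designs do not achieve; the delay figures and function names are invented for illustration.

```python
def clocked_latency(worst_case_ps):
    """Every stage waits a full clock period set by the slowest worst case."""
    clock_period = max(worst_case_ps)
    return clock_period * len(worst_case_ps)


def clockless_latency(actual_ps):
    """Each stage hands over its result as soon as it actually finishes."""
    return sum(actual_ps)


if __name__ == "__main__":
    worst_case = [300, 500, 400]   # hypothetical worst-case stage delays (ps)
    this_time = [250, 200, 400]    # hypothetical delays for an easy input
    print(clocked_latency(worst_case), clockless_latency(this_time))
```

In this model the clocked design always pays for the slowest stage's worst case, whereas the clockless design pays only for the delays that actually occur on a given input.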
The biggest disadvantage of the clockless CPU is that most CPU design tools assume a clocked CPU (a synchronous circuit), so making a clockless CPU (designing an asynchronous circuit) involves modifying the design tools to handle clockless logic and doing extra testing to ensure the design avoids metastable problems.
In spite of those disadvantages, several asynchronous CPUs have been built, including:
* ORDVAC and the identical ILLIAC I (1951)
* ILLIAC II (1962), the fastest computer in the world at the time
* The Caltech Asynchronous Microprocessor, the world's first asynchronous microprocessor (1988)
* the ARM-implementing AMULET (1993 and 2000)
* the asynchronous implementation of MIPS R3000, dubbed [http://www.async.caltech.edu/mips.html MiniMIPS] (1998)
* the SEAforth multi-core processor from Charles H. Moore [http://www.intellasys.net/index.php?option=com_content&task=view&id=21&Itemid=41 SEAforth Overview]: "... asynchronous circuit design throughout the chip. There is no central clock with billions of dumb nodes dissipating useless power. ... the processor cores are internally asynchronous themselves."
One interesting possibility would be to eliminate the front side bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.
Another farther-term possibility is to use light instead of electricity for the digital logic itself. In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices. The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable. An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic. The majority of development effort, as of 2006, is focused on electronic circuitry.
See also
* [http://www-128.ibm.com/developerworks/library/pa-microhist.html Great moments in microprocessor history by W. Warner, 2004]
* [http://jbayko.sasktelwebsite.net/cpu.html Great Microprocessors of the Past and Present (V 13.4.0) by: John Bayko, 2003]
Wikimedia Foundation. 2010.