In digital circuits, a flip-flop is a kind of bistable multivibrator, an electronic circuit which has two stable states and thereby is capable of serving as one bit of memory. Today, the term flip-flop has come to generally denote non-transparent (clocked or edge-triggered) devices, while the simpler transparent ones are often referred to as latches.
A flip-flop is controlled by (usually) one or two control signals and/or a gate or clock signal. The output often includes the complement as well as the normal output. As flip-flops are implemented electronically, they require power and ground connections.
Contents
[hide]
* 1 History
* 2 Implementation
* 3 Set-Reset flip-flops (SR flip-flops)
* 4 Toggle flip-flops (T flip-flops)
* 5 JK flip-flop
* 6 D flip-flop
* 7 Master-slave D flip-flop
o 7.1 Edge-triggered D flip-flop
* 8 Uses
* 9 Timing and metastability
* 10 Flip-flop integrated circuits
* 11 See also
* 12 Notes and references
* 13 External links
[edit] History
The first electronic flip-flop was invented in 1919 by William Eccles and F. W. Jordan.[1] It was initially called the Eccles-Jordan trigger circuit and consisted of two active elements (radio-tubes). The name flip-flop was later derived from the sound produced on a speaker connected with one of the back coupled amplifiers output during the trigger process within the circuit. This original electronic flip-flop was transparent - i.e. a simple two-input bistable circuit without any dedicated clock (or even gate) signal - and thus would probably have been labeled as a "latch" today.
[edit] Implementation
Flip-flops can be either simple (transparent) or clocked. Simple flip-flops can be built by two cross-coupled inverting elements – transistors, or NAND, or NOR-gates – perhaps augmented by some enable/disable (gating) mechanism. Clocked devices are specially designed for synchronous (time-discrete) systems and therefore one such device ignores its inputs except at the transition of a dedicated clock signal (known as clocking, pulsing, or strobing). This causes the flip-flop to either change or retain its output signal based upon the values of the input signals at the transition. Some flip-flops change output on the rising edge of the clock, others on the falling edge.
Clocked (non-transparent) flip-flops are typically implemented as master-slave devices[2] where two basic flip-flops (plus some additional logic) collaborate to make it insensitive to spikes and noise between the short clock transitions; they nevertheless also often include asynchronous clear or set inputs which may be used to change the current output independent of the clock.
Flip-flops can be further divided into types that have found common applicability in both asynchronous and clocked sequential systems: the SR ("set-reset"), D ("delay"[3]), T ("toggle"), and JK types are the common ones; all of which may be synthesized from (most) other types by a few logic gates. The behavior of a particular type can be described by what is termed the characteristic equation, which derives the "next" (i.e., after the next clock pulse) output, Qnext, in terms of the input signal(s) and/or the current output, Q.
[edit] Set-Reset flip-flops (SR flip-flops)
Main article: SR latch
The symbol for an SR latch.
The symbol for an SR latch.
The most fundamental latch is the simple SR latch (or simple SR flip-flop), where S and R stand for set and reset. It can be constructed from a pair of cross-coupled NOR (negative OR) logic gates. The stored bit is present on the output marked Q.
Normally, in storage mode, the S and R inputs are both low, and feedback maintains the Q and Q outputs in a constant state, with Q the complement of Q. If S (Set) is pulsed high while R is held low, then the Q output is forced high, and stays high even after S returns low; similarly, if R (Reset) is pulsed high while S is held low, then the Q output is forced low, and stays low even after R returns low.
SR latch operation
S R Action
0 0 Keep state
0 1 Q = 0
1 0 Q = 1
1 1 Unstable combination,
see race condition
[edit] Toggle flip-flops (T flip-flops)
A circuit symbol for a T-type flip-flop, where > is the clock input, T is the toggle input and Q is the stored data output.
A circuit symbol for a T-type flip-flop, where > is the clock input, T is the toggle input and Q is the stored data output.
If the T input is high, the T flip-flop changes state ("toggles") whenever the clock input is strobed. If the T input is low, the flip-flop holds the previous value. This behavior is described by the characteristic equation:
Q_{next} = T \oplus Q (or, without benefit of the XOR operator, the equivalent: Q_{next} = T\overline{Q} + \overline{T}Q )
and can be described in a truth table:
T Q Qnext Comment
0 0 0 hold state(no clk)
0 1 1 hold state(no clk)
1 0 1 toggle
1 1 0 toggle
When T is held high, the toggle flip-flop divides the clock frequency by two; that is, if clock frequency is 4 MHz, the output frequency obtained from the flip-flop will be 2 MHz. This 'divide by' feature has application in various types of digital counters. A T flip-flop can also be built using a JK flip-flop (J & K pins are connected together and act as T) or D flip-flop (T input and Qprevious is connected to the D input through an XOR gate).
[edit] JK flip-flop
JK flip-flop timing diagram
JK flip-flop timing diagram
The JK flip-flop augments the behavior of the SR flip-flop by interpreting the S = R = 1 condition as a "flip" or toggle command. Specifically, the combination J = 1, K = 0 is a command to set the flip-flop; the combination J = 0, K = 1 is a command to reset the flip-flop; and the combination J = K = 1 is a command to toggle the flip-flop, i.e., change its output to the logical complement of its current value. Setting J = K = 0 does NOT result in a D flip-flop, but rather, will hold the current state. To synthesize a D flip-flop, simply set K equal to the complement of J. The JK flip-flop is therefore a universal flip-flop, because it can be configured to work as an SR flip-flop, a D flip-flop, or a T flip-flop. NOTE: The flip flop is positive edge triggered (Clock Pulse) as seen in the timing diagram.
A circuit symbol for a JK flip-flop, where > is the clock input, J and K are data inputs, Q is the stored data output, and Q' is the inverse of Q.
A circuit symbol for a JK flip-flop, where > is the clock input, J and K are data inputs, Q is the stored data output, and Q' is the inverse of Q.
The characteristic equation of the JK flip-flop is:
Q_{next} = J\overline Q + \overline KQ
and the corresponding truth table is:
J K Qnext Comment
0 0 Q_{prev} \ hold state
0 1 0 \ reset
1 0 1 \ set
1 1 \overline{Q_{prev}} toggle
The origin of the name for the JK flip-flop is detailed by P. L. Lindley, a JPL engineer, in a letter to EDN, an electronics design magazine. The letter is dated June 13, 1968, and was published in the August edition of the newsletter. In the letter, Mr. Lindley explains that he heard the story of the JK flip-flop from Dr. Eldred Nelson, who is responsible for coining the term while working at Hughes Aircraft. Flip-flops in use at Hughes at the time were all of the type that came to be known as J-K. In designing a logical system, Dr. Nelson assigned letters to flip-flop inputs as follows: #1: A & B, #2: C & D, #3: E & F, #4: G & H, #5: J & K.
Another theory holds that the set and reset inputs were given the symbols "J" and "K" after one of the engineers that helped design the J-K flip-flop, Jack Kilby.
[edit] D flip-flop
D flip-flop symbol
D flip-flop symbol
The Q output always takes on the state of the D input at the moment of a rising clock edge, and never at any other time. [4] It is called the D flip-flop for this reason, since the output takes the value of the D input or Data input, and Delays it by one clock count. The D flip-flop can be interpreted as a primitive memory cell, zero-order hold, or delay line.
Truth table:
Clock D Q Qprev
Rising edge 0 0 X
Rising edge 1 1 X
Non-Rising X constant
('X' denotes a Don't care condition, meaning the signal is irrelevant)
These flip flops are very useful, as they form the basis for shift registers, which are an essential part of many electronic devices. The advantage of the D flip-flop over the D-type latch is that it "captures" the signal at the moment the clock goes high, and subsequent changes of the data line do not influence Q until the next rising clock edge. An exception is that some flip-flops have a 'reset' signal input, which will reset Q (to zero), and may be either asynchronous or synchronous with the clock.
3-bit shift register
3-bit shift register
The above circuit shifts the contents of the register to the right, one bit position on each active transition of the clock. The input X being shifted into the leftmost bit position.
[edit] Master-slave D flip-flop
A master-slave D flip-flop is created by connecting two gated D latches in series, and inverting the enable input to one of them. It is called master-slave because the second latch in the series only changes in response to a change in the first (master) latch.
A master slave D flip flop. It responds on the negative edge of the enable input (usually a clock).
A master slave D flip flop. It responds on the negative edge of the enable input (usually a clock).
For a positive-edge triggered master-slave D flip-flop, when the clock signal is low (logical 0) the “enable” seen by the first or “master” D latch (the inverted clock signal) is high (logical 1). This allows the “master” latch to store the input value when the clock signal transitions from low to high. As the clock signal goes high (0 to 1) the inverted “enable” of the first latch goes low (1 to 0) and the value seen at the input to the master latch is “locked”. Nearly simultaneously, the twice inverted “enable” of the second or “slave” D latch transitions from low to high (0 to 1) with the clock signal. This allows the signal captured at the rising edge of the clock by the now “locked” master latch to pass through the “slave” latch. When the clock signal returns to low (1 to 0), the output of the "slave" latch is "locked", and the value seen at the last rising edge of the clock is held while the “master” latch begins to accept new values in preparation for the next rising clock edge.
An implementation of a master-slave D flip-flop that is triggered on the positive edge of the clock.
An implementation of a master-slave D flip-flop that is triggered on the positive edge of the clock.
By removing the left-most inverter in the above circuit, a D-type flip flop that strobes on the falling edge of a clock signal can be obtained. This has a truth table like this:
D Q > Qnext
0 X Falling 0
1 X Falling 1
Most D-type flip-flops in ICs have the capability to be set and reset, much like an SR flip-flop. Usually, the illegal S = R = 1 condition is resolved in D-type flip-flops.
Inputs Outputs
S R D > Q Q'
0 1 X X 0 1
1 0 X X 1 0
1 1 X X 1 1
By setting S = R = 0, the flip-flop can be used as described above.
[edit] Edge-triggered D flip-flop
A more efficient way to make a D flip-flop is not as easy to understand, but it works the same way. While the master-slave D flip flop is also triggered on the edge of a clock, its components are each triggered by clock levels. The "edge-triggered D flip flop" does not have the master slave properties.
A positive-edge-triggered D flip-flop.
A positive-edge-triggered D flip-flop.
[edit] Uses
* A single flip-flop can be used to store one bit, or binary digit, of data.
* Static RAM, which is the primary type of memory used in registers to store numbers in computers and in many caches, is built out of flip-flops.
* Any one of the flip-flop types can be used to build any of the others.
* The data contained in several flip-flops may represent the state of a sequencer, the value of a counter, an ASCII character in a computer's memory or any other piece of information.
* One use is to build finite state machines from electronic logic. The flip-flops remember the machine's previous state, and digital logic uses that state to calculate the next state.
* The T flip-flop is useful for constructing various types of counters. Repeated signals to the clock input will cause the flip-flop to change state once per high-to-low transition of the clock input, if its T input is "1". The output from one flip-flop can be fed to the clock input of a second and so on. The final output of the circuit, considered as the array of outputs of all the individual flip-flops, is a count, in binary, of the number of cycles of the first clock input, up to a maximum of 2n-1, where n is the number of flip-flops used. See: Counters
* One of the problems with such a counter (called a ripple counter) is that the output is briefly invalid as the changes ripple through the logic. There are two solutions to this problem. The first is to sample the output only when it is known to be valid. The second, more widely used, is to use a different type of circuit called a synchronous counter. This uses more complex logic to ensure that the outputs of the counter all change at the same, predictable time. See: Counters
* Frequency division: a chain of T flip-flops as described above will also function to divide an input in frequency by 2n, where n is the number of flip-flops used between the input and the output.
[edit] Timing and metastability
A flip-flop in combination with a Schmitt trigger can be used for the implementation of an arbiter in asynchronous circuits.
Clocked flip-flops are prone to a problem called metastability, which happens when a data or control input is changing at the instant of the clock pulse. The result is that the output may behave unpredictably, taking many times longer than normal to settle to its correct state, or even oscillating several times before settling. Theoretically it can take infinite time to settle down. In a computer system this can cause corruption of data or a program crash.
Flip-flop setup, hold and clock-to-output timing parameters.
Flip-flop setup, hold and clock-to-output timing parameters.
The metastability in flip-flops can be avoided by ensuring that the data and control inputs are held valid and constant for specified periods before and after the clock pulse, called the setup time (tsu) and the hold time (th) respectively. These times are specified in the data sheet for the device, and are typically between a few nanoseconds and a few hundred picoseconds for modern devices.
Unfortunately, it is not always possible to meet the setup and hold criteria, because the flip-flop may be connected to a real-time signal that could change at any time, outside the control of the designer. In this case, the best the designer can do is to reduce the probability of error to a certain level, depending on the required reliability of the circuit. One technique for suppressing metastability is to connect two or more flip-flops in a chain, so that the output of each one feeds the data input of the next, and all devices share a common clock. With this method, the probability of a metastable event can be reduced to a negligible value, but never to zero. The probability of metastability gets closer and closer to zero as the number of flip-flops connected in series is increased.
So-called metastable-hardened flip-flops are available, which work by reducing the setup and hold times as much as possible, but even these cannot eliminate the problem entirely. This is because metastability is more than simply a matter of circuit design. When the transitions in the clock and the data are close together in time, the flip-flop is forced to decide which event happened first. However fast we make the device, there is always the possibility that the input events will be so close together that it cannot detect which one happened first. It is therefore logically impossible to build a perfectly metastable-proof flip-flop.
Another important timing value for a flip-flop is the clock-to-output delay (common symbol in data sheets: tCO) or propagation delay (tP), which is the time the flip-flop takes to change its output after the clock edge. The time for a high-to-low transition (tPHL) is sometimes different from the time for a low-to-high transition (tPLH).
When connecting flip-flops in a chain, it is important to ensure that the tCO of the first flip-flop is longer than the hold time (tH) of the second flip-flop, otherwise the second flip-flop will not receive the data reliably. The relationship between tCO and tH is normally guaranteed if both flip-flops are of the same type.
[edit] Flip-flop integrated circuits
Integrated circuit (ICs) exist that provide one or more flip-flops. For example, the 7473 Dual JK Master-Slave Flip-flop or the 74374, an octal D Flip-flop, in the 7400 series.
воскресенье, 11 мая 2008 г.
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
Pipelining assumes that with a single instruction (SISD) concept successive instructions in a program sequence will overlap in execution, as suggested in the next diagram (vertical 'i' instructions, horizontal 't' time).
Most modern CPUs are driven by a clock. The CPU consists internally of logic and flip flops. When the clock signal arrives, the flip flops take their new value and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
Hazards: When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards such as forwarding and stalling exist.
A non-pipeline architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU but making those modules work in parallel improves program execution significantly.
Processors with pipelining are organized inside into stages which can semi-independently work on separate jobs. Each stage is organized and linked into a 'chain' so each stage's output is inputted to another stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
Contents
[hide]
* 1 Advantages and Disadvantages
* 2 Examples
o 2.1 Generic pipeline
+ 2.1.1 Bubble
o 2.2 Example 1
o 2.3 Example 2
* 3 Complications
* 4 See also
* 5 External links
[edit] Advantages and Disadvantages
Pipelining does not help in all cases. There are several disadvantages associated. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
[edit] Examples
[edit] Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
To the right is a generic pipeline with four stages:
1. Fetch
2. Decode
3. Execute
4. Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time Execution
0 Four instructions are awaiting to be executed
1
* the green instruction is fetched from memory
2
* the green instruction is decoded
* the purple instruction is fetched from memory
3
* the green instruction is executed (actual operation is performed)
* the purple instruction is decoded
* the blue instruction is fetched
4
* the green instruction's results are written back to the register file or memory
* the purple instruction is executed
* the blue instruction is decoded
* the red instruction is fetched
5
* the green instruction is completed
* the purple instruction is written back
* the blue instruction is executed
* the red instruction is decoded
6
* The purple instruction is completed
* the blue instruction is written back
* the red instruction is executed
7
* the blue instruction is completed
* the red instruction is written back
8
* the red instruction is completed
9 All instructions are executed
[edit] Bubble
A bubble in cycle 3 delays execution
A bubble in cycle 3 delays execution
Main article: Bubble (computing)
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well but everything "ahead" of the purple instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7.
Bubbles are unlike stalls, in which nothing useful will happen for the fetch, decode, execute and writeback. It can be completed with a nop code.
[edit] Example 1
A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction
The locations 'R1' and 'R2' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into these registers, then added, and the result is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long- load, execute, and store. Each of the steps are called pipeline stages.
On a non-pipelined processor, only one stage can be working at a time so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the execute stage, a second instruction will be at the decode stage and a 3rd instruction will be at the fetch stage.
Pipelining doesn't reduce the time it takes to complete an instruction rather it increases the number of instructions that can be processed at once and it reduces the delay between completed instructions- called 'throughput'. The more pipeline stages a processor has, the more instructions it can be working on at once and the less of a delay there is between completed instructions. Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel AVR and the PIC microcontroller each have a 2 stage pipeline). Intel Pentium 4 processors have 20 stage pipelines.
[edit] Example 2
To better visualize the concept, we can look at a theoretical 3-stage pipeline:
Stage Description
Load Read instruction from memory
Execute Execute instruction
Store Store result in memory and/or registers
and a pseudo-code assembly listing to be executed:
LOAD #40, A ; load 40 in A
MOVE A, B ; copy A in B
ADD #20, B ; add 20 to B
STORE B, 0x300 ; store B into memory cell 0x300
This is how it would be executed:
Clock 1 Load Execute Store
LOAD
The LOAD instruction is fetched from memory.
Clock 2 Load Execute Store
MOVE LOAD
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.
Clock 3 Load Execute Store
ADD MOVE LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the ending of the LOAD instruction.
Clock 4 Load Execute Store
STORE ADD MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.
And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.
[edit] Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (like in the Intel Pentium 4) The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long [1]. The downside of a long pipeline is when a program branches, the entire pipeline must be flushed, a problem that branch predicting helps to alleviate. Branch predicting itself can end up exacerbating the problem if branches are predicted poorly. In certain applications, such as supercomputing, programs are specially written to rarely branch and so very long pipelines are ideal to speed up the computations, as long pipelines are designed to reduce clocks per instruction (CPI). If branching happens constantly, re-ordering branches such that the more likely to be needed instructions are placed into the pipeline can significantly reduce the speed losses associated with having to flush failed branches. Programs such as gcov can be used to examine how often particular branches are actually executed using a technique known as coverage analysis, however such analysis is often a last-resort for optimization.
The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even slightly worse if all but one pipeline stages are idle and a small overhead is present between stages.
Because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates in the code very near the current location of execution may not take effect because they are already loaded into the Prefetch Input Queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.
Pipelining assumes that with a single instruction (SISD) concept successive instructions in a program sequence will overlap in execution, as suggested in the next diagram (vertical 'i' instructions, horizontal 't' time).
Most modern CPUs are driven by a clock. The CPU consists internally of logic and flip flops. When the clock signal arrives, the flip flops take their new value and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
Hazards: When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards such as forwarding and stalling exist.
A non-pipeline architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU but making those modules work in parallel improves program execution significantly.
Processors with pipelining are organized inside into stages which can semi-independently work on separate jobs. Each stage is organized and linked into a 'chain' so each stage's output is inputted to another stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
Contents
[hide]
* 1 Advantages and Disadvantages
* 2 Examples
o 2.1 Generic pipeline
+ 2.1.1 Bubble
o 2.2 Example 1
o 2.3 Example 2
* 3 Complications
* 4 See also
* 5 External links
[edit] Advantages and Disadvantages
Pipelining does not help in all cases. There are several disadvantages associated. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
[edit] Examples
[edit] Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
To the right is a generic pipeline with four stages:
1. Fetch
2. Decode
3. Execute
4. Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time Execution
0 Four instructions are awaiting to be executed
1
* the green instruction is fetched from memory
2
* the green instruction is decoded
* the purple instruction is fetched from memory
3
* the green instruction is executed (actual operation is performed)
* the purple instruction is decoded
* the blue instruction is fetched
4
* the green instruction's results are written back to the register file or memory
* the purple instruction is executed
* the blue instruction is decoded
* the red instruction is fetched
5
* the green instruction is completed
* the purple instruction is written back
* the blue instruction is executed
* the red instruction is decoded
6
* The purple instruction is completed
* the blue instruction is written back
* the red instruction is executed
7
* the blue instruction is completed
* the red instruction is written back
8
* the red instruction is completed
9 All instructions are executed
[edit] Bubble
A bubble in cycle 3 delays execution
A bubble in cycle 3 delays execution
Main article: Bubble (computing)
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well but everything "ahead" of the purple instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7.
Bubbles are unlike stalls, in which nothing useful will happen for the fetch, decode, execute and writeback. It can be completed with a nop code.
[edit] Example 1
A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction
The locations 'R1' and 'R2' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into these registers, then added, and the result is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long- load, execute, and store. Each of the steps are called pipeline stages.
On a non-pipelined processor, only one stage can be working at a time so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the execute stage, a second instruction will be at the decode stage and a 3rd instruction will be at the fetch stage.
Pipelining doesn't reduce the time it takes to complete an instruction rather it increases the number of instructions that can be processed at once and it reduces the delay between completed instructions- called 'throughput'. The more pipeline stages a processor has, the more instructions it can be working on at once and the less of a delay there is between completed instructions. Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel AVR and the PIC microcontroller each have a 2 stage pipeline). Intel Pentium 4 processors have 20 stage pipelines.
[edit] Example 2
To better visualize the concept, we can look at a theoretical 3-stage pipeline:
Stage Description
Load Read instruction from memory
Execute Execute instruction
Store Store result in memory and/or registers
and a pseudo-code assembly listing to be executed:
LOAD #40, A ; load 40 in A
MOVE A, B ; copy A in B
ADD #20, B ; add 20 to B
STORE B, 0x300 ; store B into memory cell 0x300
This is how it would be executed:
Clock 1 Load Execute Store
LOAD
The LOAD instruction is fetched from memory.
Clock 2 Load Execute Store
MOVE LOAD
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.
Clock 3 Load Execute Store
ADD MOVE LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the ending of the LOAD instruction.
Clock 4 Load Execute Store
STORE ADD MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.
And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.
[edit] Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (like in the Intel Pentium 4) The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long [1]. The downside of a long pipeline is when a program branches, the entire pipeline must be flushed, a problem that branch predicting helps to alleviate. Branch predicting itself can end up exacerbating the problem if branches are predicted poorly. In certain applications, such as supercomputing, programs are specially written to rarely branch and so very long pipelines are ideal to speed up the computations, as long pipelines are designed to reduce clocks per instruction (CPI). If branching happens constantly, re-ordering branches such that the more likely to be needed instructions are placed into the pipeline can significantly reduce the speed losses associated with having to flush failed branches. Programs such as gcov can be used to examine how often particular branches are actually executed using a technique known as coverage analysis, however such analysis is often a last-resort for optimization.
The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even slightly worse if all but one pipeline stages are idle and a small overhead is present between stages.
Because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates in the code very near the current location of execution may not take effect because they are already loaded into the Prefetch Input Queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.
CPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
When the processor wishes to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
The diagram on the right shows two memories. Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag which contains the index of the datum in main memory which has been cached. In a CPU's data cache these entries are called cache lines or cache blocks.
Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
However, of all microprocessor units sold (including embedded processors), most of them do not have any cache[1] -- mostly to reduce cost, but sometimes to improve the determinism of a real-time computing system.
Contents
[hide]
* 1 Details of operation
* 2 Associativity
o 2.1 Pseudo-associative cache
* 3 Cache misses
* 4 Address translation
o 4.1 Virtual indexing and virtual aliases
o 4.2 Virtual tags and vhints
o 4.3 Page coloring
* 5 Cache hierarchy in a modern processor
o 5.1 Specialized caches
+ 5.1.1 Victim cache
+ 5.1.2 Trace cache
o 5.2 Multi-level caches
+ 5.2.1 Exclusive versus inclusive
o 5.3 Example: the K8
o 5.4 More hierarchies
* 6 Implementation
* 7 History
o 7.1 History of cache in x86 architecture
* 8 See also
* 9 Notes and references
* 10 External links
[edit] Details of operation
When the processor wishes to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred, otherwise we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a measure of the effectiveness of the cache.
In the case of a cache miss, most caches allocate a new entry, which comprises the tag just missed and a copy of the data from memory. The reference can then be applied to the new entry just as in the case of a hit. Misses are comparatively slow because they require the data to be transferred from main memory. This transfer incurs a delay since main memory is much slower than cache memory, and also incurs the overhead for recording the new data in the cache before it is delivered to the processor.
In order to make room for the new entry on a cache miss, the cache generally has to evict one of the existing entries. The heuristic that it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, especially for hardware caches which use simple rules amenable to implementation in circuitry, so there are a variety of replacement policies to choose from and no perfect way to decide among them. One popular replacement policy, LRU, replaces the least recently used entry.
When data is written to the cache, it must at some point be written to main memory as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to memory. Instead, the cache tracks which locations have been written over (these locations are marked dirty). The data in these locations is written back to main memory when that data is evicted from the cache. For this reason, a miss in a write-back cache will often require two memory accesses to service: one to first write the dirty location to memory and then another to read the new location from memory.
There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so that multiple stores can be processed together (which can reduce bus turnarounds and so improve bus utilization).
The data in main memory being cached may be changed by other entities, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the CPU updates the data in the cache, copies of data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as cache coherence protocols.
The time taken to fetch a datum from memory (read latency) matters because a CPU will often run out of things to do while waiting for the datum. When a CPU reaches this state, it is called a stall. As CPUs become faster, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single datum from memory. Various techniques have been employed to keep the CPU busy during this time. Out-of-order CPUs (Pentium Pro and later Intel's designs, for example) attempt to execute independent instructions after the instruction which is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), or in Intel's terminology hyper-threading (HT), which allows an alternate thread to use the CPU core while a first thread waits for data to come from main memory.
Associativity
Which memory locations can be cached by which cache locations
Which memory locations can be cached by which cache locations
The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise, and are described as set associative. For example, the level-1 data cache in an AMD Athlon is 2-way set associative, which means that any particular location in main memory can be cached in either of 2 locations in the level-1 data cache.
Associativity is a trade-off. If there are ten places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all ten places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the CPU spends less time servicing those misses. The rule of thumb is that doubling the associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate, and are generally done for other reasons (see virtual aliasing, below).
In order of increasing (worse) hit times and decreasing (better) miss rates,
* direct mapped cache -- the best (fastest) hit times, and so the best tradeoff for "large" caches
* 2-way set associative cache
* 2-way skewed associative cache -- "the best tradeoff for .... caches whose sizes are in the range 4K-8K bytes" -- André Seznec[2]
* 4-way set associative cache
* fully associative cache -- the best (lowest) miss rates, and so the best tradeoff when the miss penalty is very high
If each location in main memory can be cached in either of two locations in the cache, one logical question is: which two? The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One good property of this scheme is that the tags stored in the cache do not have to include that part of the main memory address which is implied by the cache memory's index. Since the cache tags are fewer bits, they take less area [on the microprocessor chip] and can be read and compared faster.
One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that datum is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. This datum can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.
Other schemes have been suggested, such as the skewed cache[2], where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function[3]. Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.[4]
[edit] Pseudo-associative cache
A true set-associative cache tests all the possible ways simultaneously, using something like a content addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache is one kind of pseudo-associative cache.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache. [3]
[edit] Cache misses
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
A cache read miss from an instruction cache generally causes the most delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory.
A cache read miss from a data cache usually causes less delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution.
A cache write miss to a data cache generally causes the least delay, because the write can be queued and there are few limitations on the execution of subsequent instructions. The processor can continue until the queue is full.
In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an attempt to find the best combination of size, associativity, block size, and so on. Sequences of memory references performed by benchmark programs are saved as address traces. Subsequent analyses simulate many different possible cache designs on these long address traces. Making sense of how the many variables affect the cache hit rate can be quite confusing. One significant contribution to this analysis was made by Mark Hill, who separated misses into three categories (known as the Three Cs):
* Compulsory misses are those misses caused by the first reference to a datum. Cache size and associativity make no difference to the number of compulsory misses. Prefetching can help here, as can larger cache block sizes (which are a form of prefetching).
* Capacity misses are those misses that occur regardless of associativity or block size, solely due to the finite size of the cache. The curve of capacity miss rate versus cache size gives some measure of the temporal locality of a particular reference stream. Note that there is no useful notion of a cache being "full" or "empty" or "near capacity": CPU caches almost always have nearly every line filled with a copy of some line in main memory, and nearly every allocation of a new line requires the eviction of an old line.
* Conflict misses are those misses that could have been avoided, had the cache not evicted an entry earlier. Conflict misses can be further broken down into mapping misses, that are unavoidable given a particular amount of associativity, and replacement misses, which are due to the particular victim choice of the replacement policy.
Miss rate versus cache size on the Integer portion of SPEC CPU2000
Miss rate versus cache size on the Integer portion of SPEC CPU2000
The graph to the right summarizes the cache performance seen on the Integer portion of the SPEC CPU2000 benchmarks, as collected by Hill and Cantin [1]. These benchmarks are intended to represent the kind of workload that an engineering workstation computer might see on any given day. The reader should keep in mind that finding benchmarks which are even usefully representative of many programs has been very difficult, and there will always be important programs with very different behavior than what is shown here.
We can see the different effects of the three Cs in this graph.
At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to improve a machine's performance on SpecInt2000, increasing the cache size beyond 1 MiB is essentially futile. That's the insight given by the compulsory misses.
The fully-associative cache miss rate here is almost representative of the capacity miss rate. The difference is that the data presented is from simulations assuming an LRU replacement policy. Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that looks into the future to find a cache entry which is actually not going to be hit.
Note that our approximation of the capacity miss rate falls steeply between 32 KiB and 64 KiB. This indicates that the benchmark has a working set of roughly 64 KiB. A CPU cache designer examining this benchmark will have a strong incentive to set the cache size to 64 KiB rather than 32 KiB. Note that, on this benchmark, no amount of associativity can make a 32 KiB cache perform as well as a 64 KiB 4-way, or even a direct-mapped 128 KiB cache.
Finally, note that between 64 KiB and 1 MiB there is a large difference between direct-mapped and fully-associative caches. This difference is the conflict miss rate. The insight from looking at conflict miss rates is that secondary caches benefit a great deal from high associativity.
This benefit was well known in the late 80s and early 90s, when CPU designers could not fit large caches on-chip, and could not get sufficient bandwidth to either the cache data memory or cache tag memory to implement high associativity in off-chip caches. Desperate hacks were attempted: the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had embedded tag comparators and large drivers on the match lines, in order to implement a 4 MiB 4-way associative cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access for both ways took two cycles. To reduce latency, the R10000 would guess which way of the cache would hit on each access.
[edit] Address translation
Main article: translation lookaside buffer
Most general purpose CPUs implement some form of virtual memory. To summarize, each program running on the machine sees its own simplified address space, which contains code and data for that program only. Each program places things in its address space without regard for what other programs are doing in their address spaces.
Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the Translation Lookaside Buffer (TLB), which is a cache of mappings from the operating system's page table. ' For the purposes of the present discussion, there are three important features of address translation:
* Latency: The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
* Aliasing: Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
* Granularity: The virtual address space is broken up into pages. For instance, a 4 GiB virtual address space might be cut up into 1048576 4 KiB pages, each of which can be independently mapped. There may be multiple page sizes supported, see virtual memory for elaboration.
A historical note: the first virtual memory systems were very slow, because they required an access to the page table (held in main memory) before every programmed access to main memory. With no caches, this effectively cut the speed of the machine in half. The first hardware cache used in a computer system was not actually a data or instruction cache, but rather a TLB.
Virtually indexed caches use a portion of the virtual address for their index, which is available earlier than the physical address. If the cache is either direct mapped or virtually tagged, there is no need to consult the MMU to determine which data to feed back to the execution datapath, and so the cache can be very fast (especially if small, as were the primary data caches in the first two generations of the Intel Pentium 4). The speed of this recurrence (the load latency) is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM.
But virtual indexing is not the best choice for all cache levels. It introduces the problem of virtual aliases — the cache may have multiple locations which can store the value of a single physical address. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed.
Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints, as described below.
[edit] Virtual indexing and virtual aliases
The usual way the processor guarantees that virtually aliased addresses act as a single storage location is to arrange that only one virtual alias can be in the cache at any given time.
Whenever a new entry is added to a virtually-indexed cache, the processor searches for any virtual aliases already resident and evicts them first. This special handling happens only during a cache miss. No special work is necessary during a cache hit, which helps keep the fast path fast.
The most straightforward way to find aliases is to arrange for them all to map to the same location in the cache. This happens, for instance, if the TLB has e.g. 4 KiB pages, and the cache is direct mapped and 4 KiB or less.
Modern level-1 caches are much larger than 4 KiB, but virtual memory pages have stayed that size. If the cache is e.g. 16 KiB and virtually indexed, for any virtual address there are four cache locations that could hold the same physical location, but aliased to different virtual addresses. If the cache misses, all four locations must be probed to see if their corresponding physical addresses match the physical address of the access that generated the miss.
These probes are the same checks that a set associative cache uses to select a particular match. So if a 16 KiB virtually indexed cache is 4-way set associative and used with 4 KiB virtual memory pages, no special work is necessary to evict virtual aliases during cache misses because the checks have already happened while checking for a cache hit.
Using the AMD Athlon as an example again, it has a 64 KiB level-1 data cache, 4 KiB pages, and 2-way set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (==64 KiB/4 KiB) possible virtual aliases have already been checked, and seven more cycles through the tag check hardware are necessary to complete the check for virtual aliases.
[edit] Virtual tags and vhints
Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches, they allow the tag match to proceed before the virtual to physical translation is done. However,
* Coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical addresses into a cache index, generally by storing physical tags as well as virtual tags. For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler.
* When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table.
It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache. The operating system makes this guarantee by enforcing page coloring, which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen.
It can be useful to distinguish the two functions of tags in an associative cache: they are used to determine which way of the entry set to select, and they are used to determine if the cache hit or missed. The second function must always be correct, but it is permissible for the first function to guess, and get the wrong answer occasionally.
Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple software interface of a physically tagged cache. It bears the added cost of duplicated tags, however. Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.
The extra area (and some latency) can be mitigated by keeping virtual hints with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag. Like a virtually tagged cache, there may be a virtual hint match but physical tag mismatch, in which case the cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing them from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.
Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and Northwood cores). In these processors the virtual hint is effectively 2 bits, and the cache is 4-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory (CAM) is necessary to select the right one of the four ways fetched.
[edit] Page coloring
Main article: Cache coloring
Large physically indexed caches (usually secondary caches) run into a problem: the operating system rather than the application controls which pages collide with one another in the cache. Differences in page allocation from one program run to the next lead to differences in the cache collision patterns, which can lead to very large differences in program performance. These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run, which then leads to frustrated sales engineers demanding that the operating system authors fix the problem.
To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages. Sequential physical pages map to sequential locations in the cache until after 256 pages the pattern wraps around. We can label each physical page with a color of 0–255 to denote where in the cache it can go. Locations within physical pages with different colors cannot conflict in the cache.
A programmer attempting to make maximum use of the cache may arrange his program's access patterns so that only 1 MiB of data need be cached at any given time, thus avoiding capacity misses. But he should also ensure that the access patterns do not have conflict misses. One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. The programmer can then arrange the access patterns of his code so that no two pages with the same virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop nest optimization), largely coming from the High Performance Computing (HPC) community.
The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color, and then locations from those pages will collide in the cache (this is the birthday paradox).
The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called page coloring. Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.
If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the O/S can flush a page from the cache whenever it changes from one virtual color to another. As mentioned above, this approach was used for some early SPARC and RS/6000 designs.
[edit] Cache hierarchy in a modern processor
Modern processors have multiple interacting caches on chip.
[edit] Specialized caches
Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch (see classic RISC pipeline). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role.
Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard architecture. Originally, this phrase referred to machines with separate instruction and data memories, which proved not at all popular. Most modern CPUs have a single-memory von Neumann architecture.
[edit] Victim cache
A victim cache is a cache used to hold blocks evicted from a CPU cache due to a conflict or capacity miss. The victim cache lies between the main cache and its refill path, and only holds blocks that were evicted from that cache on a miss. This technique is used to reduce the penalty incurred by a cache on a miss.
The original victim cache on the HP PA7200 was a small, fully-associative cache. Later processors, such as the AMD K7 and K8, used the very large secondary cache as a victim cache, to avoid duplicate storage of the contents of the large primary cache.
[edit] Trace cache
One of the more extreme examples of cache specialization is the trace cache found in the Intel Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of instructions that have already been fetched and decoded.
The earliest widely acknowledged academic publication of trace cache was by Eric Rotenberg, Steve Bennett, and James E. Smith in their 1996 paper "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching."
An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel Corp., patent filed March 30, 1994, a continuation of an application filed in 1992, later abandoned.
A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. A basic block consists of a group of non-branch instructions ending with a branch. A dynamic trace ("trace path") contains only instructions whose results are actually used, and eliminates instructions following taken branches (since they are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the instruction fetch unit of a processor to fetch several basic blocks, without having to worry about branches in the execution flow.
Trace lines are stored in the trace cache based on the program counter of the first instruction in the trace and a set of branch predictions. This allows for storing different trace paths that start on the same address, each representing different branch outcomes. In the instruction fetch stage of a pipeline, the current program counter along with a set of branch predictions is checked in the trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new trace starts to be built.
Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-operations, or translations of complex x86 instructions, so that the next time an instruction is needed, it does not have to be decoded again.
See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.
[edit] Multi-level caches
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger slower caches.
Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked.
As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. For example, in 2003, Itanium 2 began shipping with a 6 MiB unified level 3 (L3) cache on-chip. The IBM Power 4 series has a 256 MiB L3 cache off chip, shared among several processors. The new AMD Phenom series of chips carries a 2MB on die L3 cache.
[edit] Exclusive versus inclusive
Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III, and 4), do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy, although the term mainly inclusive has been used.
The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.
One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit). If the secondary cache is an order of magnitude larger than the primary, and the cache data is an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.
As mentioned above, larger computers sometimes have another cache between the L2 cache and main memory called an L3 cache. This cache can be implemented on a separate chip from the CPU, and, as of 2004, may range in size from 2 to 256 megabytes. The benefits of an off chip L3 cache depend on the application's access patterns. High-end x86 workstations and servers are now available with an L3 cache option implemented on the microprocessor die, increasing the speed and reducing the cost substantially. For example, Intel's Xeon MP product code-named "Tulsa" features 16 MiB of on-die L3 cache, shared between two processor cores.
Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)
[edit] Example: the K8
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD Athlon 64 CPU.[5]
Example of hierarchy, the K8
Example of hierarchy, the K8
The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. Each of these caches is specialized:
* The instruction cache keeps copies of 64 byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions).
* The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either 4 or 8 bytes in memory. Each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB, and one to keep PTEs that map 4 MiB or 2 MiB. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs.
* The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
* The data cache keeps copies of 64 byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64 byte line is spread among all 8 banks. Each tag copy handles one of the two accesses per cycle.
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.
The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex branch prediction, with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache.
The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy.
[edit] More hierarchies
Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
These predictors are caches in the sense that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a hit in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
The K8 keeps the instruction and data caches coherent in hardware, which means that a store into an instruction closely following the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS family, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency. The idea is to save hardware complexity on the assumption that self-modifying code is rare.
[edit] Implementation
Cache reads are the most common CPU operation that takes more than a single cycle. Program execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of design effort, and often power and silicon area are expended making the caches as fast as possible.
The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address extracted and used to index an SRAM, which returns the loaded data. The data is byte aligned in a byte shifter, and from there is bypassed to the next operation. There is no need for any tag checking in the inner loop — in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read, and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.
An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.
Read path for a 2-way associative cache
Read path for a 2-way associative cache
The diagram to the right is intended to clarify the manner in which the various fields of the address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the SRAMs, indexing, and multiplexing for a 4 KiB, 2-way set-associative, virtually indexed and virtually tagged cache with 64 B lines, a 32b read width and 32b virtual address.
Because the cache is 4 KiB and has 64 B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21 bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits.
Similarly, because the cache is 4 KiB and has a 4 B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.
A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, 32b read width and 36b physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, vhints are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the vhint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. See Sum addressed decoder.
[edit] History
The early history of cache technology is closely tied to the invention and use of virtual memory.[citation needed] Because of scarcity and cost of semi-conductors memories, early mainframe computers in 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory used by programs. The memory technologies would span semi-conductor, magnetic core, drum and disc. Virtual memory seen and used by programs would be flat and caching would be used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimise the cache sizes. Optimal values were found to depend greatly on the programming language used with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.
The arrival of PCs coincided with a temporary decline in interest in caching. In the early days of PC technology, memory access was only slightly slower than register access. But since the 1980s [6] the performance gap between processor and memory has been growing. Processors have advanced much faster than memory, especially in terms of their operating frequency, so memory became a performance bottleneck. While it was technically possible to have all the main memory as fast as the processor, a more economically viable path has been taken: use plenty of low speed memory, but also introduce a small high speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity—for the same price—with only a slightly reduced combined performance.
[edit] History of cache in x86 architecture
As the x86 CPU architecture reached clock speeds of 20 MHz and above in the 386, small amounts of fast cache memory began to be included in the architecture to boost performance. This was because the DRAM used for main memory had significant latency, up to 120ns, as well as refresh cycles. The cache was constructed from more expensive, yet significantly faster, SRAM, which at the time had latencies around 10ns. The early caches were external to the processor and typically located on the motherboard in the form of 8 or 9 DIP memory chips placed in sockets to enable the cache as an optional extra or upgrade feature.
Some versions of the Intel 386 type processor could support 16 to 64 KiB of external cache.
With the 486 processor an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2) cache. These on-motherboard caches were much larger, with the most common size being 256 KiB and frequently utilizing a SIMM form factor. The popularity of on-motherboard cache continued on through the Pentium MMX era but was made obsolete by the introduction of SDRAM and the growing disparity between bus speed and CPU clock speed, which caused on-motherboard cache to be only slightly faster than main memory.
The next evolution of the x86 architecture starting with the Pentium Pro brought the secondary cache into the CPU, clocked at or slightly slower than the CPU frequency and level 1 cache.
When the processor wishes to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
The diagram on the right shows two memories. Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag which contains the index of the datum in main memory which has been cached. In a CPU's data cache these entries are called cache lines or cache blocks.
Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
However, of all microprocessor units sold (including embedded processors), most of them do not have any cache[1] -- mostly to reduce cost, but sometimes to improve the determinism of a real-time computing system.
Contents
[hide]
* 1 Details of operation
* 2 Associativity
o 2.1 Pseudo-associative cache
* 3 Cache misses
* 4 Address translation
o 4.1 Virtual indexing and virtual aliases
o 4.2 Virtual tags and vhints
o 4.3 Page coloring
* 5 Cache hierarchy in a modern processor
o 5.1 Specialized caches
+ 5.1.1 Victim cache
+ 5.1.2 Trace cache
o 5.2 Multi-level caches
+ 5.2.1 Exclusive versus inclusive
o 5.3 Example: the K8
o 5.4 More hierarchies
* 6 Implementation
* 7 History
o 7.1 History of cache in x86 architecture
* 8 See also
* 9 Notes and references
* 10 External links
[edit] Details of operation
When the processor wishes to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred, otherwise we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a measure of the effectiveness of the cache.
In the case of a cache miss, most caches allocate a new entry, which comprises the tag just missed and a copy of the data from memory. The reference can then be applied to the new entry just as in the case of a hit. Misses are comparatively slow because they require the data to be transferred from main memory. This transfer incurs a delay since main memory is much slower than cache memory, and also incurs the overhead for recording the new data in the cache before it is delivered to the processor.
In order to make room for the new entry on a cache miss, the cache generally has to evict one of the existing entries. The heuristic that it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, especially for hardware caches which use simple rules amenable to implementation in circuitry, so there are a variety of replacement policies to choose from and no perfect way to decide among them. One popular replacement policy, LRU, replaces the least recently used entry.
When data is written to the cache, it must at some point be written to main memory as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to memory. Instead, the cache tracks which locations have been written over (these locations are marked dirty). The data in these locations is written back to main memory when that data is evicted from the cache. For this reason, a miss in a write-back cache will often require two memory accesses to service: one to first write the dirty location to memory and then another to read the new location from memory.
There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so that multiple stores can be processed together (which can reduce bus turnarounds and so improve bus utilization).
The data in main memory being cached may be changed by other entities, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the CPU updates the data in the cache, copies of data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as cache coherence protocols.
The time taken to fetch a datum from memory (read latency) matters because a CPU will often run out of things to do while waiting for the datum. When a CPU reaches this state, it is called a stall. As CPUs become faster, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single datum from memory. Various techniques have been employed to keep the CPU busy during this time. Out-of-order CPUs (Pentium Pro and later Intel's designs, for example) attempt to execute independent instructions after the instruction which is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), or in Intel's terminology hyper-threading (HT), which allows an alternate thread to use the CPU core while a first thread waits for data to come from main memory.
Associativity
Which memory locations can be cached by which cache locations
Which memory locations can be cached by which cache locations
The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise, and are described as set associative. For example, the level-1 data cache in an AMD Athlon is 2-way set associative, which means that any particular location in main memory can be cached in either of 2 locations in the level-1 data cache.
Associativity is a trade-off. If there are ten places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all ten places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the CPU spends less time servicing those misses. The rule of thumb is that doubling the associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate, and are generally done for other reasons (see virtual aliasing, below).
In order of increasing (worse) hit times and decreasing (better) miss rates,
* direct mapped cache -- the best (fastest) hit times, and so the best tradeoff for "large" caches
* 2-way set associative cache
* 2-way skewed associative cache -- "the best tradeoff for .... caches whose sizes are in the range 4K-8K bytes" -- André Seznec[2]
* 4-way set associative cache
* fully associative cache -- the best (lowest) miss rates, and so the best tradeoff when the miss penalty is very high
If each location in main memory can be cached in either of two locations in the cache, one logical question is: which two? The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One good property of this scheme is that the tags stored in the cache do not have to include that part of the main memory address which is implied by the cache memory's index. Since the cache tags are fewer bits, they take less area [on the microprocessor chip] and can be read and compared faster.
One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that datum is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.
The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. This datum can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.
Other schemes have been suggested, such as the skewed cache[2], where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function[3]. Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.[4]
[edit] Pseudo-associative cache
A true set-associative cache tests all the possible ways simultaneously, using something like a content addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache is one kind of pseudo-associative cache.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache. [3]
[edit] Cache misses
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
A cache read miss from an instruction cache generally causes the most delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory.
A cache read miss from a data cache usually causes less delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution.
A cache write miss to a data cache generally causes the least delay, because the write can be queued and there are few limitations on the execution of subsequent instructions. The processor can continue until the queue is full.
In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an attempt to find the best combination of size, associativity, block size, and so on. Sequences of memory references performed by benchmark programs are saved as address traces. Subsequent analyses simulate many different possible cache designs on these long address traces. Making sense of how the many variables affect the cache hit rate can be quite confusing. One significant contribution to this analysis was made by Mark Hill, who separated misses into three categories (known as the Three Cs):
* Compulsory misses are those misses caused by the first reference to a datum. Cache size and associativity make no difference to the number of compulsory misses. Prefetching can help here, as can larger cache block sizes (which are a form of prefetching).
* Capacity misses are those misses that occur regardless of associativity or block size, solely due to the finite size of the cache. The curve of capacity miss rate versus cache size gives some measure of the temporal locality of a particular reference stream. Note that there is no useful notion of a cache being "full" or "empty" or "near capacity": CPU caches almost always have nearly every line filled with a copy of some line in main memory, and nearly every allocation of a new line requires the eviction of an old line.
* Conflict misses are those misses that could have been avoided, had the cache not evicted an entry earlier. Conflict misses can be further broken down into mapping misses, that are unavoidable given a particular amount of associativity, and replacement misses, which are due to the particular victim choice of the replacement policy.
Miss rate versus cache size on the Integer portion of SPEC CPU2000
Miss rate versus cache size on the Integer portion of SPEC CPU2000
The graph to the right summarizes the cache performance seen on the Integer portion of the SPEC CPU2000 benchmarks, as collected by Hill and Cantin [1]. These benchmarks are intended to represent the kind of workload that an engineering workstation computer might see on any given day. The reader should keep in mind that finding benchmarks which are even usefully representative of many programs has been very difficult, and there will always be important programs with very different behavior than what is shown here.
We can see the different effects of the three Cs in this graph.
At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to improve a machine's performance on SpecInt2000, increasing the cache size beyond 1 MiB is essentially futile. That's the insight given by the compulsory misses.
The fully-associative cache miss rate here is almost representative of the capacity miss rate. The difference is that the data presented is from simulations assuming an LRU replacement policy. Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that looks into the future to find a cache entry which is actually not going to be hit.
Note that our approximation of the capacity miss rate falls steeply between 32 KiB and 64 KiB. This indicates that the benchmark has a working set of roughly 64 KiB. A CPU cache designer examining this benchmark will have a strong incentive to set the cache size to 64 KiB rather than 32 KiB. Note that, on this benchmark, no amount of associativity can make a 32 KiB cache perform as well as a 64 KiB 4-way, or even a direct-mapped 128 KiB cache.
Finally, note that between 64 KiB and 1 MiB there is a large difference between direct-mapped and fully-associative caches. This difference is the conflict miss rate. The insight from looking at conflict miss rates is that secondary caches benefit a great deal from high associativity.
This benefit was well known in the late 80s and early 90s, when CPU designers could not fit large caches on-chip, and could not get sufficient bandwidth to either the cache data memory or cache tag memory to implement high associativity in off-chip caches. Desperate hacks were attempted: the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had embedded tag comparators and large drivers on the match lines, in order to implement a 4 MiB 4-way associative cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access for both ways took two cycles. To reduce latency, the R10000 would guess which way of the cache would hit on each access.
[edit] Address translation
Main article: translation lookaside buffer
Most general purpose CPUs implement some form of virtual memory. To summarize, each program running on the machine sees its own simplified address space, which contains code and data for that program only. Each program places things in its address space without regard for what other programs are doing in their address spaces.
Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the Translation Lookaside Buffer (TLB), which is a cache of mappings from the operating system's page table. ' For the purposes of the present discussion, there are three important features of address translation:
* Latency: The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
* Aliasing: Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
* Granularity: The virtual address space is broken up into pages. For instance, a 4 GiB virtual address space might be cut up into 1048576 4 KiB pages, each of which can be independently mapped. There may be multiple page sizes supported, see virtual memory for elaboration.
A historical note: the first virtual memory systems were very slow, because they required an access to the page table (held in main memory) before every programmed access to main memory. With no caches, this effectively cut the speed of the machine in half. The first hardware cache used in a computer system was not actually a data or instruction cache, but rather a TLB.
Virtually indexed caches use a portion of the virtual address for their index, which is available earlier than the physical address. If the cache is either direct mapped or virtually tagged, there is no need to consult the MMU to determine which data to feed back to the execution datapath, and so the cache can be very fast (especially if small, as were the primary data caches in the first two generations of the Intel Pentium 4). The speed of this recurrence (the load latency) is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM.
But virtual indexing is not the best choice for all cache levels. It introduces the problem of virtual aliases — the cache may have multiple locations which can store the value of a single physical address. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed.
Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints, as described below.
[edit] Virtual indexing and virtual aliases
The usual way the processor guarantees that virtually aliased addresses act as a single storage location is to arrange that only one virtual alias can be in the cache at any given time.
Whenever a new entry is added to a virtually-indexed cache, the processor searches for any virtual aliases already resident and evicts them first. This special handling happens only during a cache miss. No special work is necessary during a cache hit, which helps keep the fast path fast.
The most straightforward way to find aliases is to arrange for them all to map to the same location in the cache. This happens, for instance, if the TLB has e.g. 4 KiB pages, and the cache is direct mapped and 4 KiB or less.
Modern level-1 caches are much larger than 4 KiB, but virtual memory pages have stayed that size. If the cache is e.g. 16 KiB and virtually indexed, for any virtual address there are four cache locations that could hold the same physical location, but aliased to different virtual addresses. If the cache misses, all four locations must be probed to see if their corresponding physical addresses match the physical address of the access that generated the miss.
These probes are the same checks that a set associative cache uses to select a particular match. So if a 16 KiB virtually indexed cache is 4-way set associative and used with 4 KiB virtual memory pages, no special work is necessary to evict virtual aliases during cache misses because the checks have already happened while checking for a cache hit.
Using the AMD Athlon as an example again, it has a 64 KiB level-1 data cache, 4 KiB pages, and 2-way set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (==64 KiB/4 KiB) possible virtual aliases have already been checked, and seven more cycles through the tag check hardware are necessary to complete the check for virtual aliases.
[edit] Virtual tags and vhints
Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches, they allow the tag match to proceed before the virtual to physical translation is done. However,
* Coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical addresses into a cache index, generally by storing physical tags as well as virtual tags. For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler.
* When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table.
It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache. The operating system makes this guarantee by enforcing page coloring, which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen.
It can be useful to distinguish the two functions of tags in an associative cache: they are used to determine which way of the entry set to select, and they are used to determine if the cache hit or missed. The second function must always be correct, but it is permissible for the first function to guess, and get the wrong answer occasionally.
Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple software interface of a physically tagged cache. It bears the added cost of duplicated tags, however. Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.
The extra area (and some latency) can be mitigated by keeping virtual hints with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag. Like a virtually tagged cache, there may be a virtual hint match but physical tag mismatch, in which case the cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing them from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.
Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and Northwood cores). In these processors the virtual hint is effectively 2 bits, and the cache is 4-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory (CAM) is necessary to select the right one of the four ways fetched.
[edit] Page coloring
Main article: Cache coloring
Large physically indexed caches (usually secondary caches) run into a problem: the operating system rather than the application controls which pages collide with one another in the cache. Differences in page allocation from one program run to the next lead to differences in the cache collision patterns, which can lead to very large differences in program performance. These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run, which then leads to frustrated sales engineers demanding that the operating system authors fix the problem.
To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages. Sequential physical pages map to sequential locations in the cache until after 256 pages the pattern wraps around. We can label each physical page with a color of 0–255 to denote where in the cache it can go. Locations within physical pages with different colors cannot conflict in the cache.
A programmer attempting to make maximum use of the cache may arrange his program's access patterns so that only 1 MiB of data need be cached at any given time, thus avoiding capacity misses. But he should also ensure that the access patterns do not have conflict misses. One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. The programmer can then arrange the access patterns of his code so that no two pages with the same virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop nest optimization), largely coming from the High Performance Computing (HPC) community.
The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color, and then locations from those pages will collide in the cache (this is the birthday paradox).
The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called page coloring. Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.
If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the O/S can flush a page from the cache whenever it changes from one virtual color to another. As mentioned above, this approach was used for some early SPARC and RS/6000 designs.
[edit] Cache hierarchy in a modern processor
Modern processors have multiple interacting caches on chip.
[edit] Specialized caches
Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch (see classic RISC pipeline). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role.
Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard architecture. Originally, this phrase referred to machines with separate instruction and data memories, which proved not at all popular. Most modern CPUs have a single-memory von Neumann architecture.
[edit] Victim cache
A victim cache is a cache used to hold blocks evicted from a CPU cache due to a conflict or capacity miss. The victim cache lies between the main cache and its refill path, and only holds blocks that were evicted from that cache on a miss. This technique is used to reduce the penalty incurred by a cache on a miss.
The original victim cache on the HP PA7200 was a small, fully-associative cache. Later processors, such as the AMD K7 and K8, used the very large secondary cache as a victim cache, to avoid duplicate storage of the contents of the large primary cache.
[edit] Trace cache
One of the more extreme examples of cache specialization is the trace cache found in the Intel Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of instructions that have already been fetched and decoded.
The earliest widely acknowledged academic publication of trace cache was by Eric Rotenberg, Steve Bennett, and James E. Smith in their 1996 paper "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching."
An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel Corp., patent filed March 30, 1994, a continuation of an application filed in 1992, later abandoned.
A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. A basic block consists of a group of non-branch instructions ending with a branch. A dynamic trace ("trace path") contains only instructions whose results are actually used, and eliminates instructions following taken branches (since they are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the instruction fetch unit of a processor to fetch several basic blocks, without having to worry about branches in the execution flow.
Trace lines are stored in the trace cache based on the program counter of the first instruction in the trace and a set of branch predictions. This allows for storing different trace paths that start on the same address, each representing different branch outcomes. In the instruction fetch stage of a pipeline, the current program counter along with a set of branch predictions is checked in the trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new trace starts to be built.
Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-operations, or translations of complex x86 instructions, so that the next time an instruction is needed, it does not have to be decoded again.
See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.
[edit] Multi-level caches
Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger slower caches.
Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked.
As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. For example, in 2003, Itanium 2 began shipping with a 6 MiB unified level 3 (L3) cache on-chip. The IBM Power 4 series has a 256 MiB L3 cache off chip, shared among several processors. The new AMD Phenom series of chips carries a 2MB on die L3 cache.
[edit] Exclusive versus inclusive
Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III, and 4), do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy, although the term mainly inclusive has been used.
The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.
One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit). If the secondary cache is an order of magnitude larger than the primary, and the cache data is an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.
As mentioned above, larger computers sometimes have another cache between the L2 cache and main memory called an L3 cache. This cache can be implemented on a separate chip from the CPU, and, as of 2004, may range in size from 2 to 256 megabytes. The benefits of an off chip L3 cache depend on the application's access patterns. High-end x86 workstations and servers are now available with an L3 cache option implemented on the microprocessor die, increasing the speed and reducing the cost substantially. For example, Intel's Xeon MP product code-named "Tulsa" features 16 MiB of on-die L3 cache, shared between two processor cores.
Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)
[edit] Example: the K8
To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD Athlon 64 CPU.[5]
Example of hierarchy, the K8
Example of hierarchy, the K8
The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. Each of these caches is specialized:
* The instruction cache keeps copies of 64 byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions).
* The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either 4 or 8 bytes in memory. Each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB, and one to keep PTEs that map 4 MiB or 2 MiB. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs.
* The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
* The data cache keeps copies of 64 byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64 byte line is spread among all 8 banks. Each tag copy handles one of the two accesses per cycle.
The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.
The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex branch prediction, with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache.
The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy.
[edit] More hierarchies
Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.
These predictors are caches in the sense that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a hit in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
The K8 keeps the instruction and data caches coherent in hardware, which means that a store into an instruction closely following the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS family, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency. The idea is to save hardware complexity on the assumption that self-modifying code is rare.
[edit] Implementation
Cache reads are the most common CPU operation that takes more than a single cycle. Program execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of design effort, and often power and silicon area are expended making the caches as fast as possible.
The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address extracted and used to index an SRAM, which returns the loaded data. The data is byte aligned in a byte shifter, and from there is bypassed to the next operation. There is no need for any tag checking in the inner loop — in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read, and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.
An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.
Read path for a 2-way associative cache
Read path for a 2-way associative cache
The diagram to the right is intended to clarify the manner in which the various fields of the address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the SRAMs, indexing, and multiplexing for a 4 KiB, 2-way set-associative, virtually indexed and virtually tagged cache with 64 B lines, a 32b read width and 32b virtual address.
Because the cache is 4 KiB and has 64 B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21 bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits.
Similarly, because the cache is 4 KiB and has a 4 B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.
A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, 32b read width and 36b physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, vhints are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the vhint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. See Sum addressed decoder.
[edit] History
The early history of cache technology is closely tied to the invention and use of virtual memory.[citation needed] Because of scarcity and cost of semi-conductors memories, early mainframe computers in 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory used by programs. The memory technologies would span semi-conductor, magnetic core, drum and disc. Virtual memory seen and used by programs would be flat and caching would be used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimise the cache sizes. Optimal values were found to depend greatly on the programming language used with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.
The arrival of PCs coincided with a temporary decline in interest in caching. In the early days of PC technology, memory access was only slightly slower than register access. But since the 1980s [6] the performance gap between processor and memory has been growing. Processors have advanced much faster than memory, especially in terms of their operating frequency, so memory became a performance bottleneck. While it was technically possible to have all the main memory as fast as the processor, a more economically viable path has been taken: use plenty of low speed memory, but also introduce a small high speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity—for the same price—with only a slightly reduced combined performance.
[edit] History of cache in x86 architecture
As the x86 CPU architecture reached clock speeds of 20 MHz and above in the 386, small amounts of fast cache memory began to be included in the architecture to boost performance. This was because the DRAM used for main memory had significant latency, up to 120ns, as well as refresh cycles. The cache was constructed from more expensive, yet significantly faster, SRAM, which at the time had latencies around 10ns. The early caches were external to the processor and typically located on the motherboard in the form of 8 or 9 DIP memory chips placed in sockets to enable the cache as an optional extra or upgrade feature.
Some versions of the Intel 386 type processor could support 16 to 64 KiB of external cache.
With the 486 processor an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2) cache. These on-motherboard caches were much larger, with the most common size being 256 KiB and frequently utilizing a SIMM form factor. The popularity of on-motherboard cache continued on through the Pentium MMX era but was made obsolete by the introduction of SDRAM and the growing disparity between bus speed and CPU clock speed, which caused on-motherboard cache to be only slightly faster than main memory.
The next evolution of the x86 architecture starting with the Pentium Pro brought the secondary cache into the CPU, clocked at or slightly slower than the CPU frequency and level 1 cache.
Linear system equations by method Gauss via MPI
Problem: Solve linear system equations by method Gauss via MPI.
[Code:]
#include "iostream"
#include "mpi.h"
#include "windows.h"
#include "time.h"
#include "cassert"
using namespace std;
int val()
{
return 9*(int)rand()/RAND_MAX + 1;
}
int main(int argc, char ** argv)
{
double elapsed_time;
int i,j,k,index;
int mynode, totalnodes;
double scale;
int size;
srand((unsigned int) time(0));
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
if (mynode == 0)
{
cout <<"Number of processors: " << totalnodes << endl;
cout <<"Input size: ";
cin >> size;
assert(size % totalnodes == 0);
for(int i = 1; i <= totalnodes; i++)
MPI_Send(&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
}
if (mynode != 0)
MPI_Recv(&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
elapsed_time = MPI_Wtime();
double **ttA = new double*[size];
for(int l = 0; l < size; l++)
ttA[l] = new double[size+1];
for(l = 0; l < size; l++)
{
for(int m = 0; m < size+1; m++)
ttA[l][m] = val();
}
if (!mynode)
for(l = 0; l < size; l++)
{
for(int m = 0; m < size+1; m++)
{
cout << ttA[l][m] <<" " ;
}
cout << endl;
}
int numrows = size / totalnodes;
double **A_Local = new double*[numrows];
for(i = 0; i < numrows; i++)
A_Local[i] = new double[size+1];
int *myrows = new int[numrows];
for(i = 0; i < numrows; i++)
{
index = mynode + i * totalnodes;
myrows[i] = index;
for(int j = 0; j < size+1; j++)
A_Local[i][j] = ttA[index][j];
}
double *tmp = new double[size+1];
double *x = new double[size];
int cnt = 0;
for(i = 0; i < size-1; i++)
{
if (i == myrows[cnt])
{
//broadcast A_Local[i] & update tmp
MPI_Bcast(A_Local[cnt], size+1, MPI_DOUBLE, mynode, MPI_COMM_WORLD);
for(j = 0; j < size+1; j++)
tmp[j] = A_Local[cnt][j];
cnt++;
}
else
MPI_Bcast(tmp, size+1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
// update A_Local from tmp
for(j = cnt; j < numrows; j++)
{
scale = A_Local[j][i]/tmp[i];
for(k = i; k < size+1; k++)
A_Local[j][k] = A_Local[j][k] - scale * tmp[k];
}
}
// preparation for substitution
/* init root x equal to right-hand side of system equations ( vector b)otherwise x[i] = 0 */
cnt = 0;
for(i = 0; i < size; i++)
{
if (i == myrows[cnt])
{
x[i] = A_Local[cnt][size];
cnt++;
}
else
x[i] = 0;
}
// substitution to find root
cnt = numrows-1;
for(i = size-1; i>0; i--)
{
if(cnt >= 0)
{
if (i == myrows[cnt])
{
x[i] = x[i]/A_Local[cnt][i];
MPI_Bcast(x+i, 1, MPI_DOUBLE, mynode, MPI_COMM_WORLD);
cnt--;
}
else
MPI_Bcast(x+i, 1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
}
else
MPI_Bcast(x+i, 1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
//update x[i]
for(j = 0; j <= cnt; j++)
x[myrows[j]] = x[myrows[j]] - A_Local[j][i]*x[i];
}
if(mynode==0)
{
x[0] = x[0]/A_Local[cnt][0];
MPI_Bcast(x,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
}
else
MPI_Bcast(x,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
elapsed_time = MPI_Wtime() - elapsed_time;
if(mynode==0)
{
for(i=0; i < size;i++)
cout <<"x["< cout <<"Elapsed time: " << elapsed_time <<" sec" << endl;
}
delete [] tmp;
delete [] myrows;
for(i = 0; i < numrows; i++)
delete [] A_Local[i];
delete [] A_Local;
for( i = 0; i < size; i++)
delete [] ttA[i];
delete [] ttA;
MPI_Finalize();
return 0;
}
[Code:]
#include "iostream"
#include "mpi.h"
#include "windows.h"
#include "time.h"
#include "cassert"
using namespace std;
int val()
{
return 9*(int)rand()/RAND_MAX + 1;
}
int main(int argc, char ** argv)
{
double elapsed_time;
int i,j,k,index;
int mynode, totalnodes;
double scale;
int size;
srand((unsigned int) time(0));
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
if (mynode == 0)
{
cout <<"Number of processors: " << totalnodes << endl;
cout <<"Input size: ";
cin >> size;
assert(size % totalnodes == 0);
for(int i = 1; i <= totalnodes; i++)
MPI_Send(&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
}
if (mynode != 0)
MPI_Recv(&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
elapsed_time = MPI_Wtime();
double **ttA = new double*[size];
for(int l = 0; l < size; l++)
ttA[l] = new double[size+1];
for(l = 0; l < size; l++)
{
for(int m = 0; m < size+1; m++)
ttA[l][m] = val();
}
if (!mynode)
for(l = 0; l < size; l++)
{
for(int m = 0; m < size+1; m++)
{
cout << ttA[l][m] <<" " ;
}
cout << endl;
}
int numrows = size / totalnodes;
double **A_Local = new double*[numrows];
for(i = 0; i < numrows; i++)
A_Local[i] = new double[size+1];
int *myrows = new int[numrows];
for(i = 0; i < numrows; i++)
{
index = mynode + i * totalnodes;
myrows[i] = index;
for(int j = 0; j < size+1; j++)
A_Local[i][j] = ttA[index][j];
}
double *tmp = new double[size+1];
double *x = new double[size];
int cnt = 0;
for(i = 0; i < size-1; i++)
{
if (i == myrows[cnt])
{
//broadcast A_Local[i] & update tmp
MPI_Bcast(A_Local[cnt], size+1, MPI_DOUBLE, mynode, MPI_COMM_WORLD);
for(j = 0; j < size+1; j++)
tmp[j] = A_Local[cnt][j];
cnt++;
}
else
MPI_Bcast(tmp, size+1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
// update A_Local from tmp
for(j = cnt; j < numrows; j++)
{
scale = A_Local[j][i]/tmp[i];
for(k = i; k < size+1; k++)
A_Local[j][k] = A_Local[j][k] - scale * tmp[k];
}
}
// preparation for substitution
/* init root x equal to right-hand side of system equations ( vector b)otherwise x[i] = 0 */
cnt = 0;
for(i = 0; i < size; i++)
{
if (i == myrows[cnt])
{
x[i] = A_Local[cnt][size];
cnt++;
}
else
x[i] = 0;
}
// substitution to find root
cnt = numrows-1;
for(i = size-1; i>0; i--)
{
if(cnt >= 0)
{
if (i == myrows[cnt])
{
x[i] = x[i]/A_Local[cnt][i];
MPI_Bcast(x+i, 1, MPI_DOUBLE, mynode, MPI_COMM_WORLD);
cnt--;
}
else
MPI_Bcast(x+i, 1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
}
else
MPI_Bcast(x+i, 1, MPI_DOUBLE, i%totalnodes, MPI_COMM_WORLD);
//update x[i]
for(j = 0; j <= cnt; j++)
x[myrows[j]] = x[myrows[j]] - A_Local[j][i]*x[i];
}
if(mynode==0)
{
x[0] = x[0]/A_Local[cnt][0];
MPI_Bcast(x,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
}
else
MPI_Bcast(x,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
elapsed_time = MPI_Wtime() - elapsed_time;
if(mynode==0)
{
for(i=0; i < size;i++)
cout <<"x["< cout <<"Elapsed time: " << elapsed_time <<" sec" << endl;
}
delete [] tmp;
delete [] myrows;
for(i = 0; i < numrows; i++)
delete [] A_Local[i];
delete [] A_Local;
for( i = 0; i < size; i++)
delete [] ttA[i];
delete [] ttA;
MPI_Finalize();
return 0;
}
Calculating the Hamming Code
The key to the Hamming Code is the use of extra parity bits to allow the identification of a single error. Create the code word as follows:
1. Mark all bit positions that are powers of two as parity bits. (positions 1, 2, 4, 8, 16, 32, 64, etc.)
2. All other bit positions are for the data to be encoded. (positions 3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, etc.)
3. Each parity bit calculates the parity for some of the bits in the code word. The position of the parity bit determines the sequence of bits that it alternately checks and skips.
Position 1: check 1 bit, skip 1 bit, check 1 bit, skip 1 bit, etc. (1,3,5,7,9,11,13,15,...)
Position 2: check 2 bits, skip 2 bits, check 2 bits, skip 2 bits, etc. (2,3,6,7,10,11,14,15,...)
Position 4: check 4 bits, skip 4 bits, check 4 bits, skip 4 bits, etc. (4,5,6,7,12,13,14,15,20,21,22,23,...)
Position 8: check 8 bits, skip 8 bits, check 8 bits, skip 8 bits, etc. (8-15,24-31,40-47,...)
Position 16: check 16 bits, skip 16 bits, check 16 bits, skip 16 bits, etc. (16-31,48-63,80-95,...)
Position 32: check 32 bits, skip 32 bits, check 32 bits, skip 32 bits, etc. (32-63,96-127,160-191,...)
etc.
4. Set a parity bit to 1 if the total number of ones in the positions it checks is odd. Set a parity bit to 0 if the total number of ones in the positions it checks is even.
Here is an example:
A byte of data: 10011010
Create the data word, leaving spaces for the parity bits: _ _ 1 _ 0 0 1 _ 1 0 1 0
Calculate the parity for each parity bit (a ? represents the bit position being set):
* Position 1 checks bits 1,3,5,7,9,11:
? _ 1 _ 0 0 1 _ 1 0 1 0. Even parity so set position 1 to a 0: 0 _ 1 _ 0 0 1 _ 1 0 1 0
* Position 2 checks bits 2,3,6,7,10,11:
0 ? 1 _ 0 0 1 _ 1 0 1 0. Odd parity so set position 2 to a 1: 0 1 1 _ 0 0 1 _ 1 0 1 0
* Position 4 checks bits 4,5,6,7,12:
0 1 1 ? 0 0 1 _ 1 0 1 0. Odd parity so set position 4 to a 1: 0 1 1 1 0 0 1 _ 1 0 1 0
* Position 8 checks bits 8,9,10,11,12:
0 1 1 1 0 0 1 ? 1 0 1 0. Even parity so set position 8 to a 0: 0 1 1 1 0 0 1 0 1 0 1 0
* Code word: 011100101010.
Finding and fixing a bad bit
The above example created a code word of 011100101010. Suppose the word that was received was 011100101110 instead. Then the receiver could calculate which bit was wrong and correct it. The method is to verify each check bit. Write down all the incorrect parity bits. Doing so, you will discover that parity bits 2 and 8 are incorrect. It is not an accident that 2 + 8 = 10, and that bit position 10 is the location of the bad bit. In general, check each parity bit, and add the positions that are wrong, this will give you the location of the bad bit.
Try one yourself
Test if these code words are correct, assuming they were created using an even parity Hamming Code . If one is incorrect, indicate what the correct code word should have been. Also, indicate what the original data was.
* 010101100011
* 111110001100
* 000010001010
1. Mark all bit positions that are powers of two as parity bits. (positions 1, 2, 4, 8, 16, 32, 64, etc.)
2. All other bit positions are for the data to be encoded. (positions 3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, etc.)
3. Each parity bit calculates the parity for some of the bits in the code word. The position of the parity bit determines the sequence of bits that it alternately checks and skips.
Position 1: check 1 bit, skip 1 bit, check 1 bit, skip 1 bit, etc. (1,3,5,7,9,11,13,15,...)
Position 2: check 2 bits, skip 2 bits, check 2 bits, skip 2 bits, etc. (2,3,6,7,10,11,14,15,...)
Position 4: check 4 bits, skip 4 bits, check 4 bits, skip 4 bits, etc. (4,5,6,7,12,13,14,15,20,21,22,23,...)
Position 8: check 8 bits, skip 8 bits, check 8 bits, skip 8 bits, etc. (8-15,24-31,40-47,...)
Position 16: check 16 bits, skip 16 bits, check 16 bits, skip 16 bits, etc. (16-31,48-63,80-95,...)
Position 32: check 32 bits, skip 32 bits, check 32 bits, skip 32 bits, etc. (32-63,96-127,160-191,...)
etc.
4. Set a parity bit to 1 if the total number of ones in the positions it checks is odd. Set a parity bit to 0 if the total number of ones in the positions it checks is even.
Here is an example:
A byte of data: 10011010
Create the data word, leaving spaces for the parity bits: _ _ 1 _ 0 0 1 _ 1 0 1 0
Calculate the parity for each parity bit (a ? represents the bit position being set):
* Position 1 checks bits 1,3,5,7,9,11:
? _ 1 _ 0 0 1 _ 1 0 1 0. Even parity so set position 1 to a 0: 0 _ 1 _ 0 0 1 _ 1 0 1 0
* Position 2 checks bits 2,3,6,7,10,11:
0 ? 1 _ 0 0 1 _ 1 0 1 0. Odd parity so set position 2 to a 1: 0 1 1 _ 0 0 1 _ 1 0 1 0
* Position 4 checks bits 4,5,6,7,12:
0 1 1 ? 0 0 1 _ 1 0 1 0. Odd parity so set position 4 to a 1: 0 1 1 1 0 0 1 _ 1 0 1 0
* Position 8 checks bits 8,9,10,11,12:
0 1 1 1 0 0 1 ? 1 0 1 0. Even parity so set position 8 to a 0: 0 1 1 1 0 0 1 0 1 0 1 0
* Code word: 011100101010.
Finding and fixing a bad bit
The above example created a code word of 011100101010. Suppose the word that was received was 011100101110 instead. Then the receiver could calculate which bit was wrong and correct it. The method is to verify each check bit. Write down all the incorrect parity bits. Doing so, you will discover that parity bits 2 and 8 are incorrect. It is not an accident that 2 + 8 = 10, and that bit position 10 is the location of the bad bit. In general, check each parity bit, and add the positions that are wrong, this will give you the location of the bad bit.
Try one yourself
Test if these code words are correct, assuming they were created using an even parity Hamming Code . If one is incorrect, indicate what the correct code word should have been. Also, indicate what the original data was.
* 010101100011
* 111110001100
* 000010001010
Arithmetic operations in negative base system
A negative base (or negative radix) may be used to construct a non-standard positional numeral system. Like other place-value systems, each position holds multiples of the appropriate power of the system's base; but that base is negative — that is to say, the base b is equal to − r for some natural number r (r≥2).
Negative-base systems can accommodate all the same numbers as standard place-value systems; but both positive and negative numbers are represented without the use of a minus sign (or, in computer representation, a sign bit); this advantage is countered by an increased complexity of arithmetic operations. The need to store the "information" normally contained by a negative sign often results in a negative-base number being one digit longer than its positive-base equivalent.
The common names for negative-base positional numeral systems are formed by prefixing nega- to the name of the corresponding positive-base system; for example, negadecimal (base -10) corresponds to decimal (base 10), negaternary (base -3) to ternary (base 3), and negabinary (base -2) to binary (base 2).
Contents
[hide]
* 1 Example
* 2 History
* 3 Notation and use
* 4 Calculation
* 5 Arithmetic operations
o 5.1 Addition
o 5.2 Subtraction
o 5.3 Multiplication and division
* 6 Fractional numbers
o 6.1 Non-unique representations
* 7 Imaginary base
* 8 See also
* 9 References
[edit] Example
Consider what is meant by the representation 12243 in the negadecimal system, whose base b is -10:
multiples of b4
(i.e. 10000) multiples of b3
(i.e. -1000) multiples of b2
(i.e. 100) multiples of b1
(i.e. -10) multiples of b0
(i.e. 1)
1 2 2 4 3
Since 10000 + (-2000) + 200 + (-40) + 3 = 8163, the representation 12243 in negadecimal notation is equivalent to 8163 in decimal notation.
[edit] History
Negative numerical bases were first considered by Vittorio Grunwald in his work Giornale di Matematiche di Battaglini, published in 1885. Grunwald gave algorithms for performing addition, subtraction, multiplication, division, root extraction, divisibility tests, and radix conversion. Negative bases were later independently rediscovered by A. J. Kempner in 1936 and Zdzisław Pawlak and A. Wakulicz in 1959.
Negabinary was first implemented in computer hardware in the experimental Polish computers SKRZAT 1 and BINEG in 1950. Implementations since then have been rare.
[edit] Notation and use
Denoting the base as − r, every integer a can be written uniquely as
a = \sum_{i=0}^{n}d_{i}(-r)^{i}
where each digit dk is an integer from 0 to r − 1 and the leading digit dn is > 0 (unless n = 0). The base − r expansion of a is then given by the string d_n d_{n-1} \ldots d_1 d_0.
Negative-base systems may thus be compared to signed-digit representations, such as balanced ternary, where the radix is positive but the digits are taken from a partially negative range.
Some numbers have the same representation in base − r as in base r. For example, the numbers from 100 to 109 have the same representations in decimal and negadecimal. Similarly,
17 = 24 + 20 = ( − 2)4 + ( − 2)0
and is represented by 10001 in binary and 10001 in negabinary.
The numbers -15 to 15 with their expansions in a number of positive and corresponding negative bases are:
Decimal Negadecimal Binary Negabinary Ternary Negaternary
-15 25 -1111 110001 -120 1220
-14 26 -1110 110110 -112 1221
-13 27 -1101 110111 -111 1222
-12 28 -1100 110100 -110 1210
-11 29 -1011 110101 -102 1211
-10 10 -1010 1010 -101 1212
-9 11 -1001 1011 -100 1200
-8 12 -1000 1000 -22 1201
-7 13 -111 1001 -21 1202
-6 14 -110 1110 -20 20
-5 15 -101 1111 -12 21
-4 16 -100 1100 -11 22
-3 17 -11 1101 -10 10
-2 18 -10 10 -2 11
-1 19 -1 11 -1 12
0 0 0 0 0 0
1 1 1 1 1 1
2 2 10 110 2 2
3 3 11 111 10 120
4 4 100 100 11 121
5 5 101 101 12 122
6 6 110 11010 20 110
7 7 111 11011 21 111
8 8 1000 11000 22 112
9 9 1001 11001 100 100
10 190 1010 11110 101 101
11 191 1011 11111 102 102
12 192 1100 11100 110 220
13 193 1101 11101 111 221
14 194 1110 10010 112 222
15 195 1111 10011 120 210
Note that the base − r expansions of negative integers have an even number of digits, while the base − r expansions of the non-negative integers have an odd number of digits.
[edit] Calculation
The base − r expansion of a number can be found by repeated division by − r, recording the non-negative remainders of 0, 1,\ldots r-1, and concatenating those remainders, starting with the last. Note that if a / b = c, remainder d, then bc + d = a. For example, in negaternary:
\begin{align} 146 & ~/~ -3 = & -48, & ~\mbox{remainder}~ 2 \\ -48 & ~/~ -3 = & 16, & ~\mbox{remainder}~ 0 \\ 16 & ~/~ -3 = & -5, & ~\mbox{remainder}~ 1 \\ -5 & ~/~ -3 = & 2, & ~\mbox{remainder}~ 1 \\ 2 & ~/~ -3 = & 0, & ~\mbox{remainder}~ 2 \\ \end{align}
Therefore, the negaternary expansion of 146 is 21102.
Note that in most programming languages, the result (in integer arithmetic) of dividing a negative number by a negative number is rounded towards 0, usually leaving a negative remainder; to get the correct result in such case, computer implementations of the above algorithm should add 1 and r to the quotient and remainder respectively:
def negaternary(i):
digits = []
while i != 0:
i, remainder = divmod (i, -3)
if (remainder < 0):
i, remainder = i + 1, remainder + 3
digits.insert (0, str (remainder))
return ''.join (digits)
[edit] Arithmetic operations
The following describes the arithmetic operations for negabinary; calculations in larger bases are similar.
[edit] Addition
To add two negabinary numbers, start with a carry of 0, and, starting from the least significant bits, add the bits of the two numbers plus the carry. The resulting number is then looked up in the following table to get the bit to write down as result, and the next carry:
number bit carry
-2 0 1 (Note: -2 occurs only during subtraction.)
-1 1 1
0 0 0
1 1 0
2 0 -1
3 1 -1 (Note: 3 occurs only during addition.)
The second row of this table, for instance, expresses the fact that -1 = 1 + 1×(-2); the fifth row says 2 = 0 + -1×(-2); etc.
As an example, to add 1010101 (1+4+16+64 = 85) and 1110100 (4+16-32+64 = 52),
carry: 1 -1 0 -1 1 -1 0 0 0
first number: 1 0 1 0 1 0 1
second number: 1 1 1 0 1 0 0 +
--------------------------
number: 1 -1 2 0 3 -1 2 0 1
bit (result): 1 1 0 0 1 1 0 0 1
carry: 0 1 -1 0 -1 1 -1 0 0
so the result is 110011001 (1-8+16-128+256 = 137).
[edit] Subtraction
To subtract, multiply each bit of the second number by -1, and add the numbers, using the same table as above.
As an example, to compute 1101001 (1-8-32+64 = 25) minus 1110100 (4+16-32+64 = 52),
carry: 0 1 -1 1 0 0 0
first number: 1 1 0 1 0 0 1
second number: -1 -1 -1 0 -1 0 0 +
--------------------
number: 0 1 -2 2 -1 0 1
bit (result): 0 1 0 0 1 0 1
carry: 0 0 1 -1 1 0 0
so the result is 100101 (1+4-32 = -27).
To negate a number, compute 0 minus the number.
[edit] Multiplication and division
Shifting to the left multiplies by -2, shifting to the right divides by -2.
To multiply, multiply like normal decimal or binary numbers, but using the negabinary rules for adding the carry, when adding the numbers.
first number: 1 1 1 0 1 1 0
second number: 1 0 1 1 0 1 1 *
-------------------------------------
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0 +
-------------------------------------
carry: 0 -1 0 -1 -1 -1 -1 -1 0 -1 0 0
number: 1 0 2 1 2 2 2 3 2 0 2 1 0
bit (result): 1 0 0 1 0 0 0 1 0 0 0 1 0
carry: 0 -1 0 -1 -1 -1 -1 -1 0 -1 0 0
For each column, add carry to number, and divide the sum by -2, to get the new carry, and the resulting bit as the remainder.
[edit] Fractional numbers
Base − r representation may of course be carried beyond the radix point, allowing the representation of non-integral numbers.
As with positive-base systems, terminating representations correspond to fractions where the denominator is a power of the base; repeating representations correspond to other rationals, and for the same reason.
[edit] Non-unique representations
Unlike positive-base systems, where integers and terminating fractions have non-unique representations (for example, in decimal 0.999... = 1) in negative-base systems the integers have only a single representation. However, there do exist rationals with non-unique representations; for example, in negaternary,
0.(02)\ldots_{(-3)} = \frac{1}{4} = 1.(20)\ldots_{(-3)}.
Such non-unique representations can be found by considering the largest and smallest possible representations with integral parts 0 and 1 respectively, and then noting that they are equal. (Indeed, this works with any integral-base system.) The rationals thus non-uniquely expressible are those of form
\frac{ar + 1}{b(r + 1)}.
[edit] Imaginary base
Main article: Quater-imaginary base
The construction of the integers from the natural numbers, and of the complex numbers from the reals, suggests the use of an imaginary number as the radix of a number system. The earliest known imaginary-base number system is the quater-imaginary base, first proposed by Donald Knuth in 1955, with radix 2i and digits 0, 1, 2, and 3; it can represent any complex number in a single string without the use of i or minus sign.[1]
Imaginary-base arithmetic is not much different from negative-base arithmetic, since an imaginary-base number may be considered as the interleave of its real and imaginary parts; using INTERCAL-72 notation,
x(2i) + (2i)y(2i) = x(2i) ¢ y(2i).
Further extension to hexadeci-quaternion or even hexapenintadiacosi-octonion bases is possible, but not particularly useful.
[edit] See also
* binary
* balanced ternary
* numeral systems
[edit] References
1. ^ D. Knuth. The Art of Computer Programming. Volume 2, 3rd Edition. Addison-Wesley. pp. 205, "Positional Number Systems"
* Vittorio Grunwald. Giornale di Matematiche di Battaglini (1885), 203-221, 367
* A. J. Kempner. (1936), 610-617
* Z. Pawlak and A. Wakulicz Bulletin de l'Academie Polonaise des Scienses, Classe III, 5 (1957), 233-236; Serie des sciences techniques 7 (1959), 713-721
* L. Wadel IRE Transactions EC-6 1957, 123
* N. M. Blachman, Communications of the ACM (1961), 257
* IEEE Transactions 1963, 274-276
* Computer Design May 1967, 52-63
* R. W. Marczynski, Annotated History of Computing, 1980, 37-48
* D. Knuth. The Art of Computer Programming, Volume 2, 3rd. Ed. pp204-205
* Weisstein, Eric W. "Negadecimal." From MathWorld--A Wolfram Web Resource.
Negative-base systems can accommodate all the same numbers as standard place-value systems; but both positive and negative numbers are represented without the use of a minus sign (or, in computer representation, a sign bit); this advantage is countered by an increased complexity of arithmetic operations. The need to store the "information" normally contained by a negative sign often results in a negative-base number being one digit longer than its positive-base equivalent.
The common names for negative-base positional numeral systems are formed by prefixing nega- to the name of the corresponding positive-base system; for example, negadecimal (base -10) corresponds to decimal (base 10), negaternary (base -3) to ternary (base 3), and negabinary (base -2) to binary (base 2).
Contents
[hide]
* 1 Example
* 2 History
* 3 Notation and use
* 4 Calculation
* 5 Arithmetic operations
o 5.1 Addition
o 5.2 Subtraction
o 5.3 Multiplication and division
* 6 Fractional numbers
o 6.1 Non-unique representations
* 7 Imaginary base
* 8 See also
* 9 References
[edit] Example
Consider what is meant by the representation 12243 in the negadecimal system, whose base b is -10:
multiples of b4
(i.e. 10000) multiples of b3
(i.e. -1000) multiples of b2
(i.e. 100) multiples of b1
(i.e. -10) multiples of b0
(i.e. 1)
1 2 2 4 3
Since 10000 + (-2000) + 200 + (-40) + 3 = 8163, the representation 12243 in negadecimal notation is equivalent to 8163 in decimal notation.
[edit] History
Negative numerical bases were first considered by Vittorio Grunwald in his work Giornale di Matematiche di Battaglini, published in 1885. Grunwald gave algorithms for performing addition, subtraction, multiplication, division, root extraction, divisibility tests, and radix conversion. Negative bases were later independently rediscovered by A. J. Kempner in 1936 and Zdzisław Pawlak and A. Wakulicz in 1959.
Negabinary was first implemented in computer hardware in the experimental Polish computers SKRZAT 1 and BINEG in 1950. Implementations since then have been rare.
[edit] Notation and use
Denoting the base as − r, every integer a can be written uniquely as
a = \sum_{i=0}^{n}d_{i}(-r)^{i}
where each digit dk is an integer from 0 to r − 1 and the leading digit dn is > 0 (unless n = 0). The base − r expansion of a is then given by the string d_n d_{n-1} \ldots d_1 d_0.
Negative-base systems may thus be compared to signed-digit representations, such as balanced ternary, where the radix is positive but the digits are taken from a partially negative range.
Some numbers have the same representation in base − r as in base r. For example, the numbers from 100 to 109 have the same representations in decimal and negadecimal. Similarly,
17 = 24 + 20 = ( − 2)4 + ( − 2)0
and is represented by 10001 in binary and 10001 in negabinary.
The numbers -15 to 15 with their expansions in a number of positive and corresponding negative bases are:
Decimal Negadecimal Binary Negabinary Ternary Negaternary
-15 25 -1111 110001 -120 1220
-14 26 -1110 110110 -112 1221
-13 27 -1101 110111 -111 1222
-12 28 -1100 110100 -110 1210
-11 29 -1011 110101 -102 1211
-10 10 -1010 1010 -101 1212
-9 11 -1001 1011 -100 1200
-8 12 -1000 1000 -22 1201
-7 13 -111 1001 -21 1202
-6 14 -110 1110 -20 20
-5 15 -101 1111 -12 21
-4 16 -100 1100 -11 22
-3 17 -11 1101 -10 10
-2 18 -10 10 -2 11
-1 19 -1 11 -1 12
0 0 0 0 0 0
1 1 1 1 1 1
2 2 10 110 2 2
3 3 11 111 10 120
4 4 100 100 11 121
5 5 101 101 12 122
6 6 110 11010 20 110
7 7 111 11011 21 111
8 8 1000 11000 22 112
9 9 1001 11001 100 100
10 190 1010 11110 101 101
11 191 1011 11111 102 102
12 192 1100 11100 110 220
13 193 1101 11101 111 221
14 194 1110 10010 112 222
15 195 1111 10011 120 210
Note that the base − r expansions of negative integers have an even number of digits, while the base − r expansions of the non-negative integers have an odd number of digits.
[edit] Calculation
The base − r expansion of a number can be found by repeated division by − r, recording the non-negative remainders of 0, 1,\ldots r-1, and concatenating those remainders, starting with the last. Note that if a / b = c, remainder d, then bc + d = a. For example, in negaternary:
\begin{align} 146 & ~/~ -3 = & -48, & ~\mbox{remainder}~ 2 \\ -48 & ~/~ -3 = & 16, & ~\mbox{remainder}~ 0 \\ 16 & ~/~ -3 = & -5, & ~\mbox{remainder}~ 1 \\ -5 & ~/~ -3 = & 2, & ~\mbox{remainder}~ 1 \\ 2 & ~/~ -3 = & 0, & ~\mbox{remainder}~ 2 \\ \end{align}
Therefore, the negaternary expansion of 146 is 21102.
Note that in most programming languages, the result (in integer arithmetic) of dividing a negative number by a negative number is rounded towards 0, usually leaving a negative remainder; to get the correct result in such case, computer implementations of the above algorithm should add 1 and r to the quotient and remainder respectively:
def negaternary(i):
digits = []
while i != 0:
i, remainder = divmod (i, -3)
if (remainder < 0):
i, remainder = i + 1, remainder + 3
digits.insert (0, str (remainder))
return ''.join (digits)
[edit] Arithmetic operations
The following describes the arithmetic operations for negabinary; calculations in larger bases are similar.
[edit] Addition
To add two negabinary numbers, start with a carry of 0, and, starting from the least significant bits, add the bits of the two numbers plus the carry. The resulting number is then looked up in the following table to get the bit to write down as result, and the next carry:
number bit carry
-2 0 1 (Note: -2 occurs only during subtraction.)
-1 1 1
0 0 0
1 1 0
2 0 -1
3 1 -1 (Note: 3 occurs only during addition.)
The second row of this table, for instance, expresses the fact that -1 = 1 + 1×(-2); the fifth row says 2 = 0 + -1×(-2); etc.
As an example, to add 1010101 (1+4+16+64 = 85) and 1110100 (4+16-32+64 = 52),
carry: 1 -1 0 -1 1 -1 0 0 0
first number: 1 0 1 0 1 0 1
second number: 1 1 1 0 1 0 0 +
--------------------------
number: 1 -1 2 0 3 -1 2 0 1
bit (result): 1 1 0 0 1 1 0 0 1
carry: 0 1 -1 0 -1 1 -1 0 0
so the result is 110011001 (1-8+16-128+256 = 137).
[edit] Subtraction
To subtract, multiply each bit of the second number by -1, and add the numbers, using the same table as above.
As an example, to compute 1101001 (1-8-32+64 = 25) minus 1110100 (4+16-32+64 = 52),
carry: 0 1 -1 1 0 0 0
first number: 1 1 0 1 0 0 1
second number: -1 -1 -1 0 -1 0 0 +
--------------------
number: 0 1 -2 2 -1 0 1
bit (result): 0 1 0 0 1 0 1
carry: 0 0 1 -1 1 0 0
so the result is 100101 (1+4-32 = -27).
To negate a number, compute 0 minus the number.
[edit] Multiplication and division
Shifting to the left multiplies by -2, shifting to the right divides by -2.
To multiply, multiply like normal decimal or binary numbers, but using the negabinary rules for adding the carry, when adding the numbers.
first number: 1 1 1 0 1 1 0
second number: 1 0 1 1 0 1 1 *
-------------------------------------
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0
1 1 1 0 1 1 0 +
-------------------------------------
carry: 0 -1 0 -1 -1 -1 -1 -1 0 -1 0 0
number: 1 0 2 1 2 2 2 3 2 0 2 1 0
bit (result): 1 0 0 1 0 0 0 1 0 0 0 1 0
carry: 0 -1 0 -1 -1 -1 -1 -1 0 -1 0 0
For each column, add carry to number, and divide the sum by -2, to get the new carry, and the resulting bit as the remainder.
[edit] Fractional numbers
Base − r representation may of course be carried beyond the radix point, allowing the representation of non-integral numbers.
As with positive-base systems, terminating representations correspond to fractions where the denominator is a power of the base; repeating representations correspond to other rationals, and for the same reason.
[edit] Non-unique representations
Unlike positive-base systems, where integers and terminating fractions have non-unique representations (for example, in decimal 0.999... = 1) in negative-base systems the integers have only a single representation. However, there do exist rationals with non-unique representations; for example, in negaternary,
0.(02)\ldots_{(-3)} = \frac{1}{4} = 1.(20)\ldots_{(-3)}.
Such non-unique representations can be found by considering the largest and smallest possible representations with integral parts 0 and 1 respectively, and then noting that they are equal. (Indeed, this works with any integral-base system.) The rationals thus non-uniquely expressible are those of form
\frac{ar + 1}{b(r + 1)}.
[edit] Imaginary base
Main article: Quater-imaginary base
The construction of the integers from the natural numbers, and of the complex numbers from the reals, suggests the use of an imaginary number as the radix of a number system. The earliest known imaginary-base number system is the quater-imaginary base, first proposed by Donald Knuth in 1955, with radix 2i and digits 0, 1, 2, and 3; it can represent any complex number in a single string without the use of i or minus sign.[1]
Imaginary-base arithmetic is not much different from negative-base arithmetic, since an imaginary-base number may be considered as the interleave of its real and imaginary parts; using INTERCAL-72 notation,
x(2i) + (2i)y(2i) = x(2i) ¢ y(2i).
Further extension to hexadeci-quaternion or even hexapenintadiacosi-octonion bases is possible, but not particularly useful.
[edit] See also
* binary
* balanced ternary
* numeral systems
[edit] References
1. ^ D. Knuth. The Art of Computer Programming. Volume 2, 3rd Edition. Addison-Wesley. pp. 205, "Positional Number Systems"
* Vittorio Grunwald. Giornale di Matematiche di Battaglini (1885), 203-221, 367
* A. J. Kempner. (1936), 610-617
* Z. Pawlak and A. Wakulicz Bulletin de l'Academie Polonaise des Scienses, Classe III, 5 (1957), 233-236; Serie des sciences techniques 7 (1959), 713-721
* L. Wadel IRE Transactions EC-6 1957, 123
* N. M. Blachman, Communications of the ACM (1961), 257
* IEEE Transactions 1963, 274-276
* Computer Design May 1967, 52-63
* R. W. Marczynski, Annotated History of Computing, 1980, 37-48
* D. Knuth. The Art of Computer Programming, Volume 2, 3rd. Ed. pp204-205
* Weisstein, Eric W. "Negadecimal." From MathWorld--A Wolfram Web Resource.
Подписаться на:
Сообщения (Atom)