Saturday, May 17, 2008
Monday, May 12, 2008
What Is Direct Search?
Direct search is a method for solving optimization problems that does not require any information about the gradient of the objective function. As opposed to more traditional optimization methods that use information about the gradient or higher derivatives to search for an optimal point, a direct search algorithm searches a set of points around the current point, looking for one where the value of the objective function is lower than the value at the current point. You can use direct search to solve problems for which the objective function is not differentiable, or even continuous. The Genetic Algorithm and Direct Search Toolbox implements a special class of direct search algorithms called pattern search algorithms. A pattern search algorithm computes a sequence of points that get closer and closer to the optimal point. At each step, the algorithm searches a set of points, called a mesh, around the current point -- the point computed at the previous step of the algorithm. The algorithm forms the mesh by adding the current point to a scalar multiple of a fixed set of vectors called a pattern. If the algorithm finds a point in the mesh that improves the objective function at the current point, the new point becomes the current point at the next step of the algorithm.
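To make the mesh-and-poll idea concrete, here is a minimal sketch of a pattern search in Python. It is only an illustration of the description above, not the toolbox's patternsearch function; the objective f, the 2-D pattern, and the mesh expansion/contraction factors (2 and 0.5) are all assumptions.

```python
# Minimal sketch of a pattern search in 2-D. Illustrative only; f, the
# pattern, and the expansion/contraction factors are assumptions.

def poll_step(f, current, mesh_size):
    """Poll the mesh around `current`; return (new_point, new_mesh_size)."""
    pattern = [(1, 0), (0, 1), (-1, 0), (0, -1)]    # fixed set of vectors (the pattern)
    f_current = f(current)
    for v in pattern:
        # Mesh point = current point + mesh_size * pattern vector
        trial = (current[0] + mesh_size * v[0], current[1] + mesh_size * v[1])
        if f(trial) < f_current:                     # successful poll: improvement found
            return trial, 2.0 * mesh_size            # expand the mesh (assumed factor)
    return current, 0.5 * mesh_size                  # unsuccessful poll: contract the mesh

def pattern_search(f, start, mesh_size=1.0, tol=1e-6, max_iter=1000):
    point = start
    for _ in range(max_iter):
        point, mesh_size = poll_step(f, point, mesh_size)
        if mesh_size < tol:                          # stop once the mesh is very fine
            break
    return point

if __name__ == "__main__":
    # Example objective; note that no gradient is used anywhere above.
    f = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2
    print(pattern_search(f, start=(0.0, 0.0)))       # approaches (3, -1)
```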
Pattern Search Terminology
This section explains some standard terminology for pattern search, including
Patterns
Meshes
Polling
Patterns
A pattern is a collection of vectors that the algorithm uses to determine which points to search at each iteration.
What Is the Genetic Algorithm?
The genetic algorithm uses three main types of rules at each step to create the next generation from the current population (a minimal sketch of these operators follows the list below):
+ Selection rules select the individuals, called parents, that contribute to the population at the next generation.
+ Crossover rules combine two parents to form children for the next generation.
+ Mutation rules apply random changes to individual parents to form children.
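As a concrete, purely illustrative sketch of these three rule types, the following Python functions operate on individuals represented as lists of floats. Tournament selection, uniform crossover, and Gaussian mutation are assumptions here, not the toolbox's actual operators.

```python
import random

# Illustrative sketches of the three rule types for real-valued individuals
# (lists of floats). Operator choices and parameters are assumptions.

def select_parent(population, fitness, k=3):
    """Selection rule: tournament of size k; lower fitness is better."""
    contenders = random.sample(range(len(population)), k)
    best = min(contenders, key=lambda i: fitness[i])
    return population[best]

def crossover(parent_a, parent_b):
    """Crossover rule: combine the vector entries of a pair of parents."""
    return [a if random.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def mutate(parent, rate=0.1, scale=0.5):
    """Mutation rule: apply random changes to a single parent."""
    return [x + random.gauss(0.0, scale) if random.random() < rate else x
            for x in parent]

# Example usage on a small random population (lower fitness is better).
pop = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(10)]
fit = [x * x + y * y for x, y in pop]
child = mutate(crossover(select_parent(pop, fit), select_parent(pop, fit)))
```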
Standard algorithm:
- Generates a single point at each iteration. The sequence of points approaches an optimal solution.
- Selects the next point in the sequence by a deterministic computation.
Genetic algorithm:
- Generates a population of points at each iteration. The population approaches an optimal solution.
- Selects the next population by computations that involve random choices.
Some Genetic Algorithm Terminology:
Fitness Functions
Individuals, Populations, and Generations
Fitness Values and Best Fitness Values
Parents and Children
Algorithm
1. The algorithm begins by creating a random initial population.
2. The algorithm then creates a sequence of new populations (generations). At each step, the algorithm uses the individuals in the current generation to create the next generation. To create the new generation, the algorithm performs the following steps:
a. Scores each member of the current population by computing its fitness value.
b. Scales the raw fitness scores to convert them into a more usable range of values.
c. Selects parents based on their fitness.
d. Produces children from the parents. Children are produced either by making random changes to a single parent (mutation) or by combining the vector entries of a pair of parents (crossover).
e. Replaces the current population with the children to form the next generation.
3. The algorithm stops when one of the stopping criteria is met; a minimal sketch of this generational loop appears after the list of stopping conditions below.
[Stopping Conditions for the Algorithm]
a. Generations -- The algorithm stops when the number of generations reaches the value of Generations.
b. Time limit -- The algorithm stops after running for an amount of time in seconds equal to Time limit.
c. Fitness limit -- The algorithm stops when the value of the fitness function for the best point in the current population is less than or equal to Fitness limit.
d. Stall generations -- The algorithm stops if there is no improvement in the objective function for a sequence of consecutive generations of length Stall generations.
e. Stall time limit -- The algorithm stops if there is no improvement in the objective function during an interval of time in seconds equal to Stall time limit.
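The following Python sketch ties the steps and stopping conditions above together into one generational loop. It is not the toolbox's ga function; the operator choices and every parameter value are illustrative assumptions, and the stall-time-limit condition is omitted for brevity.

```python
import random
import time

# Sketch of the generational loop and stopping conditions described above.
# Operators and parameters are illustrative assumptions.

def ga(fitness, dim, pop_size=50, generations=100, fitness_limit=-float("inf"),
       stall_generations=20, time_limit=float("inf")):
    # 1. Random initial population.
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    start, best_so_far, stall = time.time(), float("inf"), 0

    for gen in range(generations):                       # a. Generations limit
        scores = [fitness(ind) for ind in pop]            # 2a. score each member
        best = min(scores)
        stall = stall + 1 if best >= best_so_far else 0   # track lack of improvement
        best_so_far = min(best_so_far, best)

        if best <= fitness_limit:                         # c. Fitness limit
            break
        if stall >= stall_generations:                    # d. Stall generations
            break
        if time.time() - start >= time_limit:             # b. Time limit
            break

        # 2b. Scale raw scores into ranks (lower rank = fitter).
        ranked = sorted(range(pop_size), key=lambda i: scores[i])

        def parent():                                     # 2c. select parents by fitness
            i, j = random.sample(range(pop_size), 2)
            return pop[ranked[min(i, j)]]

        children = []
        while len(children) < pop_size:                   # 2d. produce children
            if random.random() < 0.8:                     # crossover
                a, b = parent(), parent()
                children.append([x if random.random() < 0.5 else y
                                 for x, y in zip(a, b)])
            else:                                         # mutation
                children.append([x + random.gauss(0, 0.3) for x in parent()])
        pop = children                                    # 2e. replace the population

    return min(pop, key=fitness)

if __name__ == "__main__":
    print(ga(lambda v: sum(x * x for x in v), dim=3))     # approaches the origin
```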
Dynamic random access memory
Dynamic random access memory (DRAM) is a type of random access memory that stores each bit of data in a separate capacitor within an integrated circuit. Since real capacitors leak charge, the information eventually fades unless the capacitor charge is refreshed periodically. Because of this refresh requirement, it is a dynamic memory as opposed to SRAM and other static memory.
The advantage of DRAM is its structural simplicity: only one transistor and a capacitor are required per bit, compared to six transistors in SRAM. This allows DRAM to reach very high density. Like SRAM, it is in the class of volatile memory devices, since it loses its data when the power supply is removed. Unlike SRAM, however, data may still be recovered for a short time after power-off.
Contents
* 1 History
* 2 Operation principle
o 2.1 Memory timing
* 3 Errors and error correction
* 4 DRAM packaging
o 4.1 General DRAM packaging formats
o 4.2 Common DRAM modules
* 5 Variations
o 5.1 Asynchronous DRAM
o 5.2 Video DRAM (VRAM)
o 5.3 Fast Page Mode DRAM (FPM)
o 5.4 CAS before RAS refresh
o 5.5 Extended Data Out (EDO) DRAM
o 5.6 Burst EDO (BEDO) DRAM
o 5.7 Multibank DRAM (MDRAM)
o 5.8 Synchronous Graphics RAM (SGRAM)
o 5.9 Synchronous Dynamic RAM (SDRAM)
o 5.10 Direct Rambus DRAM (DRDRAM)
o 5.11 Double Data Rate (DDR) SDRAM
o 5.12 Pseudostatic RAM (PSRAM)
o 5.13 1T DRAM
o 5.14 RLDRAM
* 6 Security
* 7 See also
* 8 References
* 9 External links
History
Schematic drawing of original designs of DRAM patented in 1968.
In 1964, Arnold Farber and Eugene Schlig, working for IBM, created a hard-wired memory cell using a transistor gate and a tunnel-diode latch; they later replaced the latch with two transistors and two resistors, and this became known as the Farber-Schlig cell. In 1965, Benjamin Agusta and his team at IBM created a 16-bit silicon memory chip based on the Farber-Schlig cell, consisting of 80 transistors, 64 resistors, and 4 diodes. In 1966, DRAM was invented by Dr. Robert Dennard at the IBM Thomas J. Watson Research Center, and he was awarded U.S. patent number 3,387,286 in 1968. Capacitors had been used for earlier memory schemes such as the drum of the Atanasoff–Berry Computer, the Williams tube and the Selectron tube.
The Toshiba "Toscal" BC-1411 electronic calculator, which went into production in November 1965, uses a form of dynamic RAM built from discrete components.[1]
In 1969, Honeywell asked Intel to make a DRAM using a 3-transistor cell that they had developed. This became the Intel 1102 (1024x1) in early 1970. However, the 1102 had many problems, prompting Intel to begin work on their own improved design (secretly, to avoid conflict with Honeywell). This became the first commercially available 1-transistor-cell DRAM, the Intel 1103 (1024x1), in October 1970 (despite initial problems with low yield until the fifth revision of the masks).
The first DRAM with multiplexed row/column address lines was the Mostek MK4096 (4096x1) in 1973. Mostek held an 85% market share of the dynamic random access memory (DRAM) memory chip market worldwide, until being eclipsed by Japanese DRAM manufacturers who offered equivalent chips at lower prices.
Operation principle
Principle of operation of DRAM read, for simple 4 by 4 array.
Principle of operation of DRAM write, for simple 4 by 4 array.
DRAM is usually arranged in a square array of one capacitor and transistor per cell. The illustrations to the right show a simple example with only 4 by 4 cells (modern DRAM can be thousands of cells in length/width).
The long lines connecting each row are known as word lines. Each column is actually composed of two bit lines, each one connected to every other storage cell in the column. They are generally known as the + and − bit lines. A sense amplifier is essentially a pair of cross-connected inverters between the bit lines. That is, the first inverter is connected from the + bit line to the − bit line, and the second is connected from the − bit line to the + bit line. This is an example of positive feedback, and the arrangement is only stable with one bit line high and one bit line low.
To read a bit from a column, the following operations take place:
1. The sense amplifier is switched off and the bit lines are precharged to exactly matching voltages that are intermediate between high and low logic levels. The bit lines are constructed symmetrically to keep them balanced as precisely as possible.
2. The precharge circuit is switched off. Because the bit lines are very long, their capacitance will hold the precharge voltage for a brief time. This is an example of dynamic logic.
3. The selected row's word line is driven high. This connects one storage capacitor to one of the two bit lines. Charge is shared between the selected storage cell and the appropriate bit line, slightly altering the voltage on the line. Although every effort is made to keep the capacitance of the storage cells high and the capacitance of the bit lines low, capacitance is proportional to physical size, and the length of the bit lines means that the net effect is a very small perturbation of one bit line's voltage.
4. The sense amplifier is switched on. The positive feedback takes over and amplifies the small voltage difference until one bit line is fully low and the other is fully high. At this point, the column can be read.
5. At the end of a read cycle, the row values must be restored to the capacitors, which were depleted during the read: the bit line of the storage cell is also driven to full voltage (refreshed) by the action of the sense amplifier. Due to the length of the bit line, this takes significant time beyond the end of sense amplification.
To write to memory, the row is opened and a given column's sense amplifier is temporarily forced to the desired state, driving the bit line, which charges the capacitor to the desired value. The amplifier then drives the bit lines to the desired state and holds them stable even after the forcing is removed. During a write to a particular cell, the entire row is read out, one value changed, and then the entire row is written back, as illustrated in the figure to the right.
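A toy numeric model can make the charge-sharing and sensing steps above concrete. The capacitance and voltage values below are made-up illustrative numbers, not real device parameters.

```python
# Toy numeric sketch of the read sequence above: precharge, charge sharing,
# then sensing. Capacitances and voltages are illustrative assumptions.

VDD = 1.8            # full logic level (volts, assumed)
C_CELL = 30e-15      # storage-cell capacitance (farads, assumed)
C_BITLINE = 300e-15  # bit-line capacitance, much larger than the cell (assumed)

def read_cell(stored_bit):
    # 1-2. Both bit lines precharged to the midpoint voltage.
    v_plus = v_minus = VDD / 2

    # 3. Word line high: the cell shares charge with the + bit line.
    v_cell = VDD if stored_bit else 0.0
    v_plus = (C_BITLINE * v_plus + C_CELL * v_cell) / (C_BITLINE + C_CELL)
    print(f"perturbation on + bit line: {abs(v_plus - VDD / 2) * 1000:.1f} mV")

    # 4. Sense amplifier: positive feedback drives the small difference to full rail.
    bit = 1 if v_plus > v_minus else 0

    # 5. The bit line (and hence the storage cell) is restored to the full level.
    v_plus, v_minus = (VDD, 0.0) if bit else (0.0, VDD)
    return bit

assert read_cell(1) == 1 and read_cell(0) == 0
```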
Typically, manufacturers specify that each row should be refreshed every 64 ms or less, according to the JEDEC standard. Refresh logic is commonly used with DRAMs to automate the periodic refresh. This makes the circuit more complicated, but this drawback is usually outweighed by the fact that DRAM is much cheaper and of greater capacity than SRAM. Some systems refresh every row in a tight loop that occurs once every 64 ms. Other systems refresh one row at a time -- for example, a system with 2^13 = 8192 rows would require a refresh rate of one row every 7.8 µs (64 ms / 8192 rows). A few real-time systems refresh a portion of memory at a time based on an external timer that governs the operation of the rest of the system, such as the vertical blanking interval that occurs every 10 to 20 ms in video equipment. All methods require some sort of counter to keep track of which row is the next to be refreshed. Some DRAM chips include that counter; other kinds require external refresh logic to hold it. (Under some conditions, most of the data in DRAM can be recovered even if it has not been refreshed for several minutes.[2])
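The refresh arithmetic above can be checked directly; the 60 ns per-row figure used for the burst case below is only an assumed, illustrative cycle time.

```python
# Back-of-the-envelope refresh scheduling from the figures above: 8192 rows
# must each be refreshed within a 64 ms window.

ROWS = 2 ** 13                 # 8192 rows
REFRESH_WINDOW_MS = 64.0       # JEDEC-style retention requirement

# Distributed refresh: one row at a time, evenly spread over the window.
interval_us = REFRESH_WINDOW_MS * 1000.0 / ROWS
print(f"one row every {interval_us:.1f} microseconds")          # ~7.8 us

# Burst refresh: all rows in a tight loop once per window
# (assuming an illustrative 60 ns per row refresh cycle).
burst_ms = ROWS * 60e-9 * 1000.0
print(f"burst refresh occupies ~{burst_ms:.2f} ms of every {REFRESH_WINDOW_MS:.0f} ms")
```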
Memory timing
There are many numbers required to describe the speed of DRAM operation. Here are some examples for two speed grades of asynchronous DRAM, from a data sheet published in 1998:[3]
"50 ns" "60 ns" Description
tRC 84 ns 104 ns Random read or write cycle time (from one full /RAS cycle to another)
tRAC 50 ns 60 ns Access time: /RAS low to valid data out
tRCD 11 ns 14 ns /RAS low to /CAS low time
tRAS 50 ns 60 ns /RAS pulse width (minimum /RAS low time)
tRP 30 ns 40 ns /RAS precharge time (minimum /RAS high time)
tPC 20 ns 25 ns Page-mode read or write cycle time (/CAS to /CAS)
tAA 25 ns 30 ns Access time: Column address valid to valid data out (includes address setup time before /CAS low)
tCAC 13 ns 15 ns Access time: /CAS low to valid data out
tCAS 8 ns 10 ns /CAS low pulse width minimum
The number generally quoted is the /RAS access time. This is the time to read a random bit from a precharged DRAM array. The time to read additional bits from an open page is much less.
When such a RAM is accessed by clocked logic, the times are generally rounded up to the nearest clock cycle. For example, when accessed by a 100 MHz state machine (i.e. a 10 ns clock), the 50 ns DRAM can perform the first read in 5 clock cycles, and additional reads within the same page every 2 clock cycles. This was generally described as "5-2-2-2" timing, as bursts of 4 reads within a page were common.
When describing synchronous memory, timing is also described by clock cycle counts separated by hyphens, but the numbers have very different meanings! These numbers represent tCAS-tRCD-tRP-tRAS in multiples of the DRAM clock cycle time. Note that this is half of the data transfer rate when double data rate signaling is used. JEDEC standard PC3200 timing is 3-4-4-8[4] with a 200 MHz clock, while premium-priced high-speed PC3200 DDR DRAM DIMM might be operated at 2-2-2-5 timing.[5]
Cycles (time) at 3-4-4-8 Cycles (time) at 2-2-2-5 Description
tCL 3 15 ns 2 10 ns /CAS low to valid data out (equivalent to tCAC)
tRCD 4 20 ns 2 10 ns /RAS low to /CAS low time
tRP 4 20 ns 2 10 ns /RAS precharge time (minimum precharge to active time)
tRAS 8 40 ns 5 25 ns Minimum row active time (minimum active to precharge time)
It is worth noting that the improvement over 10 years is not that large. Minimum random access time has improved from 50 ns to tRCD + tCL = 35 ns, and even the premium 20 ns variety is only 2.5× better. However, the DDR memory does achieve 8 times higher bandwidth; due to internal pipelining and wide data paths, it can output one word every 2.5 ns, while the EDO DRAM can only output one word per tPC = 20 ns.
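The clock-cycle arithmetic above can be reproduced in a few lines; the figures below simply restate the "50 ns" asynchronous part on a 100 MHz clock and the 2-2-2-5 PC3200 example.

```python
import math

# Worked check of the timing arithmetic above: the "50 ns" asynchronous part
# driven by a 100 MHz (10 ns) clock, and the 2-2-2-5 PC3200 DDR example
# (200 MHz clock, 5 ns cycle, two words per clock).

CLK_NS = 10.0
tRAC, tPC = 50.0, 20.0
first_read = math.ceil(tRAC / CLK_NS)    # 5 cycles for the first read
page_read = math.ceil(tPC / CLK_NS)      # 2 cycles for each read within the page
print(f"asynchronous burst timing: {first_read}-{page_read}-{page_read}-{page_read}")

DDR_CLK_NS = 5.0
tCL, tRCD = 2 * DDR_CLK_NS, 2 * DDR_CLK_NS
print(f"DDR random access: tRCD + tCL = {tRCD + tCL:.0f} ns")     # the premium 20 ns case
print(f"DDR word rate: one word every {DDR_CLK_NS / 2:.1f} ns")   # 2.5 ns (double data rate)
print(f"EDO word rate: one word every {tPC:.0f} ns")              # one word per tPC
```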
Errors and error correction
Main article: ECC memory#Error-correcting memory
Electrical or magnetic interference inside a computer system can cause a single bit of DRAM to spontaneously flip to the opposite state. Some research has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of cosmic rays, which may change the contents of one or more memory cells or interfere with the circuitry used to read and write them. There is some concern that as DRAM density increases further, and thus the components on DRAM chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible, counteracting or even reversing this trend.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an ECC. Parity allows the detection of a single-bit error (actually, any odd number of wrong bits). The most common error correcting code, Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
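As a small worked illustration of the two schemes just described, the following sketch applies a single parity bit and a Hamming(7,4) code to a 4-bit value; real DRAM ECC operates on 64-bit words, but the mechanism is the same.

```python
# A single parity bit detects any odd number of flipped bits; a Hamming(7,4)
# code corrects any single flipped bit. Illustrative 4-bit example only.

def parity(bits):
    """Even parity bit over a list of 0/1 values."""
    return sum(bits) % 2

def hamming74_encode(d):
    """d = [d1, d2, d3, d4] -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = (d1 + d2 + d4) % 2
    p2 = (d1 + d3 + d4) % 2
    p3 = (d2 + d3 + d4) % 2
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    s1 = (c[0] + c[2] + c[4] + c[6]) % 2     # checks positions 1,3,5,7
    s2 = (c[1] + c[2] + c[5] + c[6]) % 2     # checks positions 2,3,6,7
    s3 = (c[3] + c[4] + c[5] + c[6]) % 2     # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3          # = 1-based index of the bad bit, 0 if none
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1                 # flip it back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[5] ^= 1                                 # simulate a single soft error
assert hamming74_decode(word) == data        # single-bit error corrected
assert parity(data + [parity(data)]) == 0    # stored parity still checks out
```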
Error detection and correction in computer systems seems to go in and out of fashion. Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600.[1] He included parity in the CDC 7600, and reputedly said "I learned that a lot of farmers buy computers." 486-era PCs often used parity; Pentium-era ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Current microprocessor memory controllers generally support ECC, but most non-server systems do not use these features. Even if they do, it is not clear that the software layers do their part.
Memory controllers in most modern PCs can typically detect, and correct errors of a single bit per 64 bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64 bit word. Some systems also 'scrub' the errors, by writing the corrected version back to memory. The BIOS in some computers, and operating systems such as Linux, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic. Unfortunately, most modern PCs are supplied with memory modules that have no parity or ECC bits.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits are in the same chip. This weakness does not seem to be widely addressed; one exception is Chipkill.
A reasonable rule of thumb is to expect one bit error per month per gigabyte of memory. Actual error rates vary widely.
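Applied to a hypothetical 4 GB machine, the rule of thumb works out as follows (actual rates vary by orders of magnitude):

```python
# The rule of thumb above, applied to a hypothetical 4 GB system left on for a year.
GB = 4
errors_per_month = 1 * GB          # ~1 bit error per GB per month
print(errors_per_month * 12)       # ~48 expected bit errors per year
```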
DRAM packaging
For economic reasons, the large (main) memories found in personal computers, workstations, and non-handheld game consoles (such as the PlayStation and Xbox) normally consist of dynamic RAM (DRAM). Other parts of the computer, such as cache memories and data buffers in hard disks, normally use static RAM (SRAM).
General DRAM packaging formats
DDR2 SDRAM packages
Common DRAM packages
EDO DRAM memory module
Dynamic random access memory is produced as integrated circuits (ICs) bonded and mounted into plastic packages with metal pins for connection to control signals and buses. Today, these DRAM packages are in turn often assembled into plug-in modules for easier handling. Some standard module types are:
* DRAM chip (Integrated Circuit or IC)
o Dual in-line Package (DIP)
* DRAM (memory) modules
o Single In-line Pin Package (SIPP)
o Single In-line Memory Module (SIMM)
o Dual In-line Memory Module (DIMM)
o Rambus In-line Memory Module (RIMM), technically DIMMs but called RIMMs due to their proprietary slot.
o Small outline DIMM (SO-DIMM), about half the size of regular DIMMs, mostly used in notebooks, small-footprint PCs (such as Mini-ITX motherboards), upgradable office printers and networking hardware like routers. It comes in versions with:
+ 72 pins (32-bit)
+ 144 pins (64-bit)
+ 200 pins (72-bit)
o Small outline RIMM (SO-RIMM). Smaller version of the RIMM, used in laptops. Technically SO-DIMMs but called SO-RIMMs due to their proprietary slot.
* Stacked vs. non-stacked RAM modules
o Stacked RAM modules contain two or more RAM chips stacked on top of each other. This allows large modules (like 512 MB or 1 GB SO-DIMMs) to be manufactured using cheaper low-density wafers. Stacked chip modules draw more power.
Common DRAM modules
Common DRAM packages as illustrated to the right, from top to bottom:
1. DIP 16-pin (DRAM chip, usually pre-FPRAM)
2. SIPP (usually FPRAM)
3. SIMM 30-pin (usually FPRAM)
4. SIMM 72-pin (so-called "PS/2 SIMM", usually EDO RAM)
5. DIMM 168-pin (SDRAM)
6. DIMM 184-pin (DDR SDRAM)
7. RIMM 184-pin
8. DIMM 240-pin (DDR2 SDRAM/DDR3 SDRAM)
Variations
Asynchronous DRAM
This is the basic form, from which all others are derived. An asynchronous DRAM chip has power connections, some number of address inputs (typically 12), and a few (typically 1 or 4) bidirectional data lines. There are four active low control signals:
* /RAS, the Row Address Strobe. The address inputs are captured on the falling edge of /RAS, and select a row to open. The row is held open as long as /RAS is low.
* /CAS, the Column Address Strobe. The address inputs are captured on the falling edge of /CAS, and select a column from the currently open row to read or write.
* /WE, Write Enable. This signal determines whether a given falling edge of /CAS is a read (if high) or write (if low). If low, the data inputs are also captured on the falling edge of /CAS.
* /OE, Output Enable. This is an additional signal that controls output to the data I/O pins. The data pins are driven by the DRAM chip if /RAS and /CAS are low, and /WE is high, and /OE is low. In many applications, /OE can be permanently connected low (output always enabled), but it can be useful when connecting multiple memory chips in parallel.
This interface provides direct control of internal timing. When /RAS is driven low, a /CAS cycle must not be attempted until the sense amplifiers have sensed the memory state, and /RAS must not be returned high until the storage cells have been refreshed. When /RAS is driven high, it must be held high long enough for precharging to complete.
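The control-signal rules in the list above can be summarized as a small truth-table sketch; this models only the decoding logic, not timing, and all names are illustrative.

```python
# Sketch of the decoding rules for the four active-low control signals listed
# above. Signals are modeled as electrical levels (0 = low/asserted,
# 1 = high/deasserted). Illustration of the truth table, not a timing model.

def data_pins_driven(ras, cas, we, oe):
    """The chip drives the data pins only during a read with output enabled."""
    return ras == 0 and cas == 0 and we == 1 and oe == 0

def on_cas_falling_edge(we, address, data_in=None):
    """A /CAS falling edge latches a column address and is a read or a write."""
    column = address            # column address captured on the falling edge
    if we == 0:                 # /WE low: write; data inputs are captured too
        return ("write", column, data_in)
    return ("read", column)     # /WE high: read

# /OE tied low (always enabled), row open (/RAS low), read cycle in progress:
assert data_pins_driven(ras=0, cas=0, we=1, oe=0)
assert on_cas_falling_edge(we=1, address=0x2A) == ("read", 0x2A)
assert on_cas_falling_edge(we=0, address=0x2A, data_in=0b1010) == ("write", 0x2A, 0b1010)
```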
Video DRAM (VRAM)
VRAM is a dual-ported variant of DRAM which was once commonly used to store the frame-buffer in some graphics adaptors.
It was invented by F. Dill and R. Matick at IBM Research in 1980, with a patent issued in 1985 (US Patent 4,541,075). The first commercial use of VRAM was in the high-resolution graphics adapter introduced in 1986 by IBM for the PC/RT system.
VRAM has two sets of data output pins, and thus two ports that can be used simultaneously. The first port, the DRAM port, is accessed by the host computer in a manner very similar to traditional DRAM. The second port, the video port, is typically read-only and is dedicated to providing a high-speed data channel for the graphics chipset.
Typical DRAM arrays normally access a full row of bits (i.e. a word line), up to 1024 bits at one time, but only use one or a few of these for actual data, the remainder being discarded. Since DRAM cells are destructively read, each bit accessed must be sensed and re-written, so typically 1024 sense amplifiers are used. VRAM operates by not discarding the excess bits which must be accessed, but making full use of them in a simple way. If each horizontal scan line of a display is mapped to a full word, then upon reading one word and latching all 1024 bits into a separate row buffer, these bits can subsequently be serially streamed to the display circuitry. This leaves the DRAM array free to be accessed (read or write) for many cycles, until the row buffer is almost depleted. A complete DRAM read cycle is only required to fill the row buffer, leaving most DRAM cycles available for normal accesses.
Such operation is described in the paper "All points addressable raster display memory" by R. Matick, D. Ling, S. Gupta, and F. Dill, IBM Journal of R&D, Vol. 28, No. 4, July 1984, pp. 379-393. To use the video port, the controller first uses the DRAM port to select the row of the memory array that is to be displayed. The VRAM then copies that entire row to an internal row buffer, which is a shift register. The controller can then continue to use the DRAM port for drawing objects on the display. Meanwhile, the controller feeds a clock called the shift clock (SCLK) to the VRAM's video port. Each SCLK pulse causes the VRAM to deliver the next item of data, in strict address order, from the shift register to the video port. For simplicity, the graphics adapter is usually designed so that the contents of a row, and therefore the contents of the shift register, correspond to a complete horizontal line on the display.
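A toy model of this dual-port behavior, with the 1024-bit row width taken from the text and everything else assumed, might look like this:

```python
from collections import deque

# Toy model of the dual-port behavior described above: the DRAM port loads a
# whole row into the internal shift register, and each SCLK pulse shifts the
# next bit out of the video port. Sizes other than the 1024-bit row are assumed.

class ToyVRAM:
    def __init__(self, rows=512, row_bits=1024):
        self.array = [[0] * row_bits for _ in range(rows)]   # the DRAM array
        self.shift_register = deque()                        # internal row buffer

    # DRAM port: random write access by the host/controller.
    def write(self, row, col, bit):
        self.array[row][col] = bit

    # DRAM port: transfer an entire row into the shift register.
    def load_row_to_shift_register(self, row):
        self.shift_register = deque(self.array[row])

    # Video port: one SCLK pulse streams out the next bit in address order.
    def sclk(self):
        return self.shift_register.popleft()

vram = ToyVRAM()
vram.write(3, 0, 1)
vram.load_row_to_shift_register(3)              # one full DRAM cycle
scanline = [vram.sclk() for _ in range(1024)]   # streamed without touching the array
assert scanline[0] == 1
```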
In the late 1990s, standard DRAM technologies (e.g. SDRAM) became cheap, dense, and fast enough to completely displace VRAM, even though it was only single-ported and some memory bits were wasted.
Fast Page Mode DRAM (FPM)
A 256Kx4 DRAM on an early PC memory card
Fast page mode DRAM is also called FPM DRAM, Page mode DRAM, Fast page mode memory, or Page mode memory.
In page mode, a row of the DRAM can be kept "open" by holding /RAS low while performing multiple reads or writes with separate pulses of /CAS, so that successive reads or writes within the row do not suffer the delay of precharging and accessing the row. This increases the performance of the system when reading or writing bursts of data.
Static column is a variant of page mode in which the column address does not need to be strobed in, but rather, the address inputs may be changed with /CAS held low, and the data output will be updated accordingly a few nanoseconds later.
Nibble mode is another variant in which four sequential locations within the row can be accessed with four consecutive pulses of /CAS. The difference from normal page mode is that the address inputs are not used for the second through fourth /CAS edges; they are generated internally starting with the address supplied for the first /CAS edge.
CAS before RAS refresh
Classic asynchronous DRAM is refreshed by opening each row in turn. This can be done by supplying a row address and pulsing /RAS low; it is not necessary to perform any /CAS cycles. An external counter is needed to iterate over the row addresses in turn.
For convenience, the counter was quickly incorporated into RAM chips themselves. If the /CAS line is driven low before /RAS (normally an illegal operation), then the DRAM ignores the address inputs and uses an internal counter to select the row to open. This is known as /CAS-before-/RAS (CBR) refresh.
Extended Data Out (EDO) DRAM
A pair of 32 MiB EDO DRAM modules.
EDO DRAM is similar to Fast Page Mode DRAM with the additional feature that a new access cycle can be started while keeping the data output of the previous cycle active. This allows a certain amount of overlap in operation (pipelining), allowing somewhat improved speed. It was 5% faster than Fast Page Mode DRAM, which it began to replace in 1993.
To be precise, EDO DRAM begins data output on the falling edge of /CAS, but does not stop the output when /CAS rises again. It holds the output valid (thus extending the data output time) until either /RAS is deasserted, or a new /CAS falling edge selects a different column address.
Single-cycle EDO has the ability to carry out a complete memory transaction in one clock cycle. Otherwise, each sequential RAM access within the same page takes two clock cycles instead of three, once the page has been selected. EDO's speed and capabilities allowed it to somewhat replace the then-slow L2 caches of PCs. It created an opportunity to reduce the immense performance loss associated with a lack of L2 cache, while making systems cheaper to build. This was also beneficial for notebooks, given their limited form factor and battery-life constraints. An EDO system with L2 cache was tangibly faster than the older FPM/L2 combination.
Single-cycle EDO DRAM became very popular on video cards towards the end of the 1990s. It was very low cost, yet nearly as efficient for performance as the far more costly VRAM.
EDO was sometimes referred to as Hyper Page Mode.
Burst EDO (BEDO) DRAM
An evolution of EDO, Burst EDO DRAM could process four memory addresses in one burst, for a maximum of 5-1-1-1, saving an additional three clocks over optimally designed EDO memory. This was done by adding an address counter on the chip to keep track of the next address. BEDO also added a pipeline stage allowing the page-access cycle to be divided into two parts. During a memory-read operation, the first part accessed the data from the memory array to the output stage (second latch). The second part drove the data bus from this latch at the appropriate logic level. Since the data is already in the output buffer, faster access time is achieved (up to 50% for large blocks of data) than with traditional EDO.
Although BEDO DRAM showed additional optimization over EDO, by the time it was available the market had made a significant investment towards synchronous DRAM, or SDRAM [2]. Even though BEDO RAM was superior to SDRAM in some ways, the latter technology gained significant traction and quickly displaced BEDO.
Multibank DRAM (MDRAM)
Multibank RAM applies the interleaving technique for main memory to second level cache memory to provide a cheaper and faster alternative to SRAM. The chip splits its memory capacity into small blocks of 256 kB and allows operations to two different banks in a single clock cycle.
This memory was primarily used in graphic cards with Tseng Labs ET6x00 chipsets, and was made by MoSys. Boards based upon this chipset often used the unusual RAM size configuration of 2.25 MiB, owing to MDRAM's ability to be implemented in various sizes more easily. This size of 2.25 MiB allowed 24-bit color at a resolution of 1024×768, a very popular display setting in the card's time.
Synchronous Graphics RAM (SGRAM)
SGRAM is a specialized form of SDRAM for graphics adaptors. It adds functions such as bit masking (writing to a specified bit plane without affecting the others) and block write (filling a block of memory with a single colour). Unlike VRAM and WRAM, SGRAM is single-ported. However, it can open two memory pages at once, which simulates the dual-port nature of other video RAM technologies.
SGRAM and SDRAM became the most popular types of DRAM at the end of the 1990s, and well into the first decade of the 2000s.
Synchronous Dynamic RAM (SDRAM)
Single Data Rate (SDR) SDRAM is a synchronous form of DRAM.
Direct Rambus DRAM (DRDRAM)
Direct RAMBUS DRAM (DRDRAM).....
Double Data Rate (DDR) SDRAM
Double data rate (DDR) SDRAM was a later development of SDRAM, used in PC memory beginning in 2000. DDR2 SDRAM was originally seen as a minor enhancement of DDR SDRAM that mainly afforded higher clock speeds and somewhat deeper pipelining. However, with the introduction and rapid acceptance of multi-core CPUs in 2006, it was generally expected in the industry that DDR2 would supersede the existing DDR SDRAM standard. Further, with the development and anticipated introduction of DDR3 SDRAM in 2007, it was anticipated that DDR3 would rapidly replace the more limited DDR and the newer DDR2.
Pseudostatic RAM (PSRAM)
PSRAM or PSDRAM is dynamic RAM with built-in refresh and address-control circuitry to make it behave similarly to static RAM (SRAM). It combines the high density of DRAM with the ease of use of true SRAM.
Some DRAM components have a "self-refresh mode". While this involves much of the same logic that is needed for pseudo-static operation, this mode is often equivalent to a standby mode. It is provided primarily to allow a system to suspend operation of its DRAM controller to save power without losing data stored in DRAM, not to allow operation without a separate DRAM controller as is the case with PSRAM.
An embedded variant of pseudostatic RAM is sold by MoSys under the name 1T-SRAM. It is technically DRAM, but behaves much like SRAM. It is used in the Nintendo GameCube and Wii consoles.
1T DRAM
Unlike all of the other variants described here, 1T DRAM is actually a different way of constructing the basic DRAM bit cell. 1T DRAM is a "capacitorless" bit cell design that stores data in the parasitic body capacitor that is an inherent part of Silicon on Insulator transistors. Considered a nuisance in logic design, this floating body effect can be used for data storage. Although refresh is still required, reads are non-destructive; the stored charge causes a detectable shift in the threshold voltage of the transistor. (Sallese, Jean-Michel (2002-06-20), "Principles of the 1T Dynamic Access Memory Concept on SOI", MOS Modeling and Parameter Extraction Group Meeting. Retrieved 2007-10-07.)
1T DRAM is commercialized under the name Z-RAM.
Note that the classic one-transistor/one-capacitor (1T/1C) DRAM cell is also sometimes referred to as "1T DRAM".
RLDRAM
Reduced Latency DRAM is a high speed double data rate (DDR) SDRAM that combines fast, random access with high bandwidth. RLDRAM is mainly designed for networking and caching applications.
Security
Although dynamic memory requires power and periodic refresh to maintain its data with negligible error, the data is still retained until the memory-cell capacitors are discharged, which does not happen immediately when power is removed. Over a period of time (ranging from seconds to minutes), depending on the properties of the semiconductor and the temperature, the data will decay and eventually be lost.[6]
This property can be used to recover "secure" data kept in memory by quickly rebooting the computer and dumping the contents of the RAM or by cooling the chips and transferring them to a different computer. Such an attack was demonstrated to circumvent Microsoft's BitLocker Drive Encryption[6].
See also
* DRAM price fixing
* DIMM
* Flash memory
* Regenerative capacitor memory
* Static random access memory
* List of device bandwidths
References
1. Toshiba "Toscal" BC-1411 Desktop Calculator.
2. http://parts.jpl.nasa.gov/docs/DRAM_Indiv-00.pdf
3. d47b (asynchronous DRAM data sheet, 1998).
4. cmx1024-3200.ai
5. http://www.corsairmemory.com/corsair/products/specs/twinx1024-3200xl.pdf
6. Center for Information Technology Policy, "Lest We Remember: Cold Boot Attacks on Encryption Keys", citp.princeton.edu, 2008-02-22.
External links
* Basic DRAM operation has some interesting historical trend charts of cell size and DRAM density from 1995.
* Back to Basics - Memory, part 3
* Benefits of Chipkill-Correct ECC for PC Server Main Memory - A 1997 discussion of SDRAM reliability - some interesting information on "soft errors" from cosmic rays, especially with respect to Error-correcting code schemes
* Tezzaron Semiconductor Soft Error White Paper - a 1994 literature review of memory error rate measurements.
* Soft errors' impact on system reliability - Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004
* Scaling and Technology Issues for Soft Error Rates - A Johnston - 4th Annual Research Conference on Reliability Stanford University, October 2000
* Challenges and future directions for the scaling of dynamic random-access memory (DRAM) - J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, and C. J. Radens, IBM 2002
* Ars Technica: RAM Guide
* Versatile DRAM interface for the 6502 CPU
* David Tawei Wang (2005). "Modern DRAM Memory Systems: Performance Analysis and a High Performance, Power-Constrained DRAM-Scheduling Algorithm". PhD thesis, University of Maryland, College Park. Retrieved on 2007-03-10. A detailed description of current DRAM technology.
* The Toshiba "Toscal" BC-1411 Desktop Calculator - An early electronic calculator that uses a form of dynamic RAM built from discrete components.
* Mitsubishi's 3D-RAM And Cache DRAM incorporate high-speed, on-board SRAM cache
* Multi-port Cache DRAM - MP-RAM
* DRAM timings explained
Retrieved from "http://en.wikipedia.org/wiki/Dynamic_random_access_memory"
Volatile
* DRAM, e.g. DDR SDRAM
* SRAM
* Upcoming
o Z-RAM
o TTRAM
* Historical
o Williams tube
o Delay line memory
Non-volatile
* ROM
o PROM
o EAROM
o EPROM
o EEPROM
* Flash memory
* Upcoming
o FeRAM
o MRAM
o Memristor
o PRAM
o SONOS
o RRAM
o Racetrack memory
o NRAM
* Historical
o Drum memory
o Magnetic core memory
o Bubble memory
o Twistor memory
Dynamic random access memory (DRAM) is a type of random access memory that stores each bit of data in a separate capacitor within an integrated circuit. Since real capacitors leak charge, the information eventually fades unless the capacitor charge is refreshed periodically. Because of this refresh requirement, it is a dynamic memory as opposed to SRAM and other static memory.
The advantage of DRAM is its structural simplicity: only one транзистор and a конденсатор are required per bit, compared to six transistors in SRAM. This allows DRAM to reach very high density. Like SRAM, it is in the class of volatile memory devices, since it loses its data when the power supply is removed. Unlike SRAM however, data may still be recovered for a short time after power-off.
Contents
[hide]
* 1 History
* 2 Operation principle
o 2.1 Memory timing
* 3 Errors and error correction
* 4 DRAM packaging
o 4.1 General DRAM packaging formats
o 4.2 Common DRAM modules
* 5 Variations
o 5.1 Asynchronous DRAM
o 5.2 Video DRAM (VRAM)
o 5.3 Fast Page Mode DRAM (FPM)
o 5.4 CAS before RAS refresh
o 5.5 Extended Data Out (EDO) DRAM
o 5.6 Burst EDO (BEDO) DRAM
o 5.7 Multibank DRAM (MDRAM)
o 5.8 Synchronous Graphics RAM (SGRAM)
o 5.9 Synchronous Dynamic RAM (SDRAM)
o 5.10 Direct Rambus DRAM (DRDRAM)
o 5.11 Double Data Rate (DDR) SDRAM
o 5.12 Pseudostatic RAM (PSRAM)
o 5.13 1T DRAM
o 5.14 RLDRAM
* 6 Security
* 7 See also
* 8 References
* 9 External links
[edit] History
This section requires expansion.
Schematic drawing of original designs of DRAM patented in 1968.
Schematic drawing of original designs of DRAM patented in 1968.
1964 Arnold Farber and Eugene Schlig working for IBM created a memory cell that was hard wired; using a transistor gate and tunnel diode latch, they later replaced the latch with two transistors and two resistors and this became known as the Farber-Schlig cell. 1965 Benjamin Agusta and his team working for IBM managed to create a 16-bit silicon chip memory cell based on the Farber-Schlig cell which consisted of 80 transistors, 64 resistors and 4 diodes. 1966 DRAM was invented by Dr. Robert Dennard at the IBM Thomas J. Watson Research Center and he was awarded U.S. patent number 3,387,286 in 1968. Capacitors had been used for earlier memory schemes such as the drum of the Atanasoff–Berry Computer, the Williams tube and the Selectron tube.
The Toshiba "Toscal" BC-1411 electronic calculator, which went into production in November 1965, uses a form of dynamic RAM built from discrete components.[1]
In 1969, Honeywell asked Intel to make a DRAM using a 3-transistor cell that they had developed. This became the Intel 1102 (1024x1) in early 1970. However the 1102 had many problems, prompting Intel to begin work on their own improved design (secretly to avoid conflict with Honeywell). This became the first commercially available 1-transistor cell DRAM, the Intel 1103 (1024x1) in October 1970 (despite initial problems with low yield, until the 5th revision of the masks).
The first DRAM with multiplexed row/column address lines was the Mostek MK4096 (4096x1) in 1973. Mostek held an 85% market share of the dynamic random access memory (DRAM) memory chip market worldwide, until being eclipsed by Japanese DRAM manufacturers who offered equivalent chips at lower prices.
[edit] Operation principle
Principle of operation of DRAM read, for simple 4 by 4 array.
Principle of operation of DRAM read, for simple 4 by 4 array.
Principle of operation of DRAM write, for simple 4 by 4 array.
Principle of operation of DRAM write, for simple 4 by 4 array.
DRAM is usually arranged in a square array of one capacitor and transistor per cell. The illustrations to the right show a simple example with only 4 by 4 cells (modern DRAM can be thousands of cells in length/width).
The long lines connecting each row are known as word lines. Each column is actually composed of two bit lines, each one connected to every other storage cell in the column. They are generally known as the + and − bit lines. A sense amplifier is essentially a pair of cross-connected inverters between the bit lines. That is, the first inverter is connected from the + bit line to the − bit line, and the second is connected from the − bit line to the + bit line. This is an example of positive feedback, and the arrangement is only stable with one bit line high and one bit line low.
To read a bit from a column, the following operations take place:
1. The sense amplifier is switched off and the bit lines are precharged to exactly matching voltages that are intermediate between high and low logic levels. The bit lines are constructed symmetrically to keep them balanced as precisely as possible.
2. The precharge circuit is switched off. Because the bit lines are very long, their capacitance will hold the precharge voltage for a brief time. This is an example of dynamic logic.
3. The selected row's word line is driven high. This connects one storage capacitor to one of the two bit lines. Charge is shared between the selected storage cell and the appropriate bit line, slightly altering the voltage on the line. Although every effort is made to keep the capacitance of the storage cells high and the capacitance of the bit lines low, capacitance is proportional to physical size, and the length of the bit lines means that the net effect is a very small perturbation of one bit line's voltage.
4. The sense amplifier is switched on. The positive feedback takes over and amplifies the small voltage difference until one bit line is fully low and the other is fully high. At this point, the column can be read.
5. At the end of a read cycle, the row values must be restored to the capacitors, which were depleted during the read: the bit line of the storage cell is also driven to full voltage (refreshed) by the action of the sense amplifier. Due to the length of the bit line, this takes significant time beyond the end of sense amplification.
To write to memory, the row is opened and a given column's sense amplifier is temporarily forced to the desired state and drives the bit line which charges the capacitor to the desired value. The amplifier will then drive the bit lines to the desired state and hold it stable even after the forcing is removed.During a write to a particular cell, the entire row is read out, one value changed, and then the entire row is written back in, as illustrated in the figure to the right.
Typically, manufacturers specify that each row should be refreshed every 64 ms or less, according to the JEDEC (Foundation for developing Semiconductor Standards) standard. Refresh logic is commonly used with DRAMs to automate the periodic refresh. This makes the circuit more complicated, but this drawback is usually outweighed by the fact that DRAM is much cheaper and of greater capacity than SRAM. Some systems refresh every row in a tight loop that occurs once every 64 ms. Other systems refresh one row at a time -- for example, a system with 213 = 8192 rows would require a refresh rate of one row every 7.8 µs (64 ms / 8192 rows). A few real-time systems refresh a portion of memory at a time based on an external timer that governs the operation of the rest of the system, such as the vertical blanking interval that occurs every 10 to 20 ms in video equipment. All methods require some sort of counter to keep track of which row is the next to be refreshed. Some DRAM chips include that counter; other kinds require external refresh logic to hold that counter. (Under some conditions, most of the data in DRAM can be recovered even if the DRAM has not been refreshed for several minutes.[2])
[edit] Memory timing
There are many numbers required to describe the speed of DRAM operation. Here are some examples for two speed grades of asynchronous DRAM, from a data sheet published in 1998:[3]
"50 ns" "60 ns" Description
tRC 84 ns 104 ns Random read or write cycle time (from one full /RAS cycle to another)
tRAC 50 ns 60 ns Access time: /RAS low to valid data out
tRCD 11 ns 14 ns /RAS low to /CAS low time
tRAS 50 ns 60 ns /RAS pulse width (minimum /RAS low time)
tRP 30 ns 40 ns /RAS precharge time (minimum /RAS high time)
tPC 20 ns 25 ns Page-mode read or write cycle time (/CAS to /CAS)
tAA 25 ns 30 ns Access time: Column address valid to valid data out (includes address setup time before /CAS low)
tCAC 13 ns 15 ns Access time: /CAS low to valid data out
tCAS 8 ns 10 ns /CAS low pulse width minimum
Thus, the generally quoted number is the /RAS access time. This is the time to read a random bit from a precharged DRAM array. The time to read additional bits from an open page is much less.
When such a RAM is accessed by clocked logic, the times are generally rounded up to the nearest clock cycle. For example, when accessed by a 100 MHz state machine (i.e. a 10 ns clock), the 50 ns DRAM can perform the first read in 5 clock cycles, and additional reads within the same page every 2 clock cycles. This was generally described as "5-2-2-2" timing, as bursts of 4 reads within a page were common.
When describing synchronous memory, timing is also described by clock cycle counts separated by hyphens, but the numbers have very different meanings! These numbers represent tCAS-tRCD-tRP-tRAS in multiples of the DRAM clock cycle time. Note that this is half of the data transfer rate when double data rate signaling is used. JEDEC standard PC3200 timing is 3-4-4-8[4] with a 200 MHz clock, while premium-priced high-speed PC3200 DDR DRAM DIMM might be operated at 2-2-2-5 timing.[5]
Cycles time Cycles time Description
tCL 3 15 ns 2 10 ns /CAS low to valid data out (equivalent to tCAC)
tRCD 4 20 ns 2 10 ns /RAS low to /CAS low time
tRP 4 20 ns 2 10 ns /RAS precharge time (minimum precharge to active time)
tRAS 8 40 ns 5 25 ns Minimum row active time (minimum active to precharge time)
It is worth noting that the improvement over 10 years is not that large. Minimum random access time has improved from 50 ns to tRCD + tCL = 35 ns, and even the premium 20 ns variety is only 2.5× better. However, the DDR memory does achieve 8 times higher bandwidth; due to internal pipelining and wide data paths, it can output one word every 2.5 ns, while the EDO DRAM can only output one word per tPC = 20 ns.
[edit] Errors and error correction
Main article: ECC memory#Error-correcting memory
Electrical or magnetic interference inside a computer system can cause a single bit of DRAM to spontaneously flip to the opposite state. Some research has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of cosmic rays, which may change the contents of one or more memory cells, or interfere with the circuitry used to read/write them. There is some concern that as DRAM density increases further, and thus the components on DRAM chips get smaller, whilst at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently - since lower energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse this trend.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an ECC. Parity allows the detection of a single-bit error (actually, any odd number of wrong bits). The most common error correcting code, Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
Error detection and correction in computer systems seems to go in and out of fashion. Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600.[1] He included parity in the CDC 7600, and reputedly said "I learned that a lot of farmers buy computers." 486-era PCs often used parity.[citation needed] Pentium-era ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Current microprocessor memory controllers generally support ECC but most non-server systems do not use these features. Even if they do, it is not clear that the software layers do their part.
Memory controllers in most modern PCs can typically detect, and correct errors of a single bit per 64 bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64 bit word. Some systems also 'scrub' the errors, by writing the corrected version back to memory. The BIOS in some computers, and operating systems such as Linux, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic. Unfortunately, most modern PCs are supplied with memory modules that have no parity or ECC bits.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits are in the same chip. This weakness does not seem to be widely addressed; one exception is Chipkill.
A reasonable rule of thumb is to expect one bit error, per month, per gigabyte of memory. Actual error rates vary widely.[ecc]
[edit] DRAM packaging
For economic reasons, the large (main) memories found in personal computers, workstations, and non-handheld game-consoles (such as Playstation and Xbox) normally consists of dynamic RAM (DRAM). Other parts of the computer, such as cache memories and data buffers in hard disks, normally use static RAM (SRAM).
[edit] General DRAM packaging formats
DDR2 SDRAM packages
DDR2 SDRAM packages
Common DRAM packages
Common DRAM packages
EDO DRAM memory module
EDO DRAM memory module
Dynamic random access memory is produced as integrated circuits (ICs) bonded and mounted into plastic packages with metal pins for connection to control signals and buses. Today, these DRAM packages are in turn often assembled into plug-in modules for easier handling. Some standard module types are:
* DRAM chip (Integrated Circuit or IC)
o Dual in-line Package (DIP)
* DRAM (memory) modules
o Single In-line Pin Package (SIPP)
o Single In-line Memory Module (SIMM)
o Dual In-line Memory Module (DIMM)
o Rambus In-line Memory Module (RIMM), technically DIMMs but called RIMMs due to their proprietary slot.
o Small outline DIMM (SO-DIMM), about half the size of regular DIMMs, are mostly used in notebooks, small footprint PCs (such as Mini-ITX motherboards), upgradable office printers and networking hardware like routers. Comes in versions with:
+ 72 pins (32-bit)
+ 144 pins (64-bit)
+ 200 pins (72-bit)
o Small outline RIMM (SO-RIMM). Smaller version of the RIMM, used in laptops. Technically SO-DIMMs but called SO-RIMMs due to their proprietary slot.
* Stacked v. non-stacked RAM modules
o Stacked RAM modules contain two or more RAM chips stacked on top of each other. This allows large modules (like 512mb or 1Gig SO-DIMM) to be manufactured using cheaper low density wafers. Stacked chip modules draw more power.
[edit] Common DRAM modules
Common DRAM packages as illustrated to the right, from top to bottom:
1. DIP 16-pin (DRAM chip, usually pre-FPRAM)
2. SIPP (usually FPRAM)
3. SIMM 30-pin (usually FPRAM)
4. SIMM 72-pin (so-called "PS/2 SIMM", usually EDO RAM)
5. DIMM 168-pin (SDRAM)
6. DIMM 184-pin (DDR SDRAM)
7. RIMM 184-pin
8. DIMM 240-pin (DDR2 SDRAM/DDR3 SDRAM)
[edit] Variations
DRAM types
* FPM RAM
* EDO RAM
* Burst EDO RAM
* SDRAM
o SDR SDRAM
o DDR SDRAM
o DDR2 SDRAM
o DDR3 SDRAM
o Rambus RAM
+ XDR DRAM
+ XDR2 DRAM
o VC-RAM
* Video RAM
o WRAM
* SGRAM
* GDDR2
* GDDR3
* GDDR4
* GDDR5
This article may require cleanup to meet Wikipedia's quality standards.
Please improve this article if you can. (November 2006)
[edit] Asynchronous DRAM
This is the basic form, from which all others are derived. An asynchronous DRAM chip has power connections, some number of address inputs (typically 12), and a few (typically 1 or 4) bidirectional data lines. There are four active low control signals:
* /RAS, the Row Address Strobe. The address inputs are captured on the falling edge of /RAS, and select a row to open. The row is held open as long as /RAS is low.
* /CAS, the Column Address Strobe. The address inputs are captured on the falling edge of /CAS, and select a column from the currently open row to read or write.
* /WE, Write Enable. This signal determines whether a given falling edge of /CAS is a read (if high) or write (if low). If low, the data inputs are also captured on the falling edge of /CAS.
* /OE, Output Enable. This is an additional signal that controls output to the data I/O pins. The data pins are driven by the DRAM chip if /RAS and /CAS are low, and /WE is high, and /OE is low. In many applications, /OE can be permanently connected low (output always enabled), but it can be useful when connecting multiple memory chips in parallel.
This interface provides direct control of internal timing. When /RAS is driven low, a /CAS cycle must not be attempted until the sense amplifiers have sensed the memory state, and /RAS must not be returned high until the storage cells have been refreshed. When /RAS is driven high, it must be held high long enough for precharging to complete.
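As a rough illustration of the /RAS and /CAS sequencing described above, the following toy C model treats the cell array and the sense amplifiers as plain arrays. The array dimensions, function names, and one-byte-per-bit storage are assumptions made for clarity, not a description of any real device, and the timing constraints are reduced to the order of the function calls.

/* Toy behavioural model of an asynchronous DRAM read. */
#include <stdint.h>
#include <stdio.h>

#define ROWS 4096
#define COLS 1024

static uint8_t cells[ROWS][COLS];   /* the capacitor array (one bit per byte here) */
static uint8_t row_buffer[COLS];    /* the sense amplifiers */
static int open_row = -1;

/* /RAS falling edge: latch the row address and sense (destructively read) the row. */
static void ras_fall(int row) {
    for (int c = 0; c < COLS; c++) row_buffer[c] = cells[row][c];
    open_row = row;
}

/* /CAS falling edge with /WE high and /OE low: read one column of the open row. */
static uint8_t cas_read(int col) { return row_buffer[col]; }

/* /RAS rising edge: write the sensed row back, then precharge. */
static void ras_rise(void) {
    for (int c = 0; c < COLS; c++) cells[open_row][c] = row_buffer[c];
    open_row = -1;
}

int main(void) {
    cells[7][42] = 1;                    /* pretend this bit was written earlier */
    ras_fall(7);                         /* open row 7 */
    printf("bit = %d\n", cas_read(42));  /* strobe column 42: prints 1 */
    ras_rise();                          /* restore the row and precharge */
    return 0;
}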
Video DRAM (VRAM)
VRAM is a dual-ported variant of DRAM which was once commonly used to store the frame-buffer in some graphics adaptors.
It was invented by F. Dill and R. Matick at IBM Research in 1980, with a patent issued in 1985 (US Patent 4,541,075). The first commercial use of VRAM was in the high-resolution graphics adapter introduced by IBM in 1986 with the PC/RT system.
VRAM has two sets of data output pins, and thus two ports that can be used simultaneously. The first port, the DRAM port, is accessed by the host computer in a manner very similar to traditional DRAM. The second port, the video port, is typically read-only and is dedicated to providing a high-speed data channel for the graphics chipset.
Typical DRAM arrays normally access a full row of bits (i.e. a word line), up to 1024 bits at a time, but use only one or a few of these for actual data, the remainder being discarded. Since DRAM cells are destructively read, each bit accessed must be sensed and re-written, so typically 1024 sense amplifiers are used. VRAM operates by not discarding the excess bits which must be accessed, but by making full use of them in a simple way. If each horizontal scan line of a display is mapped to a full word, then upon reading one word and latching all 1024 bits into a separate row buffer, these bits can subsequently be serially streamed to the display circuitry. This leaves the DRAM array free to be accessed (read or written) for many cycles, until the row buffer is almost depleted. A complete DRAM read cycle is only required to fill the row buffer, leaving most DRAM cycles available for normal accesses.
Such operation is described in the paper "All points addressable raster display memory" by R. Matick, D. Ling, S. Gupta, and F. Dill, IBM Journal of R&D, Vol. 28, No. 4, July 1984, pp. 379-393. To use the video port, the controller first uses the DRAM port to select the row of the memory array that is to be displayed. The VRAM then copies that entire row to an internal row buffer which is a shift register. The controller can then continue to use the DRAM port for drawing objects on the display. Meanwhile, the controller feeds a clock called the shift clock (SCLK) to the VRAM's video port. Each SCLK pulse causes the VRAM to deliver the next item of data, in strict address order, from the shift register to the video port. For simplicity, the graphics adapter is usually designed so that the contents of a row, and therefore the contents of the shift register, correspond to a complete horizontal line on the display.
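The row-buffer trick can be sketched in a few lines of C. The 1024-bit row size follows the figure quoted above; the function names and the byte-per-bit representation are purely illustrative.

/* Toy model of a VRAM video port: one DRAM-port row transfer fills a
 * shift register, after which SCLK pulses stream the bits out without
 * touching the DRAM array again. */
#include <stdint.h>
#include <stdio.h>

#define ROW_BITS 1024

static uint8_t array_row[ROW_BITS];   /* one word line of the DRAM array */
static uint8_t shift_reg[ROW_BITS];   /* the internal row buffer / shift register */
static int shift_pos;

static void dram_port_row_transfer(void) {   /* costs one full DRAM cycle */
    for (int i = 0; i < ROW_BITS; i++) shift_reg[i] = array_row[i];
    shift_pos = 0;
}

static uint8_t sclk_pulse(void) {            /* video port: next bit in address order */
    return shift_reg[shift_pos++];
}

int main(void) {
    for (int i = 0; i < ROW_BITS; i++) array_row[i] = i & 1;
    dram_port_row_transfer();                /* one array access per scan line */
    for (int i = 0; i < 8; i++)              /* the display then clocks bits out */
        printf("%d", sclk_pulse());          /* prints 01010101 */
    printf("\n");
    return 0;
}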
In the late 1990s, standard DRAM technologies (e.g. SDRAM) became cheap, dense, and fast enough to completely displace VRAM, even though it was only single-ported and some memory bits were wasted.
Fast Page Mode DRAM (FPM)
A 256Kx4 DRAM on an early PC memory card
Fast page mode DRAM is also called FPM DRAM, Page mode DRAM, Fast page mode memory, or Page mode memory.
In page mode, a row of the DRAM can be kept "open" by holding /RAS low while performing multiple reads or writes with separate pulses of /CAS, so that successive reads or writes within the row do not suffer the delay of precharging and re-accessing the row. This increases the performance of the system when reading or writing bursts of data.
Static column is a variant of page mode in which the column address does not need to be strobed in, but rather, the address inputs may be changed with /CAS held low, and the data output will be updated accordingly a few nanoseconds later.
Nibble mode is another variant in which four sequential locations within the row can be accessed with four consecutive pulses of /CAS. The difference from normal page mode is that the address inputs are not used for the second through fourth /CAS edges; they are generated internally starting with the address supplied for the first /CAS edge.
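A back-of-the-envelope way to see why page mode helps: if opening a row, strobing a column, and precharging each cost a few cycles, then reusing an open row amortizes the row-open and precharge costs over the whole burst. The cycle counts in this C sketch are placeholders, not datasheet values.

/* Rough cost model for page-mode versus row-per-access operation. */
#include <stdio.h>

#define T_ROW 3   /* assumed cycles to open a row (/RAS low, sensing) */
#define T_COL 2   /* assumed cycles per /CAS access within an open row */
#define T_PRE 2   /* assumed cycles to precharge after closing the row */

static int page_mode_cycles(int accesses_in_row) {
    return T_ROW + accesses_in_row * T_COL + T_PRE;
}

static int row_per_access_cycles(int accesses) {
    return accesses * (T_ROW + T_COL + T_PRE);   /* reopen the row every time */
}

int main(void) {
    printf("4 reads, page mode:      %d cycles\n", page_mode_cycles(4));      /* 13 */
    printf("4 reads, row per access: %d cycles\n", row_per_access_cycles(4)); /* 28 */
    return 0;
}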
CAS before RAS refresh
Classic asynchronous DRAM is refreshed by opening each row in turn. This can be done by supplying a row address and pulsing /RAS low; it is not necessary to perform any /CAS cycles. An external counter is needed to iterate over the row addresses in turn.
For convenience, the counter was quickly incorporated into RAM chips themselves. If the /CAS line is driven low before /RAS (normally an illegal operation), then the DRAM ignores the address inputs and uses an internal counter to select the row to open. This is known as /CAS-before-/RAS (CBR) refresh.
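The difference between the two refresh styles can be sketched as follows; the helper functions and the 4096-row figure are assumptions for illustration only.

/* Illustrative refresh loops: the external-counter style versus CBR. */
#include <stdio.h>

#define ROWS 4096

static void open_and_close_row(int row) { (void)row; /* would pulse /RAS with this row address */ }
static void pulse_cas_before_ras(void)  { /* the chip uses and increments its internal counter */ }

static void ras_only_refresh(void) {
    static int next_row = 0;            /* the row counter lives in the memory controller */
    open_and_close_row(next_row);
    next_row = (next_row + 1) % ROWS;
}

static void cbr_refresh(void) {
    pulse_cas_before_ras();             /* no external address or counter needed */
}

int main(void) {
    for (int i = 0; i < ROWS; i++) {    /* one full refresh pass with each method */
        ras_only_refresh();
        cbr_refresh();
    }
    puts("refreshed every row once with each method");
    return 0;
}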
Window RAM (WRAM)
It was used by Matrox on both their MGA Millennium and Millennium II graphics cards, and by Nintendo in their Game Boy Advance range.
Extended Data Out (EDO) DRAM
A pair of 32 MiB EDO DRAM modules.
EDO DRAM is similar to Fast Page Mode DRAM with the additional feature that a new access cycle can be started while keeping the data output of the previous cycle active. This allows a certain amount of overlap in operation (pipelining), allowing somewhat improved speed. It was 5% faster than Fast Page Mode DRAM, which it began to replace in 1993.
To be precise, EDO DRAM begins data output on the falling edge of /CAS, but does not stop the output when /CAS rises again. It holds the output valid (thus extending the data output time) until either /RAS is deasserted, or a new /CAS falling edge selects a different column address.
Single-cycle EDO has the ability to carry out a complete memory transaction in one clock cycle. Otherwise, each sequential RAM access within the same page takes two clock cycles instead of three, once the page has been selected. EDO's speed and capabilities allowed it to somewhat replace the then-slow L2 caches of PCs: it reduced the large performance loss associated with the lack of an L2 cache while making systems cheaper to build. This was also attractive for notebooks, with their limited form factor and battery life. An EDO system with an L2 cache was tangibly faster than the older FPM/L2 combination.
Single-cycle EDO DRAM became very popular on video cards towards the end of the 1990s. It was very low cost, yet nearly as efficient for performance as the far more costly VRAM.
EDO was sometimes referred to as Hyper Page Mode.
Burst EDO (BEDO) DRAM
An evolution of EDO, Burst EDO DRAM could process four memory addresses in one burst, for a maximum of 5-1-1-1, saving an additional three clocks over optimally designed EDO memory. This was done by adding an address counter on the chip to keep track of the next address. BEDO also added a pipeline stage, allowing the page-access cycle to be divided into two components. During a memory-read operation, the first component accessed the data from the memory array to the output stage (second latch). The second component drove the data bus from this latch at the appropriate logic level. Since the data is already in the output buffer, faster access times are achieved (up to 50% for large blocks of data) than with traditional EDO.
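To put rough numbers on this, assume a four-word burst and the figures quoted above: if plain fast page mode needs something like 5-3-3-3 = 14 bus clocks, then EDO, with sequential in-page accesses taking two clocks instead of three, needs 5-2-2-2 = 11, and BEDO's 5-1-1-1 needs 8, which is the additional three clocks saved over EDO. The 5-3-3-3 figure for page mode is only an assumption used to make the comparison concrete.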
Although BEDO DRAM showed additional optimization over EDO, by the time it was available the market had made a significant investment towards synchronous DRAM, or SDRAM [2]. Even though BEDO RAM was superior to SDRAM in some ways, the latter technology gained significant traction and quickly displaced BEDO.
Multibank DRAM (MDRAM)
Multibank RAM applies the interleaving technique used for main memory to second-level cache memory, to provide a cheaper and faster alternative to SRAM. The chip splits its memory capacity into small blocks of 256 kB and allows operations on two different banks in a single clock cycle.
This memory was primarily used in graphic cards with Tseng Labs ET6x00 chipsets, and was made by MoSys. Boards based upon this chipset often used the unusual RAM size configuration of 2.25 MiB, owing to MDRAM's ability to be implemented in various sizes more easily. This size of 2.25 MiB allowed 24-bit color at a resolution of 1024×768, a very popular display setting in the card's time.
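As a quick check of that figure: 1024 × 768 pixels × 3 bytes per pixel (24-bit colour) = 2,359,296 bytes, which is exactly 2.25 MiB, so the unusual module size holds one such frame buffer with nothing left over.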
Synchronous Graphics RAM (SGRAM)
SGRAM is a specialized form of SDRAM for graphics adaptors. It adds functions such as bit masking (writing to a specified bit plane without affecting the others) and block write (filling a block of memory with a single colour). Unlike VRAM and WRAM, SGRAM is single-ported. However, it can open two memory pages at once, which simulates the dual-port nature of other video RAM technologies.
SGRAM and SDRAM became the most popular types of DRAM at the end of the 1990s, and well into the first decade of the 2000s.
Synchronous Dynamic RAM (SDRAM)
Single Data Rate (SDR) SDRAM is a synchronous form of DRAM.
Direct Rambus DRAM (DRDRAM)
Direct Rambus DRAM (DRDRAM) is a proprietary, high-bandwidth synchronous DRAM interface developed by Rambus Inc.; modules using it are the RIMMs mentioned above.
Double Data Rate (DDR) SDRAM
Double data rate (DDR) SDRAM was a later development of SDRAM, used in PC memory beginning in 2000. DDR2 SDRAM was originally seen as a minor enhancement of DDR SDRAM (suited to the then-standard single-core CPU) that mainly afforded higher clock speeds and somewhat deeper pipelining. However, with the introduction and rapid acceptance of multi-core CPUs in 2006, it was widely expected in the industry that DDR2 would displace the existing DDR SDRAM standard. Further, with the development and anticipated introduction of DDR3 SDRAM in 2007, it was anticipated that DDR3 would rapidly replace the more limited DDR and the newer DDR2.
Pseudostatic RAM (PSRAM)
PSRAM or PSDRAM is dynamic RAM with built-in refresh and address-control circuitry to make it behave similarly to static RAM (SRAM). It combines the high density of DRAM with the ease of use of true SRAM.
Some DRAM components have a "self-refresh mode". While this involves much of the same logic that is needed for pseudo-static operation, this mode is often equivalent to a standby mode. It is provided primarily to allow a system to suspend operation of its DRAM controller to save power without losing data stored in DRAM, not to allow operation without a separate DRAM controller as is the case with PSRAM.
An embedded variant of pseudostatic RAM is sold by MoSys under the name 1T-SRAM. It is technically DRAM, but behaves much like SRAM. It is used in Nintendo Gamecube and Wii consoles.
1T DRAM
Unlike all of the other variants described here, 1T DRAM is actually a different way of constructing the basic DRAM bit cell. 1T DRAM is a "capacitorless" bit-cell design that stores data in the parasitic body capacitor that is an inherent part of silicon-on-insulator (SOI) transistors. Considered a nuisance in logic design, this floating-body effect can be used for data storage. Although refresh is still required, reads are non-destructive; the stored charge causes a detectable shift in the threshold voltage of the transistor (Sallese, Jean-Michel (2002-06-20), "Principles of the 1T Dynamic Access Memory Concept on SOI", MOS Modeling and Parameter Extraction Group Meeting; retrieved 2007-10-07).
1T DRAM is commercialized under the name Z-RAM.
Note that the classic one-transistor/one-capacitor (1T/1C) DRAM cell is also sometimes referred to as "1T DRAM".
RLDRAM
Reduced Latency DRAM is a high speed double data rate (DDR) SDRAM that combines fast, random access with high bandwidth. RLDRAM is mainly designed for networking and caching applications.
Security
Although dynamic memory requires power and periodic refresh to maintain its data with negligible error, the data is still retained until the memory cell capacitors discharge, which does not happen instantly. Over a period of time (ranging from seconds to minutes), depending on the properties of the semiconductor and the temperature, the data will decay and eventually be lost.[6]
This property can be used to recover "secure" data kept in memory by quickly rebooting the computer and dumping the contents of the RAM or by cooling the chips and transferring them to a different computer. Such an attack was demonstrated to circumvent Microsoft's BitLocker Drive Encryption[6].
See also
* DRAM price fixing
* DIMM
* Flash memory
* Regenerative capacitor memory
* Static random access memory
* List of device bandwidths
References
1. ^ Toshiba "Toscal" BC-1411 Desktop Calculator
2. ^ http://parts.jpl.nasa.gov/docs/DRAM_Indiv-00.pdf
3. ^ d47b
4. ^ cmx1024-3200.ai
5. ^ http://www.corsairmemory.com/corsair/products/specs/twinx1024-3200xl.pdf
6. ^ a b Center for Information Technology Policy, "Lest We Remember: Cold Boot Attacks on Encryption Keys", citp.princeton.edu, 2008-02-22.
External links
* Basic DRAM operation has some interesting historical trend charts of cell size and DRAM density from 1995.
* Back to Basics - Memory, part 3
* Benefits of Chipkill-Correct ECC for PC Server Main Memory - A 1997 discussion of SDRAM reliability - some interesting information on "soft errors" from cosmic rays, especially with respect to Error-correcting code schemes
* Tezzaron Semiconductor Soft Error White Paper - a 1994 literature review of memory error rate measurements.
* Soft errors' impact on system reliability - Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004
* Scaling and Technology Issues for Soft Error Rates - A Johnston - 4th Annual Research Conference on Reliability Stanford University, October 2000
* Challenges and future directions for the scaling of dynamic random-access memory (DRAM) - J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, and C. J. Radens, IBM 2002
* Ars Technica: RAM Guide
* Versatile DRAM interface for the 6502 CPU
* David Tawei Wang (2005). "Modern DRAM Memory Systems: Performance Analysis and a High Performance, Power-Constrained DRAM-Scheduling Algorithm". PhD thesis, University of Maryland, College Park. Retrieved on 2007-03-10. A detailed description of current DRAM technology.
* The Toshiba "Toscal" BC-1411 Desktop Calculator - An early electronic calculator that uses a form of dynamic RAM built from discrete components.
* Mitsubishi's 3D-RAM And Cache DRAM incorporate high-speed, on-board SRAM cache
* Multi-port Cache DRAM - MP-RAM
* DRAM timings explained
Sunday, May 11, 2008
Virtual memory
Virtual memory is a computer system technique which gives an application program the impression that it has contiguous working memory, while in fact its memory may be physically fragmented and may even overflow onto disk storage. Systems that use this technique make programming of large applications easier and use real physical memory (e.g. RAM) more efficiently than those without virtual memory.
Note that "virtual memory" is not just "using disk space to extend physical memory size". Extending memory is a normal consequence of using virtual memory techniques, but can be done by other means such as overlays or swapping programs and their data completely out to disk while they are inactive. The definition of "virtual memory" is based on tricking programs into thinking they are using large blocks of contiguous addresses.
All modern general-purpose computer operating systems use virtual memory techniques for ordinary applications, such as word processors, spreadsheets, multimedia players, accounting, etc. Few older operating systems, such as the DOS of the 1980s or those for the mainframes of the 1960s, had virtual memory functionality; notable exceptions were the Atlas and the B5000.
Embedded systems and other special-purpose computer systems which require very fast, very consistent response time do not generally use virtual memory.
Contents
* 1 Implementation techniques
o 1.1 Paged virtual memory
+ 1.1.1 Page tables
+ 1.1.2 Paging
+ 1.1.3 Dynamic address translation
+ 1.1.4 Paging supervisor
+ 1.1.5 Permanently resident pages
# 1.1.5.1 Virtual=real operation
o 1.2 Segmented virtual memory
* 2 Avoiding thrashing
* 3 History
* 4 See also
* 5 References
* 6 External links
Implementation techniques
Paged virtual memory
Almost all implementations of virtual memory divide the virtual address space of an application program into pages; a page is a block of contiguous virtual memory addresses. Pages are usually at least 4K bytes in size, and systems with large virtual address ranges or large amounts of real memory (e.g. RAM) generally use larger page sizes.
Page tables
Almost all implementations use page tables to translate the virtual addresses seen by the application program into physical addresses (also referred to as "real addresses") used by the hardware to process instructions. Each entry in a page table contains: the starting virtual address of the page; either the real memory address at which the page is actually stored or an indicator that the page is currently held in a disk file (if the system uses disk files to let applications use amounts of virtual memory which exceed real memory).
Systems can have one page table for the whole system or a separate page table for each application. If there is only one, different applications which are running at the same time share a single virtual address space, i.e. they use different parts of a single range of virtual addresses. Systems which use multiple page tables provide multiple virtual address spaces - concurrent applications think they are using the same range of virtual addresses, but their separate page tables redirect to different real addresses.
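A minimal C sketch of such a page-table entry is shown below; the field names and widths are illustrative, since real formats are dictated by the processor architecture.

/* Illustrative page-table entry: either a real frame or a disk location. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t virtual_page;  /* starting virtual address (or page number) of the page */
    bool     present;       /* true: in real memory; false: held in a disk file */
    uint64_t real_frame;    /* real (physical) frame address, valid when present */
    uint64_t disk_slot;     /* location in the page file, valid when not present */
    uint64_t last_used;     /* timestamp maintained for the paging supervisor */
} page_table_entry;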
Paging
Paging is the process of saving inactive virtual memory pages to disk and restoring them to real memory when required.
Most virtual memory systems enable programs to use virtual address ranges which in total exceed the amount of real memory (e.g. RAM). To do this they use disk files to save virtual memory pages which are not currently active, and restore them to real memory when they are needed. Pages are not necessarily restored to the same real addresses from which they were saved - applications are aware only of virtual addresses. Usually the real memory to which a page is restored contains another virtual memory page which has been used recently, and which must therefore be saved to disk.
Dynamic address translation
When a CPU fetches an instruction located at a particular virtual address or, while executing an instruction, fetches data from a particular virtual address or stores data to a particular virtual address, the virtual address must be translated to the corresponding physical address. This is done by a hardware component, sometimes called a memory management unit, which looks up the real address (from the page table) corresponding to a virtual address and passes the real address to the parts of the CPU which execute instructions. If the page tables indicate that the virtual memory page is not currently in real memory, the hardware raises a page fault exception (a special internal signal) which invokes the paging supervisor component of the operating system.
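The lookup itself is simple enough to sketch in C. The sketch below assumes 4 KiB pages, a tiny single-level table, and a boolean return value standing in for the page fault exception; real MMUs do this in hardware, usually with multi-level tables and a TLB.

/* Illustrative dynamic address translation with 4 KiB pages. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 16u

typedef struct { bool present; uint32_t frame; } pte;

static pte page_table[NUM_PAGES];

/* Returns true and fills *phys on success; false stands for a page fault,
 * which would invoke the paging supervisor. */
static bool translate(uint32_t virt, uint32_t *phys) {
    uint32_t page   = virt / PAGE_SIZE;
    uint32_t offset = virt % PAGE_SIZE;
    if (page >= NUM_PAGES || !page_table[page].present)
        return false;
    *phys = page_table[page].frame * PAGE_SIZE + offset;
    return true;
}

int main(void) {
    page_table[2] = (pte){ .present = true, .frame = 7 };
    uint32_t phys;
    if (translate(2 * PAGE_SIZE + 100, &phys))
        printf("virtual 0x%x -> physical 0x%x\n", 2 * PAGE_SIZE + 100, phys);
    if (!translate(5 * PAGE_SIZE, &phys))
        printf("page fault at virtual 0x%x\n", 5 * PAGE_SIZE);
    return 0;
}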
Paging supervisor
This part of the operating system creates and manages the page tables. If the dynamic address translation hardware raises a page fault exception, the paging supervisor searches the page file(s) (on disk) for the page containing the required virtual address, reads it into real physical memory, updates the page tables to reflect the new location of the virtual address and finally tells the dynamic address translation mechanism to start the search again. Usually all of the real physical memory is already in use and the paging supervisor must first save an area of real physical memory to disk and update the page table to say that the associated virtual addresses are no longer in real physical memory but saved on disk. Paging supervisors generally save and overwrite areas of real physical memory which have been least recently used, because these are probably the areas which are used least often. So every time the dynamic address translation hardware matches a virtual address with a real physical memory address, it must put a time-stamp in the page table entry for that virtual address.
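The least-recently-used choice described above can be sketched as a scan over timestamped entries. The structure below is illustrative; real systems usually approximate LRU rather than keep exact per-page timestamps.

/* Illustrative victim selection: oldest timestamp among resident frames. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool present; uint64_t last_used; } frame_info;

/* Returns the index of the least recently used resident frame, or -1. */
int choose_victim(const frame_info *frames, int n) {
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        if (frames[i].present && frames[i].last_used < oldest) {
            oldest = frames[i].last_used;
            victim = i;
        }
    }
    return victim;
}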
Permanently resident pages
All virtual memory systems have memory areas that are "pinned down", i.e. cannot be swapped out to secondary storage, for example:
* Interrupt mechanisms generally rely on an array of pointers to the handlers for various types of interrupt (I/O completion, timer event, program error, page fault, etc.). If the pages containing these pointers or the code that they invoke were pageable, interrupt-handling would become even more complex and time-consuming; and it would be especially difficult in the case of page fault interrupts.
* The page tables are usually not pageable.
* Data buffers that are accessed outside of the CPU, for example by peripheral devices that use direct memory access (DMA) or by I/O channels. Usually such devices and the buses (connection paths) to which they are attached use physical memory addresses rather than virtual memory addresses. Even on buses with an IOMMU, which is a special memory management unit that can translate virtual addresses used on an I/O bus to physical addresses, the transfer cannot be stopped if a page fault occurs and then restarted when the page fault has been processed. So pages containing locations to which or from which a peripheral device is transferring data are either permanently pinned down or pinned down while the transfer is in progress.
* Any other kernel or application areas in which operation is very timing dependent and cannot allow the variation in response time which paging causes.
Virtual=real operation
In MVS, z/OS, and similar OSes, some parts of the system's memory are managed in virtual=real mode, where every virtual address corresponds to a real address. Those are:
* interrupt mechanisms
* paging supervisor and page tables
* all data buffers accessed by I/O channels
* application programs which use non-standard methods of managing I/O and therefore provide their own buffers and communicate directly with peripherals (programs that create their own channel command words).
In IBM's early virtual memory operating systems virtual=real mode was the only way to "pin down" pages. z/OS has 3 modes, V=V (virtual=virtual; fully pageable), V=R and V=F (virtual = fixed, i.e. "pinned down" but with DAT operating).[1]
Segmented virtual memory
Some systems, such as the Burroughs large systems, do not use paging to implement virtual memory. Instead, they use segmentation, so that an application's virtual address space is divided into variable-length segments. A virtual address consists of a segment number and an offset within the segment.
Memory is still physically addressed with a single number (called an absolute or linear address). To obtain it, the processor looks up the segment number in a segment table to find a segment descriptor.[2] The segment descriptor contains a flag indicating whether the segment is present in main memory and, if it is, the address in main memory of the beginning of the segment (the segment's base address) and the length of the segment. The offset within the segment is checked against the length of the segment; if it is not smaller, an interrupt is generated. If a segment is not present in main memory, a hardware interrupt is raised to the operating system, which may try to read the segment into main memory, i.e. swap it in. The operating system might have to remove other segments (swap them out) from main memory in order to make room for the segment to be read in.
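A C sketch of that descriptor lookup, with made-up structure fields and fault codes standing in for the hardware's presence check, bounds check, and base-plus-offset addition:

/* Illustrative segmented address translation. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool present; uint32_t base; uint32_t length; } segment_descriptor;

enum result { OK, SEGMENT_FAULT, BOUNDS_FAULT };

static enum result translate_seg(const segment_descriptor *table, int nsegs,
                                 int seg, uint32_t offset, uint32_t *linear) {
    if (seg >= nsegs || !table[seg].present)
        return SEGMENT_FAULT;            /* OS may try to swap the segment in */
    if (offset >= table[seg].length)
        return BOUNDS_FAULT;             /* offset beyond the segment: interrupt */
    *linear = table[seg].base + offset;  /* absolute (linear) address */
    return OK;
}

int main(void) {
    segment_descriptor table[2] = {
        { true, 0x10000, 0x2000 },       /* segment 0: present, 8 KiB long */
        { false, 0, 0 }                  /* segment 1: swapped out */
    };
    uint32_t lin;
    if (translate_seg(table, 2, 0, 0x100, &lin) == OK)
        printf("(segment 0, offset 0x100) -> 0x%x\n", lin);
    if (translate_seg(table, 2, 1, 0, &lin) == SEGMENT_FAULT)
        printf("segment 1 not present -> fault\n");
    return 0;
}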
Notably, the Intel 80286 supported a similar segmentation scheme as an option, but it was unused by most operating systems.
It is possible to combine segmentation and paging, usually dividing each segment into pages. In systems that combine them, such as Multics and the IBM System/38 and IBM System i machines, virtual memory is usually implemented with paging, with segmentation used to provide memory protection.[3][4][5] With the Intel 80386 and later IA-32 processors, the segments reside in a 32-bit linear paged address space, so segments can be moved into and out of that linear address space, and pages in that linear address space can be moved in and out of main memory, providing two levels of virtual memory; however, few if any operating systems do so. Instead, they only use paging.
The difference between virtual memory implementations using pages and using segments is not only that memory is divided into fixed-size and variable-size units, respectively. In some systems, e.g. Multics, or the later System/38 and Prime machines, the segmentation was actually visible to the user processes, as part of the semantics of the memory model. In other words, instead of a process just having a memory which looked like a single large vector of bytes or words, its memory was more structured. This is different from using pages, which does not change the model visible to the process. This had important consequences.
A segment was not just a "page with a variable length", or a simple way to lengthen the address space (as in the Intel 80286). In Multics, segmentation was a very powerful mechanism that was used to provide a single-level virtual memory model, in which there was no differentiation between "process memory" and "file system": a process' active address space consisted only of a list of segments (files) which were mapped into its potential address space, both code and data.[6] This is not the same as the later mmap function in Unix, because inter-file pointers do not work when mapping files into semi-arbitrary places. Multics had such an addressing mode built into most instructions. In other words, it could perform relocated inter-segment references, thus eliminating the need for a linker completely.[7] This also worked when different processes mapped the same file into different places in their private address spaces.[8]
Avoiding thrashing
All implementations need to avoid a problem called "thrashing", where the computer spends too much time shuffling blocks of virtual memory between real memory and disks, and therefore appears to work slower. Better design of application programs can help, but ultimately the only cure is to install more real memory. For more information see Paging.
History
In the 1940s and 1950s, before the development of virtual memory, all larger programs had to contain logic for managing two-level storage (primary and secondary, today's analogues being RAM and hard disk), such as overlaying techniques. Programs were responsible for moving overlays back and forth from secondary storage to primary storage.
The main reason for introducing virtual memory was therefore not simply to extend primary memory, but to make such an extension as easy to use for programmers as possible.[7]
Many systems already had the ability to divide memory between multiple programs (required for multiprogramming and multiprocessing), provided for example by "base and bounds registers" on early models of the PDP-10, without providing virtual memory. That gave each application a private address space starting at an address of 0. Each address in the private address space was checked against a bounds register to make sure it was within the section of memory allocated to the application and, if it was, the contents of the corresponding base register were added to it to give an address in main memory. This is a simple form of segmentation without virtual memory.
Virtual memory was developed in approximately 1959–1962, at the University of Manchester for the Atlas Computer, completed in 1962.[9] However, Fritz-Rudolf Güntsch, one of Germany's pioneering computer scientists and later the developer of the Telefunken TR 440 mainframe, claims to have invented the concept in 1957 in his doctoral dissertation Logischer Entwurf eines digitalen Rechengerätes mit mehreren asynchron laufenden Trommeln und automatischem Schnellspeicherbetrieb (Logic Concept of a Digital Computing Device with Multiple Asynchronous Drum Storage and Automatic Fast Memory Mode).
In 1961, Burroughs released the B5000, the first commercial computer with virtual memory.[10][11] It used segmentation rather than paging.
Like many technologies in the history of computing, virtual memory was not accepted without challenge. Before it could be implemented in mainstream operating systems, many models, experiments, and theories had to be developed to overcome numerous problems. Dynamic address translation required specialized, expensive, and hard-to-build hardware; moreover, it initially slowed down access to memory slightly.[7] There were also worries that new system-wide algorithms for utilizing secondary storage would be far less effective than previously used application-specific ones.
By 1969 the debate over virtual memory for commercial computers was over.[7] An IBM research team led by David Sayre showed that the virtual memory overlay system consistently worked better than the best manually controlled systems.
Possibly the first minicomputer to introduce virtual memory was the Norwegian NORD-1. During the 1970s, other minicomputers implemented virtual memory, notably VAX models running VMS.
Virtual memory was introduced to the x86 architecture with the protected mode of the Intel 80286 processor. At first it was done with segment swapping, which became inefficient with larger segments. The Intel 80386 introduced support for paging underneath the existing segmentation layer. The page fault exception could be chained with other exceptions without causing a double fault.
See also
* Physical memory and its physical address
* Memory address
o Address space
o Virtual address space
* CPU design
* Page (computing)
o Page table
o Paging
o Working set
o Memory management unit
o Cache algorithms
o Page replacement algorithm
* Segmentation (memory)
o System/38
* Memory management
* Memory allocation
* Protected mode, the x86 name for virtual memory addressing
Note that "virtual memory" is not just "using disk space to extend physical memory size". Extending memory is a normal consequence of using virtual memory techniques, but can be done by other means such as overlays or swapping programs and their data completely out to disk while they are inactive. The definition of "virtual memory" is based on tricking programs into thinking they are using large blocks of contiguous addresses.
All modern general-purpose computer operating systems use virtual memory techniques for ordinary applications, such as word processors, spreadsheets, multimedia players, accounting, etc. Few older operating systems, such as DOS of the 1980s, or those for the mainframes of the 1960s, had virtual memory functionality - notable exceptions being the Atlas and B5000.
Embedded systems and other special-purpose computer systems which require very fast, very consistent response time do not generally use virtual memory.
Contents
[hide]
* 1 Implementation techniques
o 1.1 Paged virtual memory
+ 1.1.1 Page tables
+ 1.1.2 Paging
+ 1.1.3 Dynamic address translation
+ 1.1.4 Paging supervisor
+ 1.1.5 Permanently resident pages
# 1.1.5.1 Virtual=real operation
o 1.2 Segmented virtual memory
* 2 Avoiding thrashing
* 3 History
* 4 See also
* 5 References
* 6 External links
[edit] Implementation techniques
[edit] Paged virtual memory
Almost all implementations of virtual memory divide the virtual address space of an application program into pages; a page is a block of contiguous virtual memory addresses. Pages are usually at least 4K bytes in size, and systems with large virtual address ranges or large amounts of real memory (e.g. RAM) generally use larger page sizes.
[edit] Page tables
Almost all implementations use page tables to translate the virtual addresses seen by the application program into physical addresses (also referred to as "real addresses") used by the hardware to process instructions. Each entry in a page table contains: the starting virtual address of the page; either the real memory address at which the page is actually stored or an indicator that the page is currently held in a disk file (if the system uses disk files to let applications use amounts of virtual memory which exceed real memory).
Systems can have one page table for the whole system or a separate page table for each application. If there is only one, different applications which are running at the same time share a single virtual address space, i.e. they use different parts of a single range of virtual addresses. Systems which use multiple page tables provide multiple virtual address spaces - concurrent applications think they are using the same range of virtual addresses, but their separate page tables redirect to different real addresses.
[edit] Paging
Paging is the process of saving inactive virtual memory pages to disk and restoring them to real memory when required.
Most virtual memory systems enable programs to use virtual address ranges which in total exceed the amount of real memory (e.g. RAM). To do this they use disk files to save virtual memory pages which are not currently active, and restore them to real memory when they are needed. Pages are not necessarily restored to the same real addresses from which they were saved - applications are aware only of virtual addresses. Usually the real memory to which a page is restored contains another virtual memory page which has been used recently, and which must therefore be saved to disk.
[edit] Dynamic address translation
When a CPU fetches an instruction located at a particular virtual address or, while executing an instruction, fetches data from a particular virtual address or stores data to a particular virtual address, the virtual address must be translated to the corresponding physical address. This is done by a hardware component, sometimes called a memory management unit, which looks up the real address (from the page table) corresponding to a virtual address and passes the real address to the parts of the CPU which execute instructions. If the page tables indicate that the virtual memory page is not currently in real memory, the hardware raises a page fault exception (special internal signal) which invokes the paging supervisor component of the operating system
Paging supervisor
This part of the operating system creates and manages the page tables. If the dynamic address translation hardware raises a page fault exception, the paging supervisor searches the page file(s) (on disk) for the page containing the required virtual address, reads it into real physical memory, updates the page tables to reflect the new location of the virtual address and finally tells the dynamic address translation mechanism to start the search again. Usually all of the real physical memory is already in use and the paging supervisor must first save an area of real physical memory to disk and update the page table to say that the associated virtual addresses are no longer in real physical memory but saved on disk. Paging supervisors generally save and overwrite areas of real physical memory which have been least recently used, because these are probably the areas which are used least often. So every time the dynamic address translation hardware matches a virtual address with a real physical memory address, it must put a time-stamp in the page table entry for that virtual address.
[edit] Permanently resident pages
All virtual memory systems have memory areas that are "pinned down", i.e. cannot be swapped out to secondary storage, for example:
* Interrupt mechanisms generally rely on an array of pointers to the handlers for various types of interrupt (I/O completion, timer event, program error, page fault, etc.). If the pages containing these pointers or the code that they invoke were pageable, interrupt-handling would become even more complex and time-consuming; and it would be especially difficult in the case of page fault interrupts.
* The page tables are usually not pageable.
* Data buffers that are accessed outside of the CPU, for example by peripheral devices that use direct memory access (DMA) or by I/O channels. Usually such devices and the buses (connection paths) to which they are attached use physical memory addresses rather than virtual memory addresses. Even on buses with an IOMMU, which is a special memory management unit that can translate virtual addresses used on an I/O bus to physical addresses, the transfer cannot be stopped if a page fault occurs and then restarted when the page fault has been processed. So pages containing locations to which or from which a peripheral device is transferring data are either permanently pinned down or pinned down while the transfer is in progress.
* Any other kernel or application areas in which operation is very timing dependent and cannot allow the variation in response time which paging causes.
[edit] Virtual=real operation
In MVS, z/OS, and similar OSes, some parts of the systems memory are managed in virtual=real mode, where every virtual address corresponds to a real address. Those are:
* interrupt mechanisms
* paging supervisor and page tables
* all data buffers accessed by I/O channels[citation needed]
* application programs which use non-standard methods of managing I/O and therefore provide their own buffers and communicate directly with peripherals (programs that create their own channel command words).
In IBM's early virtual memory operating systems virtual=real mode was the only way to "pin down" pages. z/OS has 3 modes, V=V (virtual=virtual; fully pageable), V=R and V=F (virtual = fixed, i.e. "pinned down" but with DAT operating).[1]
[edit] Segmented virtual memory
Some systems, such as the Burroughs large systems, do not use paging to implement virtual memory. Instead, they use segmentation, so that an application's virtual address space is divided into variable-length segments. A virtual address consists of a segment number and an offset within the segment.
Memory is still physically addressed with a single number (called absolute or linear address). To obtain it, the processor looks up the segment number in a segment table to find a segment descriptor.[2] The segment descriptor contains a flag indicating whether the segment is present in main memory and, if it is, the address in main memory of the beginning of the segment (segment's base address) and the length of the segment. It is checked whether the offset within the segment is less than the length of the segment and, if it isn't, an interrupt is generated. If a segment is not present in main memory, hardware interrupt is raised to the operating system, which may try to read the segment into main memory, or to swap in. The operating system might have to remove other segments (swap out) from main memory in order to make room in main memory for the segment to be read in.
Notably, the Intel 80286 supported similar segmentation scheme as an option, but it was unused by most operating systems.
It is possible to combine segmentation and paging, usually dividing each segment into pages. In systems that combine them, such as Multics and the IBM System/38 and IBM System i machines, virtual memory is usually implemented with paging, with segmentation used to provide memory protection.[3][4][5] With the Intel 80386 and later IA-32 processors, the segments reside in a 32-bit linear paged address space, so segments can be moved into and out of that linear address space, and pages in that linear address space can be moved in and out of main memory, providing two levels of virtual memory; however, few if any operating systems do so. Instead, they only use paging.
The difference between virtual memory implementations using pages and using segments is not only about the memory division with fixed and variable sizes, respectively. In some systems, e.g. Multics, or later System/38 and Prime machines, the segmentation was actually visible to the user processes, as part of the semantics of a memory model. In other words, instead of a process just having a memory which looked like a single large vector of bytes or words, it was more structured. This is different from using pages, which doesn't change the model visible to the process. This had important consequences.
Segment wasn't just a "page with a variable length", or a simple way to lengthen the address space (as in Intel 80286). In Multics, the segmentation was a very powerful mechanism that was used to provide a single-level virtual memory model, in which there was no differentiation between "process memory" and "file system" - a process' active address space consisted only a list of segments (files) which were mapped into its potential address space, both code and data. [6] It is not the same as the later mmap function in Unix, because inter-file pointers don't work when mapping files into semi-arbitrary places. Multics had such addressing mode built into most instructions. In other words it could perform relocated inter-segment references, thus eliminating the need for a linker completely.[7] This also worked when different processes mapped the same file into different places in their private address spaces.[8]
[edit] Avoiding thrashing
All implementations need to avoid a problem called "thrashing", where the computer spends too much time shuffling blocks of virtual memory between real memory and disks, and therefore appears to work slower. Better design of application programs can help, but ultimately the only cure is to install more real memory. For more information see Paging.
[edit] History
In the 1940s and 1950s, before the development of a virtual memory, all larger programs had to contain logic for managing two-level storage (primary and secondary, today's analogies being RAM and hard disk), such as overlaying techniques. Programs were responsible for moving overlays back and forth from secondary storage to primary.
The main reason for introducing virtual memory was therefore not simply to extend primary memory, but to make such an extension as easy to use for programmers as possible.[7]
Many systems already had the ability to divide the memory between multiple programs (required for multiprogramming and multiprocessing), provided for example by "base and bounds registers" on early models of the PDP-10, without providing virtual memory. That gave each application a private address space starting at an address of 0, with an address in the private address space being checked against a bounds register to make sure it's within the section of memory allocated to the application and, if it is, having the contents of the corresponding base register being added to it to give an address in main memory. This is a simple form of segmentation without virtual memory.
Virtual memory was developed in approximately 1959–1962, at the University of Manchester for the Atlas Computer, completed in 1962.[9] However, Fritz-Rudolf Güntsch, one of Germany's pioneering computer scientists and later the developer of the Telefunken TR 440 mainframe, claims to have invented the concept in 1957 in his doctoral dissertation Logischer Entwurf eines digitalen Rechengerätes mit mehreren asynchron laufenden Trommeln und automatischem Schnellspeicherbetrieb (Logic Concept of a Digital Computing Device with Multiple Asynchronous Drum Storage and Automatic Fast Memory Mode).
In 1961, Burroughs released the B5000, the first commercial computer with virtual memory.[10][11] It used segmentation rather than paging.
Like many technologies in the history of computing, virtual memory was not accepted without challenge. Before it could be implemented in mainstream operating systems, many models, experiments, and theories had to be developed to overcome the numerous problems. Dynamic address translation required a specialized, expensive, and hard to build hardware, moreover initially it slightly slowed down the access to memory.[7] There were also worries that new system-wide algorithms of utilizing secondary storage would be far less effective than previously used application-specific ones.
By 1969 the debate over virtual memory for commercial computers was over.[7] An IBM research team led by David Sayre showed that the virtual memory overlay system consistently worked better than the best manually controlled systems.
Possibly the first minicomputer to introduce virtual memory was the Norwegian NORD-1. During the 1970s, other minicomputers implemented virtual memory, notably VAX models running VMS.
Virtual memory was introduced to the x86 architecture with the protected mode of the Intel 80286 processor. At first it was done with segment swapping, which became inefficient with larger segments. The Intel 80386 introduced support for paging underneath the existing segmentation layer. The page fault exception could be chained with other exceptions without causing a double fault.
[edit] See also
* Physical memory and its physical address
* Memory address
o Address space
o Virtual address space
* CPU design
* Page (computing)
o Page table
o Paging
o Working set
o Memory management unit
o Cache algorithms
o Page replacement algorithm
* Segmentation (memory)
o System/38
* Memory management
* Memory allocation
* Protected mode, a x86's name of virtual memory addressing
Flip-flop (electronics)
In digital circuits, a flip-flop is a kind of bistable multivibrator, an electronic circuit which has two stable states and thereby is capable of serving as one bit of memory. Today, the term flip-flop has come to generally denote non-transparent (clocked or edge-triggered) devices, while the simpler transparent ones are often referred to as latches.
A flip-flop is controlled by (usually) one or two control signals and/or a gate or clock signal. The output often includes the complement as well as the normal output. As flip-flops are implemented electronically, they require power and ground connections.
Contents
* 1 History
* 2 Implementation
* 3 Set-Reset flip-flops (SR flip-flops)
* 4 Toggle flip-flops (T flip-flops)
* 5 JK flip-flop
* 6 D flip-flop
* 7 Master-slave D flip-flop
o 7.1 Edge-triggered D flip-flop
* 8 Uses
* 9 Timing and metastability
* 10 Flip-flop integrated circuits
* 11 See also
* 12 Notes and references
* 13 External links
History
The first electronic flip-flop was invented in 1919 by William Eccles and F. W. Jordan.[1] It was initially called the Eccles-Jordan trigger circuit and consisted of two active elements (vacuum tubes). The name flip-flop was later derived from the sound produced on a speaker connected to the output of one of the back-coupled amplifiers during the trigger process within the circuit. This original electronic flip-flop was transparent, i.e. a simple two-input bistable circuit without any dedicated clock (or even gate) signal, and thus would probably have been labeled a "latch" today.
Implementation
Flip-flops can be either simple (transparent) or clocked. Simple flip-flops can be built from two cross-coupled inverting elements - transistors, or NAND or NOR gates - perhaps augmented by some enable/disable (gating) mechanism. Clocked devices are specially designed for synchronous (time-discrete) systems; such a device ignores its inputs except at the transition of a dedicated clock signal (known as clocking, pulsing, or strobing). This causes the flip-flop to either change or retain its output signal based upon the values of the input signals at the transition. Some flip-flops change output on the rising edge of the clock, others on the falling edge.
Clocked (non-transparent) flip-flops are typically implemented as master-slave devices[2] where two basic flip-flops (plus some additional logic) collaborate to make it insensitive to spikes and noise between the short clock transitions; they nevertheless also often include asynchronous clear or set inputs which may be used to change the current output independent of the clock.
Flip-flops can be further divided into types that have found common applicability in both asynchronous and clocked sequential systems: the SR ("set-reset"), D ("delay"[3]), T ("toggle"), and JK types are the common ones; all of which may be synthesized from (most) other types by a few logic gates. The behavior of a particular type can be described by what is termed the characteristic equation, which derives the "next" (i.e., after the next clock pulse) output, Qnext, in terms of the input signal(s) and/or the current output, Q.
Set-Reset flip-flops (SR flip-flops)
Main article: SR latch
The symbol for an SR latch.
The most fundamental latch is the simple SR latch (or simple SR flip-flop), where S and R stand for set and reset. It can be constructed from a pair of cross-coupled NOR (negative OR) logic gates. The stored bit is present on the output marked Q.
Normally, in storage mode, the S and R inputs are both low, and feedback maintains the Q and Q' outputs in a constant state, with Q' the complement of Q. If S (Set) is pulsed high while R is held low, then the Q output is forced high, and stays high even after S returns low; similarly, if R (Reset) is pulsed high while S is held low, then the Q output is forced low, and stays low even after R returns low.
SR latch operation
S R Action
0 0 Keep state
0 1 Q = 0
1 0 Q = 1
1 1 Unstable combination (see race condition)
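The behaviour in the table can be reproduced with a toy C simulation of the two cross-coupled NOR gates; the fixed number of settling iterations is an artificial stand-in for gate propagation delay.

/* Toy simulation of an SR latch built from two cross-coupled NOR gates. */
#include <stdio.h>

static int NOR(int a, int b) { return !(a || b); }

static int Q = 0, Qn = 1;              /* Qn plays the role of Q' */

static void sr_latch(int S, int R) {
    for (int i = 0; i < 4; i++) {      /* iterate until the outputs settle */
        int newQ  = NOR(R, Qn);
        int newQn = NOR(S, Q);
        Q = newQ;
        Qn = newQn;
    }
}

int main(void) {
    sr_latch(1, 0); printf("set:   Q=%d Q'=%d\n", Q, Qn);   /* Q forced to 1 */
    sr_latch(0, 0); printf("hold:  Q=%d Q'=%d\n", Q, Qn);   /* Q stays 1 */
    sr_latch(0, 1); printf("reset: Q=%d Q'=%d\n", Q, Qn);   /* Q forced to 0 */
    return 0;
}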
Toggle flip-flops (T flip-flops)
A circuit symbol for a T-type flip-flop, where > is the clock input, T is the toggle input and Q is the stored data output.
If the T input is high, the T flip-flop changes state ("toggles") whenever the clock input is strobed. If the T input is low, the flip-flop holds the previous value. This behavior is described by the characteristic equation:
Qnext = T ⊕ Q (or, without the XOR operator, the equivalent: Qnext = T·Q' + T'·Q)
and can be described in a truth table:
T Q Qnext Comment
0 0 0 hold state (no change)
0 1 1 hold state (no change)
1 0 1 toggle
1 1 0 toggle
When T is held high, the toggle flip-flop divides the clock frequency by two; that is, if clock frequency is 4 MHz, the output frequency obtained from the flip-flop will be 2 MHz. This 'divide by' feature has application in various types of digital counters. A T flip-flop can also be built using a JK flip-flop (J & K pins are connected together and act as T) or D flip-flop (T input and Qprevious is connected to the D input through an XOR gate).
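A one-line C rendering of the characteristic equation, used here to show the divide-by-two behaviour when T is held high; the loop count and printout are only for illustration.

/* T flip-flop next-state rule and its divide-by-two use. */
#include <stdio.h>

static int t_flipflop(int T, int Q) { return T ^ Q; }   /* evaluated once per clock edge */

int main(void) {
    int Q = 0;
    for (int edge = 0; edge < 8; edge++) {   /* eight clock edges with T held at 1 */
        Q = t_flipflop(1, Q);
        printf("%d", Q);                     /* prints 10101010: half the clock rate */
    }
    printf("\n");
    return 0;
}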
JK flip-flop
JK flip-flop timing diagram
The JK flip-flop augments the behavior of the SR flip-flop by interpreting the S = R = 1 condition as a "flip" or toggle command. Specifically, the combination J = 1, K = 0 is a command to set the flip-flop; the combination J = 0, K = 1 is a command to reset the flip-flop; and the combination J = K = 1 is a command to toggle the flip-flop, i.e., change its output to the logical complement of its current value. Setting J = K = 0 does not produce D flip-flop behavior; it simply holds the current state. To synthesize a D flip-flop, set K equal to the complement of J. The JK flip-flop is therefore a universal flip-flop, because it can be configured to work as an SR flip-flop, a D flip-flop, or a T flip-flop. Note: the flip-flop shown is positive-edge triggered, as seen in the timing diagram.
A circuit symbol for a JK flip-flop, where > is the clock input, J and K are data inputs, Q is the stored data output, and Q' is the inverse of Q.
The characteristic equation of the JK flip-flop is:
Qnext = J·Q' + K'·Q
and the corresponding truth table is:
J K Qnext Comment
0 0 Qprev hold state
0 1 0 reset
1 0 1 set
1 1 Qprev' toggle
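The characteristic equation can likewise be checked with a few lines of C; the four calls below simply walk through the rows of the truth table.

/* JK flip-flop next-state rule: Qnext = J*Q' + K'*Q. */
#include <stdio.h>

static int jk_flipflop(int J, int K, int Q) { return (J & !Q) | (!K & Q); }

int main(void) {
    int Q = 0;
    Q = jk_flipflop(1, 0, Q); printf("set:    Q=%d\n", Q);   /* 1 */
    Q = jk_flipflop(0, 0, Q); printf("hold:   Q=%d\n", Q);   /* still 1 */
    Q = jk_flipflop(1, 1, Q); printf("toggle: Q=%d\n", Q);   /* 0 */
    Q = jk_flipflop(0, 1, Q); printf("reset:  Q=%d\n", Q);   /* 0 */
    return 0;
}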
The origin of the name for the JK flip-flop is detailed by P. L. Lindley, a JPL engineer, in a letter to EDN, an electronics design magazine. The letter is dated June 13, 1968, and was published in the August edition of the newsletter. In the letter, Mr. Lindley explains that he heard the story of the JK flip-flop from Dr. Eldred Nelson, who is responsible for coining the term while working at Hughes Aircraft. Flip-flops in use at Hughes at the time were all of the type that came to be known as J-K. In designing a logical system, Dr. Nelson assigned letters to flip-flop inputs as follows: #1: A & B, #2: C & D, #3: E & F, #4: G & H, #5: J & K.
Another theory holds that the set and reset inputs were given the symbols "J" and "K" after one of the engineers that helped design the J-K flip-flop, Jack Kilby.
D flip-flop
D flip-flop symbol
The Q output always takes on the state of the D input at the moment of a rising clock edge, and never at any other time.[4] It is called the D flip-flop for this reason: the output takes the value of the D (data) input and delays it by one clock cycle. The D flip-flop can be interpreted as a primitive memory cell, zero-order hold, or delay line.
Truth table:
Clock D Qnext
Rising edge 0 0
Rising edge 1 1
Non-rising X Qprev (unchanged)
('X' denotes a don't-care condition, meaning the signal is irrelevant)
These flip flops are very useful, as they form the basis for shift registers, which are an essential part of many electronic devices. The advantage of the D flip-flop over the D-type latch is that it "captures" the signal at the moment the clock goes high, and subsequent changes of the data line do not influence Q until the next rising clock edge. An exception is that some flip-flops have a 'reset' signal input, which will reset Q (to zero), and may be either asynchronous or synchronous with the clock.
3-bit shift register
The above circuit shifts the contents of the register to the right by one bit position on each active transition of the clock. The input X is shifted into the leftmost bit position.
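A behavioural C sketch of that shift register; the input bit pattern is arbitrary, and updating the rightmost stage first models the fact that all three flip-flops sample their D inputs on the same clock edge.

/* 3-bit shift register built conceptually from three D flip-flops. */
#include <stdio.h>

int main(void) {
    int q[3] = {0, 0, 0};                /* outputs of the three D flip-flops */
    int input[6] = {1, 0, 1, 1, 0, 0};   /* bits arriving on X, one per clock */

    for (int t = 0; t < 6; t++) {
        q[2] = q[1];                     /* every stage captures its neighbour's old output */
        q[1] = q[0];
        q[0] = input[t];                 /* X enters the leftmost bit position */
        printf("clock %d: %d %d %d\n", t, q[0], q[1], q[2]);
    }
    return 0;
}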
Master-slave D flip-flop
A master-slave D flip-flop is created by connecting two gated D latches in series, and inverting the enable input to one of them. It is called master-slave because the second latch in the series only changes in response to a change in the first (master) latch.
A master-slave D flip-flop. It responds on the negative edge of the enable input (usually a clock).
For a positive-edge triggered master-slave D flip-flop, when the clock signal is low (logical 0) the “enable” seen by the first or “master” D latch (the inverted clock signal) is high (logical 1). This allows the “master” latch to store the input value when the clock signal transitions from low to high. As the clock signal goes high (0 to 1) the inverted “enable” of the first latch goes low (1 to 0) and the value seen at the input to the master latch is “locked”. Nearly simultaneously, the twice inverted “enable” of the second or “slave” D latch transitions from low to high (0 to 1) with the clock signal. This allows the signal captured at the rising edge of the clock by the now “locked” master latch to pass through the “slave” latch. When the clock signal returns to low (1 to 0), the output of the "slave" latch is "locked", and the value seen at the last rising edge of the clock is held while the “master” latch begins to accept new values in preparation for the next rising clock edge.
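The same behaviour can be sketched with two level-sensitive latches in C; the clock and data sequences below are arbitrary, and the point is only that Q changes exclusively at rising clock edges even though D changes in between.

/* Behavioural sketch of a positive-edge master-slave D flip-flop. */
#include <stdio.h>

typedef struct { int q; } d_latch;

static void latch_update(d_latch *l, int enable, int d) {
    if (enable) l->q = d;        /* transparent while enabled, holds otherwise */
}

int main(void) {
    d_latch master = {0}, slave = {0};
    int clk_seq[6] = {0, 0, 1, 1, 0, 1};
    int d_seq[6]   = {1, 1, 0, 1, 0, 0};  /* D also changes while clk is high */
    int prev_clk = 0;

    for (int t = 0; t < 6; t++) {
        int clk = clk_seq[t], d = d_seq[t];
        latch_update(&master, !clk, d);        /* master is enabled while clk is low */
        latch_update(&slave,  clk, master.q);  /* slave is enabled while clk is high */
        printf("t=%d clk=%d D=%d Q=%d%s\n", t, clk, d, slave.q,
               (clk && !prev_clk) ? "  <- rising edge" : "");
        prev_clk = clk;
    }
    return 0;
}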
An implementation of a master-slave D flip-flop that is triggered on the positive edge of the clock.
By removing the left-most inverter in the above circuit, a D-type flip flop that strobes on the falling edge of a clock signal can be obtained. This has a truth table like this:
D Q > Qnext
0 X Falling 0
1 X Falling 1
Most D-type flip-flops in ICs have the capability to be set and reset, much like an SR flip-flop. Usually, the illegal S = R = 1 condition is resolved in D-type flip-flops.
Inputs Outputs
S R D > Q Q'
0 1 X X 0 1
1 0 X X 1 0
1 1 X X 1 1
By setting S = R = 0, the flip-flop can be used as described above.
[edit] Edge-triggered D flip-flop
A more efficient way to make a D flip-flop is not as easy to understand, but it works the same way. While the master-slave D flip flop is also triggered on the edge of a clock, its components are each triggered by clock levels. The "edge-triggered D flip flop" does not have the master slave properties.
A positive-edge-triggered D flip-flop.
A positive-edge-triggered D flip-flop.
[edit] Uses
* A single flip-flop can be used to store one bit, or binary digit, of data.
* Static RAM, which is the primary type of memory used in registers to store numbers in computers and in many caches, is built out of flip-flops.
* Any one of the flip-flop types can be used to build any of the others.
* The data contained in several flip-flops may represent the state of a sequencer, the value of a counter, an ASCII character in a computer's memory or any other piece of information.
* One use is to build finite state machines from electronic logic. The flip-flops remember the machine's previous state, and digital logic uses that state to calculate the next state.
* The T flip-flop is useful for constructing various types of counters. Repeated signals to the clock input will cause the flip-flop to change state once per high-to-low transition of the clock input, if its T input is "1". The output from one flip-flop can be fed to the clock input of a second and so on. The final output of the circuit, considered as the array of outputs of all the individual flip-flops, is a count, in binary, of the number of cycles of the first clock input, up to a maximum of 2n-1, where n is the number of flip-flops used. See: Counters
* One of the problems with such a counter (called a ripple counter) is that the output is briefly invalid as the changes ripple through the logic. There are two solutions to this problem. The first is to sample the output only when it is known to be valid. The second, more widely used, is to use a different type of circuit called a synchronous counter. This uses more complex logic to ensure that the outputs of the counter all change at the same, predictable time. See: Counters
* Frequency division: a chain of T flip-flops as described above will also function to divide an input in frequency by 2n, where n is the number of flip-flops used between the input and the output.
[edit] Timing and metastability
A flip-flop in combination with a Schmitt trigger can be used for the implementation of an arbiter in asynchronous circuits.
Clocked flip-flops are prone to a problem called metastability, which happens when a data or control input is changing at the instant of the clock pulse. The result is that the output may behave unpredictably, taking many times longer than normal to settle to its correct state, or even oscillating several times before settling. Theoretically it can take infinite time to settle down. In a computer system this can cause corruption of data or a program crash.
Flip-flop setup, hold and clock-to-output timing parameters.
Flip-flop setup, hold and clock-to-output timing parameters.
The metastability in flip-flops can be avoided by ensuring that the data and control inputs are held valid and constant for specified periods before and after the clock pulse, called the setup time (tsu) and the hold time (th) respectively. These times are specified in the data sheet for the device, and are typically between a few nanoseconds and a few hundred picoseconds for modern devices.
Unfortunately, it is not always possible to meet the setup and hold criteria, because the flip-flop may be connected to a real-time signal that could change at any time, outside the control of the designer. In this case, the best the designer can do is to reduce the probability of error to a certain level, depending on the required reliability of the circuit. One technique for suppressing metastability is to connect two or more flip-flops in a chain, so that the output of each one feeds the data input of the next, and all devices share a common clock. With this method, the probability of a metastable event can be reduced to a negligible value, but never to zero. The probability of metastability gets closer and closer to zero as the number of flip-flops connected in series is increased.
So-called metastable-hardened flip-flops are available, which work by reducing the setup and hold times as much as possible, but even these cannot eliminate the problem entirely. This is because metastability is more than simply a matter of circuit design. When the transitions in the clock and the data are close together in time, the flip-flop is forced to decide which event happened first. However fast we make the device, there is always the possibility that the input events will be so close together that it cannot detect which one happened first. It is therefore logically impossible to build a perfectly metastable-proof flip-flop.
Another important timing value for a flip-flop is the clock-to-output delay (common symbol in data sheets: tCO) or propagation delay (tP), which is the time the flip-flop takes to change its output after the clock edge. The time for a high-to-low transition (tPHL) is sometimes different from the time for a low-to-high transition (tPLH).
When connecting flip-flops in a chain, it is important to ensure that the tCO of the first flip-flop is longer than the hold time (tH) of the second flip-flop, otherwise the second flip-flop will not receive the data reliably. The relationship between tCO and tH is normally guaranteed if both flip-flops are of the same type.
[edit] Flip-flop integrated circuits
Integrated circuit (ICs) exist that provide one or more flip-flops. For example, the 7473 Dual JK Master-Slave Flip-flop or the 74374, an octal D Flip-flop, in the 7400 series.
A flip-flop is controlled by (usually) one or two control signals and/or a gate or clock signal. The output often includes the complement as well as the normal output. As flip-flops are implemented electronically, they require power and ground connections.
History
The first electronic flip-flop was invented in 1919 by William Eccles and F. W. Jordan.[1] It was initially called the Eccles-Jordan trigger circuit and consisted of two active elements (vacuum tubes). The name flip-flop was later derived from the sound produced on a speaker connected to the output of one of the back-coupled amplifiers during the trigger process within the circuit. This original electronic flip-flop was transparent - i.e. a simple two-input bistable circuit without any dedicated clock (or even gate) signal - and thus would probably be labeled a "latch" today.
Implementation
Flip-flops can be either simple (transparent) or clocked. Simple flip-flops can be built from two cross-coupled inverting elements - transistors, NAND gates, or NOR gates - perhaps augmented by some enable/disable (gating) mechanism. Clocked devices are designed for synchronous (time-discrete) systems: such a device ignores its inputs except at the transition of a dedicated clock signal (known as clocking, pulsing, or strobing). This causes the flip-flop either to change or to retain its output signal, based on the values of the input signals at the transition. Some flip-flops change output on the rising edge of the clock, others on the falling edge.
Clocked (non-transparent) flip-flops are typically implemented as master-slave devices[2] where two basic flip-flops (plus some additional logic) collaborate to make it insensitive to spikes and noise between the short clock transitions; they nevertheless also often include asynchronous clear or set inputs which may be used to change the current output independent of the clock.
Flip-flops can be further divided into types that have found common applicability in both asynchronous and clocked sequential systems: the SR ("set-reset"), D ("delay"[3]), T ("toggle"), and JK types are the common ones; all of which may be synthesized from (most) other types by a few logic gates. The behavior of a particular type can be described by what is termed the characteristic equation, which derives the "next" (i.e., after the next clock pulse) output, Qnext, in terms of the input signal(s) and/or the current output, Q.
Set-Reset flip-flops (SR flip-flops)
The symbol for an SR latch.
The most fundamental latch is the simple SR latch (or simple SR flip-flop), where S and R stand for set and reset. It can be constructed from a pair of cross-coupled NOR (NOT-OR) logic gates. The stored bit is present on the output marked Q.
Normally, in storage mode, the S and R inputs are both low, and feedback maintains the Q and Q' outputs in a constant state, with Q' the complement of Q. If S (set) is pulsed high while R is held low, the Q output is forced high, and stays high even after S returns low; similarly, if R (reset) is pulsed high while S is held low, the Q output is forced low, and stays low even after R returns low.
SR latch operation
S R  Action
0 0  Keep state
0 1  Q = 0
1 0  Q = 1
1 1  Unstable combination (see race condition)
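As an illustration of the cross-coupled NOR construction described above, here is a minimal Python sketch (not part of the original article; the function and signal names are illustrative) that iterates the two NOR gates until the outputs settle:

def nor(a, b):
    """NOR gate: output is 1 only when both inputs are 0."""
    return 0 if (a or b) else 1

def sr_latch(s, r, q=0, qbar=1):
    """Cross-coupled NOR SR latch: iterate the two gates until the outputs settle."""
    for _ in range(4):                # a few passes suffice for stable inputs
        q_new = nor(r, qbar)          # Q  = NOR(R, Q')
        qbar_new = nor(s, q_new)      # Q' = NOR(S, Q)
        if (q_new, qbar_new) == (q, qbar):
            break
        q, qbar = q_new, qbar_new
    return q, qbar

state = (0, 1)
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0)]:   # set, hold, reset, hold
    state = sr_latch(s, r, *state)
    print(f"S={s} R={r} -> Q={state[0]}")

Running it prints Q = 1, 1, 0, 0, reproducing the operation table row by row.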
Toggle flip-flops (T flip-flops)
A circuit symbol for a T-type flip-flop, where > is the clock input, T is the toggle input and Q is the stored data output.
If the T input is high, the T flip-flop changes state ("toggles") whenever the clock input is strobed. If the T input is low, the flip-flop holds the previous value. This behavior is described by the characteristic equation:
Q_{next} = T \oplus Q (or, without benefit of the XOR operator, the equivalent: Q_{next} = T\overline{Q} + \overline{T}Q )
and can be described in a truth table:
T Q Qnext Comment
0 0 0     hold state (no change)
0 1 1     hold state (no change)
1 0 1     toggle
1 1 0     toggle
When T is held high, the toggle flip-flop divides the clock frequency by two; that is, if the clock frequency is 4 MHz, the output frequency obtained from the flip-flop will be 2 MHz. This 'divide-by-two' feature has applications in various types of digital counters. A T flip-flop can also be built from a JK flip-flop (with the J and K pins connected together to act as T) or from a D flip-flop (with the T input and the previous Q fed to the D input through an XOR gate).
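A minimal Python sketch of this divide-by-two behaviour (an added illustration, not from the article): with T held high, the flip-flop toggles on every active clock edge, so the output completes one cycle for every two clock cycles.

def t_flipflop(q, t):
    """Characteristic equation: Q_next = T xor Q."""
    return q ^ t

q, output = 0, []
for _ in range(8):            # eight active clock edges
    q = t_flipflop(q, t=1)    # T held high: toggle on every edge
    output.append(q)
print(output)                 # [1, 0, 1, 0, ...]: one output cycle per two clock edges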
JK flip-flop
JK flip-flop timing diagram
The JK flip-flop augments the behavior of the SR flip-flop by interpreting the S = R = 1 condition as a "flip" or toggle command. Specifically, the combination J = 1, K = 0 is a command to set the flip-flop; the combination J = 0, K = 1 is a command to reset the flip-flop; and the combination J = K = 1 is a command to toggle the flip-flop, i.e., change its output to the logical complement of its current value. Setting J = K = 0 holds the current state; it does not make the device behave like a D flip-flop. To synthesize a D flip-flop, simply set K equal to the complement of J. The JK flip-flop is therefore a universal flip-flop, because it can be configured to work as an SR flip-flop, a D flip-flop, or a T flip-flop. Note: the flip-flop shown here is positive-edge triggered (on the clock pulse), as seen in the timing diagram.
A circuit symbol for a JK flip-flop, where > is the clock input, J and K are data inputs, Q is the stored data output, and Q' is the inverse of Q.
The characteristic equation of the JK flip-flop is:
Q_{next} = J\overline Q + \overline KQ
and the corresponding truth table is:
J K Qnext Comment
0 0 Q_{prev}            hold state
0 1 0                   reset
1 0 1                   set
1 1 \overline{Q_{prev}} toggle
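The characteristic equation can be checked mechanically. The following Python sketch (an added illustration, not part of the source) evaluates Q_next = JQ' + K'Q for every input combination and reproduces the truth table above:

def jk_next(j, k, q):
    """Characteristic equation: Q_next = J*not(Q) + not(K)*Q."""
    return (j & (1 - q)) | ((1 - k) & q)

print("J K Q Qnext")
for j in (0, 1):
    for k in (0, 1):
        for q in (0, 1):
            print(j, k, q, jk_next(j, k, q))
# J=K=0 holds Q, J=1 K=0 sets, J=0 K=1 resets, J=K=1 toggles.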
The origin of the name for the JK flip-flop is detailed by P. L. Lindley, a JPL engineer, in a letter to EDN, an electronics design magazine. The letter is dated June 13, 1968, and was published in the August 1968 edition of the magazine. In the letter, Mr. Lindley explains that he heard the story of the JK flip-flop from Dr. Eldred Nelson, who coined the term while working at Hughes Aircraft. Flip-flops in use at Hughes at the time were all of the type that came to be known as J-K. In designing a logical system, Dr. Nelson assigned letters to flip-flop inputs as follows: #1: A & B, #2: C & D, #3: E & F, #4: G & H, #5: J & K.
Another theory holds that the set and reset inputs were given the symbols "J" and "K" after Jack Kilby, one of the engineers who helped design the J-K flip-flop.
D flip-flop
D flip-flop symbol
The Q output always takes on the state of the D input at the moment of a rising clock edge, and never at any other time.[4] It is called the D flip-flop for this reason: the output takes the value of the D (data) input and delays it by one clock cycle. The D flip-flop can be interpreted as a primitive memory cell, zero-order hold, or delay line.
Truth table:
Clock        D  Qnext
Rising edge  0  0
Rising edge  1  1
Non-rising   X  Q (unchanged)
('X' denotes a don't-care condition, meaning the signal is irrelevant.)
These flip-flops are very useful, as they form the basis for shift registers, which are an essential part of many electronic devices. The advantage of the D flip-flop over the D-type latch is that it "captures" the signal at the moment the clock goes high, so subsequent changes on the data line do not influence Q until the next rising clock edge. One exception: some flip-flops have a 'reset' input, which resets Q (to zero) and may be either asynchronous or synchronous with the clock.
3-bit shift register
The above circuit shifts the contents of the register to the right by one bit position on each active transition of the clock; the input X is shifted into the leftmost bit position.
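A small Python sketch of the shift-register behaviour just described (an added illustration; the bit ordering is an assumption): on each clock edge every D flip-flop captures the output of the stage to its left, while the external input X enters the leftmost position.

def shift_right(register, x):
    """One clock edge of a right-shifting register built from D flip-flops.
    register[0] is the leftmost bit; the rightmost bit is shifted out and lost."""
    return [x] + register[:-1]

reg = [0, 0, 0]            # 3-bit register, initially cleared
for x in [1, 0, 1, 1]:     # serial input applied on successive clock edges
    reg = shift_right(reg, x)
    print(x, reg)
# After four edges the register holds the three most recent input bits: [1, 1, 0]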
Master-slave D flip-flop
A master-slave D flip-flop is created by connecting two gated D latches in series, and inverting the enable input to one of them. It is called master-slave because the second latch in the series only changes in response to a change in the first (master) latch.
A master slave D flip flop. It responds on the negative edge of the enable input (usually a clock).
For a positive-edge triggered master-slave D flip-flop, when the clock signal is low (logical 0) the “enable” seen by the first or “master” D latch (the inverted clock signal) is high (logical 1). This allows the “master” latch to store the input value when the clock signal transitions from low to high. As the clock signal goes high (0 to 1) the inverted “enable” of the first latch goes low (1 to 0) and the value seen at the input to the master latch is “locked”. Nearly simultaneously, the twice inverted “enable” of the second or “slave” D latch transitions from low to high (0 to 1) with the clock signal. This allows the signal captured at the rising edge of the clock by the now “locked” master latch to pass through the “slave” latch. When the clock signal returns to low (1 to 0), the output of the "slave" latch is "locked", and the value seen at the last rising edge of the clock is held while the “master” latch begins to accept new values in preparation for the next rising clock edge.
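The two-latch construction described above can be mimicked in a few lines of Python (an added sketch, not the article's; the class and signal names are invented). A gated D latch is transparent while its enable is high; the master sees the inverted clock and the slave sees the true clock, so the output only updates on the rising edge:

class GatedDLatch:
    """Transparent while its enable is high; holds its value while enable is low."""
    def __init__(self):
        self.q = 0
    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q

master, slave = GatedDLatch(), GatedDLatch()

def ms_dff(d, clk):
    """Positive-edge-triggered master-slave D flip-flop built from two gated latches."""
    m = master.update(d, enable=(clk == 0))    # master follows D while the clock is low
    return slave.update(m, enable=(clk == 1))  # slave follows the master while the clock is high

# D changes while the clock is low; Q only updates after a rising clock edge.
for d, clk in [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]:
    print(f"D={d} CLK={clk} Q={ms_dff(d, clk)}")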
An implementation of a master-slave D flip-flop that is triggered on the positive edge of the clock.
By removing the left-most inverter in the above circuit, a D-type flip-flop that strobes on the falling edge of the clock signal can be obtained. Its truth table is:
D Q > Qnext
0 X Falling 0
1 X Falling 1
Most D-type flip-flops in ICs have the capability to be set and reset, much like an SR flip-flop. Usually, the illegal S = R = 1 condition is resolved in D-type flip-flops.
Inputs Outputs
S R D > Q Q'
0 1 X X 0 1
1 0 X X 1 0
1 1 X X 1 1
By setting S = R = 0, the flip-flop can be used as described above.
Edge-triggered D flip-flop
A D flip-flop can also be built in a more efficient, directly edge-triggered way; the circuit is not as easy to understand, but it behaves the same way externally. While the master-slave D flip-flop is also triggered on the edge of a clock, its component latches are each triggered by clock levels. The "edge-triggered D flip-flop" proper does not have this master-slave structure.
A positive-edge-triggered D flip-flop.
Uses
* A single flip-flop can be used to store one bit, or binary digit, of data.
* Static RAM, which is the type of memory used in registers and in many caches in computers, is built out of flip-flops.
* Any one of the flip-flop types can be used to build any of the others.
* The data contained in several flip-flops may represent the state of a sequencer, the value of a counter, an ASCII character in a computer's memory or any other piece of information.
* One use is to build finite state machines from electronic logic. The flip-flops remember the machine's previous state, and digital logic uses that state to calculate the next state.
* The T flip-flop is useful for constructing various types of counters. Repeated signals to the clock input will cause the flip-flop to change state once per high-to-low transition of the clock input, if its T input is "1". The output from one flip-flop can be fed to the clock input of a second, and so on. The final output of the circuit, considered as the array of outputs of all the individual flip-flops, is a count, in binary, of the number of cycles of the first clock input, up to a maximum of 2^n - 1, where n is the number of flip-flops used (a minimal sketch of such a ripple counter follows this list). See: Counters
* One of the problems with such a counter (called a ripple counter) is that the output is briefly invalid as the changes ripple through the logic. There are two solutions to this problem. The first is to sample the output only when it is known to be valid. The second, more widely used, is to use a different type of circuit called a synchronous counter. This uses more complex logic to ensure that the outputs of the counter all change at the same, predictable time. See: Counters
* Frequency division: a chain of T flip-flops as described above will also divide an input frequency by 2^n, where n is the number of flip-flops used between the input and the output.
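As referenced in the counter item above, here is a minimal Python sketch (an added illustration, not from the article) of an n-stage ripple counter: each T flip-flop toggles on the high-to-low transition of the previous stage's output, and the stage outputs read together give the binary count.

def ripple_counter(n_stages, n_clock_cycles):
    """Ripple counter built from T flip-flops with T held high: each stage toggles
    on the high-to-low transition of the previous stage's output."""
    q = [0] * n_stages
    counts = []
    for _ in range(n_clock_cycles):
        carry = True                   # an active edge arrives at stage 0 each cycle
        for i in range(n_stages):
            if not carry:
                break
            previous = q[i]
            q[i] ^= 1                  # this stage toggles
            carry = (previous == 1)    # its output fell from 1 to 0, clocking the next stage
        counts.append(sum(bit << i for i, bit in enumerate(q)))
    return counts

print(ripple_counter(3, 10))   # [1, 2, 3, 4, 5, 6, 7, 0, 1, 2] - wraps after 2**3 - 1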
Timing and metastability
A flip-flop in combination with a Schmitt trigger can be used for the implementation of an arbiter in asynchronous circuits.
Clocked flip-flops are prone to a problem called metastability, which happens when a data or control input is changing at the instant of the clock pulse. The result is that the output may behave unpredictably, taking many times longer than normal to settle to its correct state, or even oscillating several times before settling. Theoretically it can take infinite time to settle down. In a computer system this can cause corruption of data or a program crash.
Flip-flop setup, hold and clock-to-output timing parameters.
The metastability in flip-flops can be avoided by ensuring that the data and control inputs are held valid and constant for specified periods before and after the clock pulse, called the setup time (tsu) and the hold time (th) respectively. These times are specified in the data sheet for the device, and are typically between a few hundred picoseconds and a few nanoseconds for modern devices.
Unfortunately, it is not always possible to meet the setup and hold criteria, because the flip-flop may be connected to a real-time signal that could change at any time, outside the control of the designer. In this case, the best the designer can do is to reduce the probability of error to a certain level, depending on the required reliability of the circuit. One technique for suppressing metastability is to connect two or more flip-flops in a chain, so that the output of each one feeds the data input of the next, and all devices share a common clock. With this method, the probability of a metastable event can be reduced to a negligible value, but never to zero. The probability of metastability gets closer and closer to zero as the number of flip-flops connected in series is increased.
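A sketch of the synchronizer chain described above (an added illustration, under the simplifying assumption that metastability itself is not modeled and each extra stage simply adds one clock period for a marginal value to resolve):

class DFlipFlop:
    """Idealized positive-edge-triggered D flip-flop (metastability is not modeled)."""
    def __init__(self):
        self.q = 0
    def clock(self, d):
        self.q, previous = d, self.q
        return previous    # the value the flip-flop held during the cycle that just ended

def synchronizer(samples, n_stages=2):
    """Pass an asynchronous signal through n_stages flip-flops sharing one clock."""
    chain = [DFlipFlop() for _ in range(n_stages)]
    out = []
    for sample in samples:            # one entry per clock edge
        value = sample
        for ff in chain:
            value = ff.clock(value)   # each stage adds one clock period of latency
        out.append(value)
    return out

print(synchronizer([0, 1, 1, 0, 1, 0, 0]))   # the input, delayed by two clock periods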
So-called metastable-hardened flip-flops are available, which work by reducing the setup and hold times as much as possible, but even these cannot eliminate the problem entirely. This is because metastability is more than simply a matter of circuit design. When the transitions in the clock and the data are close together in time, the flip-flop is forced to decide which event happened first. However fast we make the device, there is always the possibility that the input events will be so close together that it cannot detect which one happened first. It is therefore logically impossible to build a perfectly metastable-proof flip-flop.
Another important timing value for a flip-flop is the clock-to-output delay (common symbol in data sheets: tCO) or propagation delay (tP), which is the time the flip-flop takes to change its output after the clock edge. The time for a high-to-low transition (tPHL) is sometimes different from the time for a low-to-high transition (tPLH).
When connecting flip-flops in a chain, it is important to ensure that the tCO of the first flip-flop is longer than the hold time (tH) of the second flip-flop, otherwise the second flip-flop will not receive the data reliably. The relationship between tCO and tH is normally guaranteed if both flip-flops are of the same type.
Flip-flop integrated circuits
Integrated circuits (ICs) exist that provide one or more flip-flops. For example, the 7473 dual JK master-slave flip-flop or the 74374, an octal D flip-flop, in the 7400 series.
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
Pipelining assumes that, even within a single instruction stream (SISD), successive instructions in a program sequence will overlap in execution, as suggested by the usual pipeline diagram (instructions 'i' on the vertical axis, time 't' on the horizontal axis).
Most modern CPUs are driven by a clock. Internally the CPU consists of logic and flip-flops. When the clock edge arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives, the flip-flops take their new values again, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces, the delay before the logic gives valid outputs is reduced, and so the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages, with a set of flip-flops between each pair of stages (a sketch of the resulting timeline follows the list):
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
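As referenced above, the following Python sketch (an added illustration, not part of the source) prints the stage occupied by each instruction on each clock cycle of an ideal five-stage pipeline with no stalls:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # fetch, decode, execute, memory access, write back

def pipeline_timeline(n_instructions):
    """Ideal five-stage pipeline: instruction k enters the pipeline on cycle k."""
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = []
        for cycle in range(total_cycles):
            stage = cycle - i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "..")
        print(f"i{i}: " + " ".join(f"{s:>3}" for s in row))

pipeline_timeline(4)   # each instruction starts one cycle after the previous one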
Hazards: When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards such as forwarding and stalling exist.
A non-pipeline architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU but making those modules work in parallel improves program execution significantly.
Processors with pipelining are organized internally into stages that can work semi-independently on separate jobs. Each stage is linked into a 'chain', so each stage's output is fed to the next stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality, most code does not allow for ideal execution.
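To make the last point concrete, here is a small Python sketch (an added illustration with made-up numbers) comparing the ideal speedup of a five-stage pipeline with what happens when a fraction of instructions insert a one-cycle stall:

def pipelined_cycles(n_instructions, n_stages, stall_rate=0.0, stall_penalty=1):
    """Cycles to run n_instructions on an n-stage pipeline; stall_rate is the
    fraction of instructions that insert stall_penalty bubble cycles."""
    fill = n_stages - 1                                # cycles to fill the pipeline
    stalls = n_instructions * stall_rate * stall_penalty
    return fill + n_instructions + stalls

n, stages = 1000, 5
unpipelined = n * stages                               # one stage's worth of work per cycle
print(f"ideal speedup:   {unpipelined / pipelined_cycles(n, stages):.2f}x")        # about 5x
print(f"with 30% stalls: {unpipelined / pipelined_cycles(n, stages, 0.3):.2f}x")   # noticeably less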
Advantages and Disadvantages
Pipelining does not help in all cases, and it comes with several disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This avoids branch delay penalties (in effect, every instruction is delayed equally) and the problems of serially dependent instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent, because extra flip-flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor has a stable instruction bandwidth; the performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
Examples
Generic pipeline
Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
To the right is a generic pipeline with four stages:
1. Fetch
2. Decode
3. Execute
4. Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline.
Execution is as follows:
Time Execution
0 Four instructions are awaiting execution
1
* the green instruction is fetched from memory
2
* the green instruction is decoded
* the purple instruction is fetched from memory
3
* the green instruction is executed (actual operation is performed)
* the purple instruction is decoded
* the blue instruction is fetched
4
* the green instruction's results are written back to the register file or memory
* the purple instruction is executed
* the blue instruction is decoded
* the red instruction is fetched
5
* the green instruction is completed
* the purple instruction is written back
* the blue instruction is executed
* the red instruction is decoded
6
* The purple instruction is completed
* the blue instruction is written back
* the red instruction is executed
7
* the blue instruction is completed
* the red instruction is written back
8
* the red instruction is completed
9 All instructions are executed
Bubble
A bubble in cycle 3 delays execution
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well but everything "ahead" of the purple instruction continues with execution.
Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7.
While a bubble passes through the pipeline, nothing useful happens in the fetch, decode, execute and writeback stages it occupies; it can be implemented as a nop (no-operation) inserted into the pipeline.
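A sketch (an added illustration) of the bubble behaviour in the four-stage pipeline above: delaying the second instruction's fetch by one cycle pushes it and everything behind it back by one cycle, so completion takes eight clock ticks instead of seven.

PIPE_STAGES = ["Fetch", "Decode", "Execute", "Write-back"]

def timeline(start_cycles, total_cycles):
    """start_cycles[i] is the cycle on which instruction i is fetched."""
    for i, start in enumerate(start_cycles):
        row = []
        for c in range(total_cycles):
            k = c - start
            row.append(PIPE_STAGES[k][:3] if 0 <= k < len(PIPE_STAGES) else "---")
        print(f"instr {i}: " + " ".join(row))

print("No bubble (7 clock ticks):")
timeline([0, 1, 2, 3], 8)
print("Bubble: instruction 1 fetched one cycle late, 2 and 3 pushed back (8 clock ticks):")
timeline([0, 2, 3, 4], 8)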
Example 1
A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction
The locations 'R1' and 'R2' are registers in the CPU. The values stored in memory locations labeled 'A' and 'B' are loaded (copied) into these registers, then added, and the result is stored in a memory location labeled 'C'.
In this example the pipeline is three stages long: load, execute, and store. Each of these steps is called a pipeline stage.
On a non-pipelined processor, only one stage can be working at a time, so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions: while one instruction is in the execute stage, the next instruction is in the load stage and the previous one is in the store stage.
Pipelining doesn't reduce the time it takes to complete an individual instruction; rather, it increases the number of instructions that can be in flight at once and raises the rate at which instructions complete, i.e. the throughput. The more pipeline stages a processor has, the more instructions it can work on at once and the smaller the delay between completed instructions. Every microprocessor manufactured today uses at least 2 pipeline stages (the Atmel AVR and the PIC microcontroller each have a 2-stage pipeline), and Intel Pentium 4 processors have 20-stage pipelines.
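A short Python sketch (an added illustration, using the three-stage load/execute/store pipeline of this example) of the latency-versus-throughput point made above:

def latency_and_throughput(n_instructions, n_stages):
    """An instruction still needs n_stages cycles from start to finish (latency),
    but once the pipeline is full one instruction completes every cycle (throughput)."""
    total_cycles = (n_stages - 1) + n_instructions
    return n_stages, n_instructions / total_cycles

latency, throughput = latency_and_throughput(100, 3)   # 100 instructions, load/execute/store
print(f"latency per instruction: {latency} cycles")           # still 3
print(f"instructions completed per cycle: {throughput:.2f}")  # approaches 1.0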
Example 2
To better visualize the concept, we can look at a theoretical 3-stage pipeline:
Stage Description
Load Read instruction from memory
Execute Execute instruction
Store Store result in memory and/or registers
and a pseudo-code assembly listing to be executed:
LOAD #40, A ; load 40 in A
MOVE A, B ; copy A in B
ADD #20, B ; add 20 to B
STORE B, 0x300 ; store B into memory cell 0x300
This is how it would be executed:
Clock 1:  Load = LOAD;  Execute = (idle);  Store = (idle)
The LOAD instruction is fetched from memory.
Clock 2:  Load = MOVE;  Execute = LOAD;  Store = (idle)
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.
Clock 3:  Load = ADD;  Execute = MOVE;  Store = LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in register A. In the meantime, the MOVE instruction is being executed; since it must move the contents of A into B, it must wait for the LOAD instruction to finish.
Clock 4:  Load = STORE;  Execute = ADD;  Store = MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing and the ADD is being calculated.
And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.
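The clock-by-clock walkthrough above can be reproduced by a small Python simulation (an added illustration, not the article's), which slides the four instructions of the listing through the Load, Execute and Store stages:

program = ["LOAD #40, A", "MOVE A, B", "ADD #20, B", "STORE B, 0x300"]
stages = ["Load", "Execute", "Store"]

pipe = [None] * len(stages)                 # pipe[0]=Load, pipe[1]=Execute, pipe[2]=Store
to_issue = list(program)
n_cycles = len(program) + len(stages) - 1   # four instructions plus two cycles to drain

for clock in range(1, n_cycles + 1):
    # On each clock every instruction advances one stage; the one in Store retires.
    pipe = [to_issue.pop(0) if to_issue else None] + pipe[:-1]
    print(f"Clock {clock}: " + "  ".join(f"{s}={op if op else '-'}" for s, op in zip(stages, pipe)))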
Complications
Many designs include pipelines as long as 7, 10 and even 20 stages (as in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long [1]. The downside of a long pipeline is that when a program branches, the entire pipeline must be flushed, a problem that branch prediction helps to alleviate. Branch prediction itself can end up exacerbating the problem if branches are predicted poorly. In certain applications, such as supercomputing, programs are specially written to branch rarely, so very long pipelines are ideal to speed up the computations, as long pipelines are designed to reduce clocks per instruction (CPI). If branching happens constantly, re-ordering branches so that the instructions more likely to be needed are placed into the pipeline can significantly reduce the speed losses associated with having to flush mispredicted branches. Programs such as gcov can be used to examine how often particular branches are actually executed, using a technique known as coverage analysis; however, such analysis is often a last resort for optimization.
The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even be slightly worse, if all but one of the pipeline stages are idle and a small overhead is present between stages.
Because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates in the code very near the current location of execution may not take effect because they are already loaded into the Prefetch Input Queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.