Vector Processor - Description

Description

In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be (in theory at least) encoded directly into the instruction. However, in an efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of memory takes some time. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Memory wall.

In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations much faster than if it did so one at a time.
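The difference between latency and throughput can be made concrete with a little arithmetic. The sketch below assumes an idealized pipeline: three stages (decode, fetch, execute, as in the paragraph above), one cycle per stage, and no stalls. The function names are illustrative only.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one finishes a cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


print(unpipelined_cycles(10, 3))  # 30
print(pipelined_cycles(10, 3))    # 12
```

Each instruction still takes three cycles from start to finish in both cases; only the total time for the batch shrinks.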

Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, a vector processor reads a single instruction from memory and "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time.

To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:

    execute this loop 10 times
        read the next instruction and decode it
        fetch this number
        fetch that number
        add them
        put the result here
    end loop

But to a vector processor, this task looks considerably different:

    read instruction and decode it
    fetch these 10 numbers
    fetch those 10 numbers
    add them
    put the results here

There are several savings inherent in this approach. For one, only two address translations are needed. Depending on the architecture, this can represent a significant saving by itself. Another saving is in fetching and decoding the instruction itself, which has to be done only once instead of ten times. The code itself is also smaller, which can lead to more efficient memory use.
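The saving in instruction decoding can be illustrated with a toy model. This is a minimal sketch rather than a description of any real hardware: `scalar_add` and `vector_add` are hypothetical names, and the `decodes` counter simply tallies how often the "add" instruction would have to be read and decoded.

```python
def scalar_add(a, b):
    """Element-at-a-time loop: the add instruction is decoded once per element."""
    decodes = 0
    result = []
    for x, y in zip(a, b):
        decodes += 1          # read and decode the add instruction again
        result.append(x + y)  # fetch both numbers, add them, store the result
    return result, decodes


def vector_add(a, b):
    """Vector-style add: one instruction covers the whole range of elements."""
    decodes = 1               # the single vector instruction is decoded once
    result = [x + y for x, y in zip(a, b)]
    return result, decodes


a = list(range(10))
b = list(range(10, 20))
print(scalar_add(a, b)[1])  # 10
print(vector_add(a, b)[1])  # 1
```

Both functions produce the same sums; only the number of decode steps differs.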

But more than that, a vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can improve performance by avoiding stalls.
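The effect of several functional units can be sketched in the same toy style. The number of units (`lanes`) and the function name are assumptions made for illustration; the model only counts how many steps are needed when each step adds up to `lanes` independent pairs at once.

```python
def vector_add_lanes(a, b, lanes):
    """Add two equal-length lists, handling up to `lanes` element pairs per step."""
    result = []
    steps = 0
    for start in range(0, len(a), lanes):
        # Pairs within one step are independent, so no dependency check is modelled.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return result, steps


a = list(range(10))
b = list(range(10, 20))
print(vector_add_lanes(a, b, lanes=1)[1])  # 10
print(vector_add_lanes(a, b, lanes=4)[1])  # 3
```

With four units, ten additions need three steps instead of ten, because the last step handles the remaining two pairs.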

As mentioned earlier, the Cray implementations took this a step further, allowing several different types of operations to be carried out at the same time. Consider code that adds two numbers and then multiplies by a third; in the Cray, these would all be fetched at once, and both added and multiplied in a single operation. Using the pseudocode above, the Cray did:

    read instruction and decode it
    fetch these 10 numbers
    fetch those 10 numbers
    fetch another 10 numbers
    add and multiply them
    put the results here

The math operations thus completed far faster overall, the limiting factor being the time required to fetch the data from memory.
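The combined add-and-multiply can also be modelled in a few lines. This is only a sketch of the idea, not of the Cray hardware: the function names are hypothetical, and the returned pass count stands in for how many times the data has to be streamed through the arithmetic units.

```python
def separate_passes(a, b, c):
    """Add first, keep the intermediate vector, then multiply in a second pass."""
    sums = [x + y for x, y in zip(a, b)]          # pass 1: add
    products = [s * z for s, z in zip(sums, c)]   # pass 2: multiply
    return products, 2


def combined_pass(a, b, c):
    """Feed each sum straight into the multiply, so the data is traversed once."""
    products = [(x + y) * z for x, y, z in zip(a, b, c)]
    return products, 1


a = [1, 2, 3]
b = [4, 5, 6]
c = [2, 2, 2]
print(separate_passes(a, b, c))  # ([10, 14, 18], 2)
print(combined_pass(a, b, c))    # ([10, 14, 18], 1)
```

The results are identical; the combined version avoids storing and re-reading the intermediate sums, which is why memory fetch time becomes the limiting factor.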

Not all problems can be attacked with this sort of solution. Adding these sorts of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower, so the processor pays a penalty whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of more common instructions such as normal adding.

In fact, vector processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, which were themselves generally found in places such as weather prediction centres and physics labs, where huge amounts of data are "crunched".
