As you may recall, in my previous column, I started to consider why the 30-year-old eight-bit 8051 microcontroller remains a valuable player in the embedded systems game. The days, months, and years go by, but this tiny microcontroller remains incredibly popular.
After that column, many members of All Programmable Planet posted comments or sent emails asking how it was possible for us to redesign the architecture to achieve such large performance improvements. Sadly, I have to say that much of this information is classified. If I were to tell you, my colleagues might cause me some bodily harm. On the other hand, I know that if I don't explain something about what we did to achieve the world's fastest 8051, you (or more likely Max, our editor in chief) will make my life painful, so here's at least part of the story.
Several improved versions of the 8051 are available right now. Some are available to any engineer who wishes to use them in a project; others are produced as custom processors for internal use only. Many of these 8051 variants were designed by different IP vendors, which typically means different internal architectures; the main requirement is that they all remain compatible with the original 8051's instruction set. Each of these architectures may offer a different improvement factor over the traditional 8051. (This is one reason I'm going to write in my next column about some of the things that should be considered when purchasing IP cores from third-party vendors.)
How is it possible that modern versions of this microcontroller can execute the same instruction set and be clocked at the same frequency, yet offer much higher performance?
Designers who work on a new architecture for the 8051 must decompose each and every instruction executed by the processor into its basic operations. For example, the original processor architecture required 12 clock cycles to execute even the simplest instruction, such as a NOP (no operation). More complex operations required some multiple of 12 cycles. To speed up the instruction execution flow, we determine the actions a particular instruction must perform, and then we consider how to design the ALU (arithmetic logic unit) and the control unit responsible for internal operations so that the time required to execute each instruction is as short as possible.
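This per-instruction bookkeeping can be sketched in a few lines of C. The cycle counts below are the ones quoted in this column (12 cycles for a NOP or ADD and 48 for a MUL or DIV on the original core, versus the reduced counts we arrived at); the `struct` and function names are just illustrative, not anything from an actual design database.

```c
#include <assert.h>

/* Per-instruction cycle counts: the classic 8051 versus a redesigned
 * core, using the figures discussed in this column. */
struct cycle_pair {
    const char *mnemonic;
    unsigned original;   /* cycles on the classic 8051   */
    unsigned improved;   /* cycles on a redesigned core  */
};

static const struct cycle_pair table[] = {
    { "NOP", 12, 1 },
    { "ADD", 12, 1 },
    { "MUL", 48, 2 },
    { "DIV", 48, 6 },
};

/* Per-instruction speedup factor at the same clock frequency. */
static unsigned speedup(const struct cycle_pair *p)
{
    return p->original / p->improved;
}
```

Running through the table this way makes the arithmetic obvious: cutting a 48-cycle MUL to two cycles is a 24x improvement on that one instruction alone.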
In the case of the NOP instruction, one has to ask, "Why should an instruction that doesn't actually do anything require so much time to execute?" I mean, 12 clock cycles to do nothing? Since we are in the depths of a worldwide economic crisis, this is far too long. Let's cut the execution time to a single clock cycle.
Next, we look at more complex operations like the MUL (multiplication) instruction, which consumed a humongous 48 clock cycles in the original architecture. We must analyze what this instruction is required to do, ask what is required to perform an eight-by-eight bit multiplication in binary, and then try to find a better solution. Why not execute several non-overlapping steps in a single clock period? Our machinations reduced the MUL instruction to only two clock cycles. Similarly, we reduced a DIV (division) instruction from 48 clock cycles to six and an ADD (addition) instruction from 12 clock cycles to one.
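To see where those 48 cycles came from, it helps to look at what an eight-by-eight-bit binary multiply actually involves. The C model below shows the classic shift-and-add decomposition: one partial-product step per bit of the multiplier. The original 8051 effectively serialized steps like these over 48 clocks; a redesigned datapath can evaluate several non-overlapping steps in a single clock period. This sketch only models the arithmetic, not the hardware timing.

```c
#include <stdint.h>

/* Shift-and-add model of an 8-by-8-bit multiply producing a 16-bit
 * result. Each loop iteration corresponds to one partial-product
 * step that hardware could fold together with its neighbors. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    uint16_t shifted = a;            /* multiplicand, shifted left each step */

    for (int bit = 0; bit < 8; bit++) {
        if (b & (1u << bit))         /* add this partial product if bit set */
            product += shifted;
        shifted <<= 1;
    }
    return product;
}
```

Eight such steps, executed four at a time, is how you get from dozens of cycles down to two.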
And so it goes, instruction by instruction. Of course, we can also add some features that help the CPU fetch instructions and data from memory. For example, we can use additional DPTRs (data pointers) with automatic increment and decrement to speed up external memory addressing and access. The final result is to make the 8051 architecture approximately 15 times faster than the original in the case of the DP8051, and more than 26 times faster in the case of the DQ8051, when running at the same clock frequency. (The new architectures can also be clocked at up to 25 times the frequency of the original 8051.)
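Why does a second auto-incrementing data pointer matter? Consider copying a block through external memory: you need one pointer for the source and one for the destination. With a single DPTR, the classic 8051 must reload its one 16-bit pointer register for every byte moved; with two auto-incrementing DPTRs, each byte is simply "fetch, store, both pointers bump themselves." The plain C sketch below mimics that access pattern; real dual-DPTR parts expose the pointers through special function registers whose names and addresses are vendor-specific, so treat this purely as a model.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of a dual-DPTR block copy: src plays the role of one data
 * pointer and dst the other, and the "++" on each is the hardware's
 * automatic post-increment. No pointer reloads inside the loop. */
static void block_copy(const uint8_t *src, uint8_t *dst, size_t len)
{
    while (len--)
        *dst++ = *src++;
}
```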
The bottom line is that there are as many different approaches to improving the 8051 architecture as there are IP vendors working on the problem. Purely for the sake of interest, here are some comparisons of alternative implementations. First, let's consider performance (or processing power) as measured in DMIPS/MHz.
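For readers who haven't bumped into the DMIPS/MHz metric before: DMIPS (Dhrystone MIPS) divides the measured Dhrystones-per-second by 1,757, the score of the reference VAX 11/780, and dividing again by the clock frequency normalizes the result so cores running at different clock rates can be compared fairly. A minimal sketch of the arithmetic, with made-up input figures:

```c
/* DMIPS/MHz: Dhrystone throughput normalized first against the
 * VAX 11/780 reference score (1,757 Dhrystones/sec = 1 DMIPS),
 * then against the clock frequency. */
static double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    double dmips = dhrystones_per_sec / 1757.0;
    return dmips / clock_mhz;
}
```

A core that scores twice the Dhrystones-per-second at the same clock frequency therefore shows exactly twice the DMIPS/MHz, which is the whole point of the metric.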
Next, let's consider the silicon area required to implement the different cores. In this case, we will use the number of equivalent ASIC (two-input NAND) gates as our metric. These values are readily available, and the area on the silicon for an ASIC/SoC, or the amount of lookup table (LUT) resources consumed in an FPGA, is a function of the number of equivalent gates.
I don't know about you, but like many people, I find it easier to visualize what these numbers mean when they are presented in the form of a graphical image, as shown below.
In both cases, the DQ8051 is approximately 21 percent smaller (that is, it requires 21 percent fewer equivalent gates) than its competitors. Of course, this also reduces power consumption.
I hope that I've given you at least a hint as to what we did to create our ultra-fast 8051 architecture without giving away too many secrets. If I have said too much, the next time you see me, I may be walking in a strange way and talking in a high-pitched voice. Keeping this in mind, are there any other 8051-related questions you would like to ask me?