Reverse-engineering the Intel 8086's instruction register

kens · on Aug 8, 2020

Author here if anyone has questions about the 8086. What part of the 8086 chip should I write about next?

anyfoo · on Aug 9, 2020

Not directly related to the 8086 per se, but should you ever have a look at the (80)386, a question that I always had was whether the perpetually reserved control register CR1 (CR2 and CR3 already appeared in the 386 as well, and later generations added CR4, but CR1 remained reserved) ever had any meaning that could actually be found in hardware.

I went so far as asking the chief engineer of the 386, but only got a short answer back confirming that it is “reserved”. The closest I got to a promising answer was when talking with the maintainer of os2museum.com, who, based on pre-release data sheets of the 386, speculated that it may have been a control register for the removed on-chip cache. But I’m still wondering if accesses to CR1 are just hardcoded to be undefined instructions, or if there is something somewhere.

kens · on Aug 9, 2020

Interesting question, but the 80386 is way beyond my abilities and my microscope.

anyfoo · on Aug 9, 2020

I understand, just for the very unlikely case that a clear depiction of the control register file comes up somewhere or something like that.

Thank you for the great series! I’ve been a long time reader, and the 8086 and its more immediate successors have a special place in my heart.

raphlinus · on Aug 8, 2020

I personally would be interested in a tour of "rep movsb", as I consider it one of the interesting (and enduring) features of x86. I figure this'll take you pretty deep into microcode, especially the register-dependent sequencing.

messe · on Aug 8, 2020

I've done my fair share of 8086 programming (mostly bootloader stuff and some DOS coding for fun), this is (along with the cmpsb, scasb, stosb, insb, lodsb, outsb instructions) probably the most CISCy instruction(s) the x86 has, especially when you consider that the rep prefix can use the ZF flag (repe, repne) to decide whether to repeat as well as the value in CX. I'd love to see a write-up on it.

zwegner · on Aug 8, 2020

That would indeed be a pretty good topic to hear about.

An interesting story related to rep movsb was how the original Pentium-based Larrabee supported gather/scatter, as described in this slide deck from Tom Forsyth: http://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%... (gather/scatter info on slides 48-61, important background about virtual memory/pipeline in slides 21-28)

A summary:

The original Pentium had a very simple pipeline that could only access one cache line per instruction, and had to determine if there was a page fault early in the pipeline. This presented a problem for the Larrabee design, which needed to support gathers/scatters (which are vectorized loads/stores, and can read/write up to 16 different cache lines in one instruction). Additionally, the Pentium microcode system was quite slow, and gather/scatter performance was very important for many workloads.

To solve this, they looked at how rep movsb (which can read/write an arbitrary number of cache lines) works: as it executes, rep movsb modifies the cx, si, and di registers in place (or their 32/64-bit counterparts). This side effect actually helps keep the implementation simple, because if an interrupt occurs in the middle of the instruction, the registers keep all the state needed to continue executing after control returns to the instruction--it just keeps copying from where it left off.

So gathers/scatters used the mask registers from the Larrabee vector instructions to keep state: a bit in the mask indicates that a vector lane still needs to be read/written. The instruction is then run in a manual loop (not a microcoded rep prefix) until all bits in the mask are cleared. Interestingly, this implementation can be faster than gather/scatter in a modern AVX-512 out-of-order core: whenever multiple vector lanes point to addresses in the same cache line, those loads/stores can all execute at once; in contrast, in the big cores gather/scatter are split into one load/store micro-op for each lane, that each execute independently. And the masking isn't taken into account before the uop split, so gathering with only a single address unmasked can still take more than 16 uops (see https://mobile.twitter.com/trav_downs/status/122333322625875...)

monocasa · on Aug 8, 2020

You see similar constructs in about any system where the system both touches multiple addresses in a single instruction, and can take an interrupt somewhere in the middle of that instruction (whether that be an exception generated by the instruction, or just to keep external interrupt latency down).

The lowly Cortex-M0 exposes the progress of load/store multiple instructions in architectural state so they can be restarted where they left off after an interrupt for example. They even do this with the multiplier too, so if you have the slow but tiny 32 cycle iterative multiplier in your design, you can still get single cycle interrupt latency.

The M68ks had a halfway mechanism where it would just barf up partially documented internal microcode state onto the stack in a chip version specific manner on exceptions in the middle of instructions so you didn't restart the full instruction. Probably the grossest thing about that architecture.

Taniwha · on Aug 9, 2020

Not quite:

The 68000 didn't do this, but couldn't restart a page fault.

The 68010 didn't either (but could restart a page fault).

The 68020 and 030 did do this horrible thing - doing Unix recursive signals was pretty hard if not impossible. And you couldn't copy this stuff to the user stack because it wasn't documented and so therefore you couldn't validate it when you pulled it back into the kernel.

The 68040 was sane again (and I presume subsequent 68ks)

Really this is part of the CISC vs. RISC thing - RISC instructions tend to have 1 side effect only, either they run to completion, or not at all, but CISC instructions can have multiple side effects - consider the infamous pdp11 instruction "mov -(pc), -(pc)" 3 side effects - 68k instructions are more complex multiple memory indirects, many possible faults, all that crud on the stack represents half done stuff

beagle3 · on Aug 8, 2020

The 8088 (maybe 8086 too) had a bug where an interrupt during a “rep movsb” would return to the rep, so any prefixes were lost.

You had to do e.g. “rep cs: movsb” instead of “cs: rep movsb” because the latter would switch to ds: on return from interrupt. Fun times.

dt3ft · on Aug 8, 2020

Thank you for doing this. 8086 was my very first chip and I used it to run my qBasic "hello world" program on an IBM PC XT. Back then, I thought I was somewhat smart for being able to write a simple program, without giving much thought to the amazing work that went into creating these chips and related technology. I see what you're doing with reverse engineering and documenting as a very important part in preserving knowledge which is otherwise very likely to be forever lost due to IP, copyright and corporate policies. Thank you for your work. I hope you'll find strength and energy to continue.

fsckboy · on Aug 8, 2020

that was an 8088, which of course had the same core as the 8086, but 8 bit data paths to get on/off chip to save system costs at the cost of some speed.

dt3ft · on Aug 8, 2020

Thanks, this could be it. The clock frequency was mighty 4.77 MHz and boy was it fast when you typed "DIR" :)

robin_reala · on Aug 8, 2020

Ah good, it wasn’t just me that measured the increase in performance of 286 -> 386 -> 486 with DIR then!

LVB · on Aug 8, 2020

For me it was watching pfs:Write do spell check, which of course involved a status box showing each word as it was checked to show just how fast computers are :-)

dt3ft · on Aug 8, 2020

Haha, good old times :)

sabas123 · on Aug 8, 2020

I would die for an good proper view of the instruction decoder. While there are some (still not many) resources on microcode. I have yet to see an extensive look at the exact logic of the instruction decoder.

I don't know if the following is also the case for 8086 but recent cpu's have seperate decoding units for sub 4/5 bytes instructions. I would be interested in how it is known that an instruction is able to be decoded by those units of which multiple exist instead of the single full 15 byte decoder

kens · on Aug 8, 2020

I'm looking into the instruction decoder, but the 8086 is probably too early to answer the questions you're interested in. The 8086 patent [1] is a non-awful patent that provides a lot of information, including the instruction decoder.

For something slightly more recent, the book "Modern Processor Design" has 40 pages on the microarchitecture of the P6 processor. In brief, the P6 has three decoders running in parallel. An Instruction Length Decoder determines instruction boundaries before decoding, so the decoders receive aligned instructions. Only Decoder 0 can handle complex instructions that require microcode. It can issue 4 micro-ops per cycle, while limited-functionality Decoders 1 and 2 only issue 1 micro-op. The micro-ops are queued with up to 3 micro-ops issued per cycle for execution.

[1] https://patents.google.com/patent/US4449184

alblue · on Aug 8, 2020

I don’t think this is going to answer your question, but I gave a presentation at QCon London this year (back when in person conferences were a thing) on how CPU intervals work.

In terms of decoding instructions, there’s a stage in the pipeline which estimates the length of the (variable length) instructions and then passes is through to the next stage, which will then try and parse it. A failure here may cause a pipeline stall. Then there are other caches (like the loop stream decoder) which will bypass the instruction parsing for short loops, provided that the start of the loop is aligned to 16-bit. Processors can issue up to 4 wide (5 on recent intel CPUs) but some instructions may only be parsed by the first decoder. ISTR there’s more detail in Agner’s optimisation guides on this.

https://www.infoq.com/presentations/microarchitecture-modern...

Vogtinator · on Aug 8, 2020

What I found really interesting about the 6502 reverse engineering is how instructions are sequenced and what the PLA contains. That way some bugs and undocumented/illegal opcodes could be explained (https://www.pagetable.com/?p=39). I read that the 8086 uses some more "random logic" for some simpler instructions, but I'd really like to know how the more complex instructions (stosb, iret) look internally. I assume that needs some good understanding of all the other parts first though.

madmoose · on Aug 8, 2020

I won't tell you what to work on next but I would be very interested in an analysis of the 8086/8088 instruction decoding and microcode structure. TBH I watch and read everything you do with great interest.

simmons · on Aug 8, 2020

I'll second this. From the original article:

> This is only a piece of the 8086's complex instruction handling. Other latches hold pieces of the instruction indicating register usage and the ALU operation, while a separate circuit controls the microcode engine...

I've always been amazed at how complex the 8086 instruction encoding is, especially compared to, say, the 6502 (which admittedly is a much less sophisticated microprocessor). I especially think this whenever I play around with a toy 8086 emulator I've written. :) It's never struck me as being greatly elegant from a software perspective, but I assume there are some good electrical engineering reasons. (And I'm intrigued about the mention of microcode... I always assumed that came later in the x86 series.)

kens · on Aug 8, 2020

I'm working on the instruction decoding and the microcode. (The microcode ROM takes up a large part of the chip.) The 8086 patent and some articles discuss the microcode in some detail, but not enough to understand it. So I'm figuring out the microcode encoding.

As for the 8086 instruction set, it's kind of a jumble due to the need for backwards compatibility (of a sort) with the 8080. The 8086 instruction set doesn't make things easy for the chip implementation. Intel tried to make a nicer instruction set (with the i432 and then Itanium) but it didn't work out.

PhantomGremlin · on Aug 8, 2020

The 8086 instruction set doesn't make things easy for the chip implementation. Intel tried to make a nicer instruction set (with the i432 and then Itanium) but it didn't work out.

The i432 instruction set was, most emphatically, not nicer for implementation. E.g. it had bit-aligned variable-length instructions.

Let me repeat that, because it's totally insane:

   bit-aligned variable-length instructions[1]

I knew quite a few people who worked on the i432 implementation, and architectural decisions like that nearly drove them insane.

Intel top management at the time didn't have people who understood what was happening, and consequently let the i432 architects "play in the sand".

I always enjoyed that witty pun: "play in the sand" is what young children do in sandboxes; "sand" is mostly silicon dioxide and is often slang for what silicon chips are made of.

[1] https://en.wikipedia.org/wiki/IAPX432#The_project's_failures

kens · on Aug 8, 2020

You are definitely correct. I meant that the instruction was nicer in a holistic sense, not better for implementation. (RISC is the way to go if you want to make the implementation easy. It's remarkable how straightforward the ARM1 chip is, for instance.)

I encourage people to read about the i432 processor, to get an idea of how wildly different a processor design can be. Among other things, objects are built into the processor. Object pointers are created by the processor, so you can't make an out-of-bounds pointer. (In other words, many current security issues don't exist.) The processor includes garbage collection, in hardware.

nullc · on Aug 9, 2020

You should checkout S390x-- I've been chasing an issue with crypto code being non-constant time because the instruction set includes the functional equivalent of memcmp and gcc emits it liberally e.g. for integer comparisons.

... and it's probably one of the least weird ultra-ciscy instructions that architecture has.

kps · on Aug 9, 2020

Itanium seems to me a demonstration that Intel learned absolutely nothing from the i860: high theoretical peak performance, but absolutely useless in practice. (I still think BiiN / i960MX was relatively nice, and the i960CA showed that a modern implementation would work.)

userbinator · on Aug 8, 2020

I thought the opposite --- the 6502's instruction encoding is the odd one because it's constrained by what a single PLA can do, while the 8086 and its predecessors (including the Z80 and going back to the 8008) have a pretty consistent octal-based encoding:

https://news.ycombinator.com/item?id=13045558

http://www.z80.info/decoding.htm

kens · on Aug 8, 2020

I'd categorize things a bit differently. The chips all have an octal encoding, although the 6502 groups the bits from the left (so the encoding is even less useful). (The strange thing is these chips are always documented in hexadecimal, which obscures most of the structure.) The Z-80, like the 6502, uses a PLA, while the 8068 uses a weird hybrid PLA-ROM for microcode.

All the chips have a lot of internal structure to their opcodes, since certain bit fields specify the ALU operation, register, etc. Overall, I'd say the 6502 is the most structured, both because its instruction set is simple and because they didn't have backwards-compatibility to worry about. The 8086 has a lot of special-case instructions wedged into random spots, as well as a lot of inconsistency. For instance, sometimes bits 5-7 specify the register, sometimes bits 2-4, sometimes bits 3-4. And then there are the multi-byte "group" instructions tossed into the mix.

bonzini · on Aug 9, 2020

The x86 encoding is fairly simple as there are only four kind of opcodes:

* one byte with implicit operands, for example LOOP or RET or ADD AX,imm

* one byte shortcut encoding with registers in bits 0-2 of the only opcode byte, these are 16-bit instruction only: INC/DEC, PUSH/POP, XCHG AX,rr (these instructions are also available as 2-byte encodings)

* two byte with one register and one register/memory operand, both encoded in the second byte (bits 3-5 and 0-2/6-7, with bit 0 of the first byte determining the size and bit 1 of the first byte determining if the source is the register or the register/memory operand)

* two byte with a single register/memory operand, in which case bits 3-5 of the second byte are part of the opcode

Prefixes and immediate operands complicate things a bit but that's it. It's refreshingly simple compared to the 386 and later.

An interesting tidbit is that registers are ordered AX/CX/DX/BX in the instruction set encoding because their function roughly matches AF/BC/DE/HL in the 8080 (BX comes last because like HL it can be used to address memory, even though the encoding of memory operands is completely different on the 8086)

ajross · on Aug 8, 2020

It might be hard to trace but I'd love to see a discussion on clock distribution. Maybe this was a bit too early in VLSI to show some of the weird tricks that came along later, though.

kens · on Aug 8, 2020

I was thinking of writing about the clock. The circuitry to generate the two-phase clock from the single phase input clock is a bit interesting. There are multiple stages of large transistors to produce the high-current internal clock with two disjoint phases.

The clock distribution itself doesn't seem to have any weird tricks, just the two clock lines that wind around the chip. I don't see any optimizations to equalize length to different parts of the chip, for instance.

kens · on Aug 15, 2020

You're in luck: I just wrote a blog post about clock and power distribution in the 8086: http://www.righto.com/2020/08/how-8086-processor-handles-pow...

adrian_b · on Aug 8, 2020

You have already reverse-engineered the address adder, maybe it is time to study the ALU and write about it.

I am very grateful for your published work, thanks a lot !

kens · on Aug 8, 2020

I'm working on the ALU right now :-) So far, it looks much more similar to the address adder than I expected. The same basic structure, but with control signals that adjust how the inputs are combined, so you get multiple operations.

egsmi · on Aug 9, 2020

Can you write about how they did IO then? For example, was there ESD protection?

I’d be interested in a compare and contrast to a MOSIS or UoU IO PAD cell.

http://pages.hmc.edu/harris/cmosvlsi/4e/lect/lect23.pdf

Oh ya, love the articles! It’s very interesting to compare an older successful design to more modern techniques.

P.S. Are there multiple clock domains in a 8086? If you could find a clock synchronizer or a level shifter that would be very interesting too.

P.P.S. Did you find any DFT features? Was DFT even a thing in the 70s?

kens · on Aug 9, 2020

The I/O pins have a resistor and diode to ground as protection. I didn't see any diodes to VCC. The outputs use a chain of successively-larger transistors like modern output cells, to get from the tiny internal current to the large external current.

The pins have a variety of implementations, depending on if they are tri-state, input only, output only, bidirectional, and other factors.

The 8086 has a two-phase clock. It's all one clock domain. The asynchronous input pins (I think reset and interrupt) have clocked latches to synchronize them.

What's DFT?

egsmi · on Aug 9, 2020

DFT = Design For Test https://en.m.wikipedia.org/wiki/Design_for_testing

They have a resistor to ground in parallel with the buffer? I get the diode but I can only assume the resistor is a weak pull to prevent charge building up at the pin. But it seems the diode would mostly do that job, to the point of preventing damage anyway. I guess they weren’t concerned with leakage power then.

Protection resistors in series with the buffer are fairly common these days. Especially on GPIO type pins.

kens · on Aug 9, 2020

The resistor is on series. I.e pin - resistor - diode to ground and circuitry.

The resistor is just some silicon, probably around 30 ohms.

egsmi · on Aug 9, 2020

That makes sense. That’s the current limiting resistor. It protects the protection. See page 12 of this deck.

http://www.eng.biu.ac.il/temanad/files/2017/02/Lecture-9-IO....

jecel · on Aug 8, 2020

Would the latch still work without the clock pass transistors? It seems to me that as long as there is no overlap between hold and load and they are not both low for too long it would store a bit in a controlled fashion. The output would change earlier with no relation to the clock, so that might be a problem (or not) depending on the next circuit.

kens · on Aug 8, 2020

Yes, that's how the 6502's registers work: two inverters with pass transistors for hold and load. As you point out, the tradeoff is that the latch is no longer clocked, so the latch doesn't help you do things synchronously.

burfog · on Aug 8, 2020

The overhead for the two near-useless flags, PF and AF, would be interesting. How much did that junk slow down the clock speed?

kens · on Aug 8, 2020

I'm guessing the impact on clock speed was minimal, but the history of the parity flag is interesting.

Automatically computing parity is very important if you're implementing a terminal, since serial communication often uses a parity bit. The Datapoint 2200 was a programmable terminal / desktop computer introduced in 1970. Earlier than the microprocessor, it had a processor built from a bunch of 7400-series TTL chips.

The Datapoint company asked Intel and Texas Instruments about replacing the board of chips with a single chip. Intel eventually produced the 8008 to duplicate this functionality (including the parity flag), while Texas Instruments produced the TMX 1795 chip. Meanwhile, Datapoint redesigned their processor board around the 74181 ALU chip and decided they didn't want a microprocessor.

Texas Instruments couldn't find a customer for the TMX 1795 and abandoned it, but Intel decided to sell the 8008 as a general-purpose microprocessor, essentially creating the microprocessor market. This led to the Intel 8080 and then the 8086 processors, which because of backwards compatibility kept the parity flag (and the little-endian architecture). So that's why you have the near-useless PF flag.

kevin_thibedeau · on Aug 8, 2020

PF is three levels of XOR gate delays. Not a big deal.

anticensor · on Aug 9, 2020

LEA and scaled indexes will be a good choice.

SomeoneFromCA · on Aug 9, 2020

8086 does not scale indexes afair.

bogomipz · on Aug 9, 2020

I would love to see a post on the instruction decoding phase and and the microcode ROM next.

poseid · on Aug 8, 2020

what about sram?

kens · on Aug 8, 2020

The 8086's register file is built from sram. (With some adjustments to make it multi-ported.) I wrote about that a few weeks ago: http://www.righto.com/2020/07/the-intel-8086-processors-regi...

pansa2 · on Aug 9, 2020

What resources would people recommend to a software developer who wants to understand more about hardware at this level?

How can I go from only knowing the basics to being able to reverse-engineer a schematic from a die photo, as in this article and this recent tweet [0]?

[0] https://twitter.com/gekkio/status/1292206670710480896

userbinator · on Aug 8, 2020

In contrast, compare with the 8008's instruction register which can be seen very prominently in the top middle here:

http://www.righto.com/2016/12/die-photos-and-analysis-of_24....

The 8 orange pieces are polysilicon, forming "bootstrap capacitors" which improve the signal levels when driving the instruction decode PLA below it. That was enhancement-load PMOS, so I guess by the time of the 8086 when depletion loads were already common, the bootstrap had fallen out of favour and superbuffers became preferred instead.

bogomipz · on Aug 9, 2020

The author states:

>"The transistors have complex shapes to make the most efficient use of the space."

What are some of the constraints and factors that go into deciding the individual shapes here? Does each Transistor have to meet a minimum width or gate length?

>"The dynamic latch depends on a two-phase clock, commonly used to control microprocessors of that era."

I thought this was interesting. When exactly did chip makers move away from using two separate clocks?

kens · on Aug 9, 2020

A key factor for a MOS transistor is the gate's width/length ratio, since the current is proportional to this. This ratio is tuned to the particular role of the transistor, usually by adjusting the width. Sometimes the gate zig-zags to fit the width into the available space.

As far as clocks, four-phase clocks were popular for a short time earlier. I don't know details of the clocks in modern processors, so maybe someone can comment.

tannernelson · on Aug 8, 2020

Wow this is amazing