> “For many microcontroller applications, division operations are too infrequent to justify the cost of divider hardware,” explained Himelstein. “The RISC-V Zmmul extension will benefit simple FPGA soft cores in particular.”
This doesn't look like bloat to me, but I have no experience with RISC-V.
Despite sounding weird, multiplication without division is actually very helpful for low-cost, resource-constrained MCUs and soft CPUs. Even when you build with the `-mno-div` compilation flag, division instructions can still appear in precompiled binaries that you link against, because those were built according to their own `-march` setting, not your flag. Standardizing multiply-without-divide as a named extension lets us avoid dealing with this.
As a side note, I ran into this exact problem in a work project last year and needed to implement software division in an illegal instruction handler to get around it.
It's always been legal to say you implement the M extension, but only actually implement multiply in hardware, and provide a trap and emulate solution for the division instructions.
There has long been a -mno-div flag for gcc and clang to tell them not to generate divide instructions (i.e. call a library divide function directly, to save trap overhead) even though the M extension is present. This just provides a way to express that commonly-used functionality via an ISA string instead.
Wait, having no DIV instruction can result in faster code than having a DIV that traps. The compiler sees the code that would otherwise run in the trap handler and can optimize it. For example, an 8-bit software division can be a lot faster than a trapped 32-bit DIV on a microcontroller.
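For reference, the software fallback is tiny: a restoring shift-subtract loop like the sketch below is roughly what a `__udivsi3`-style library routine or a trap handler does (this is an illustrative sketch, not any particular library's code), and a compiler that knows there's no DIV can inline and specialize it for narrow operands.

```c
#include <stdint.h>

/* Minimal shift-subtract division, the kind of routine a libgcc-style
 * __udivsi3 or an illegal-instruction trap handler falls back on when
 * there is no DIV hardware. Sketch only; caller must ensure d != 0. */
static uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* does the divisor fit? */
            r -= d;
            q |= 1u << i;               /* record a 1 in the quotient */
        }
    }
    if (rem) *rem = r;
    return q;
}
```

One iteration per bit is exactly the 1-bit-per-cycle behaviour a microcoded divider would have; in software the compiler can shrink the loop when it knows the operands are only 8 bits wide.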
There’s also the overhead of trapping (registers get saved, the CPU may switch to kernel mode). For a single integer division, chances are that adds significant overhead compared to a subroutine call (let alone inlined code)
Yes, but that still seems better than not standardizing it as an architecture variant: you can always fall back on the trap when you can't avoid linking with binaries built for something beyond the Zmmul-only architecture.
You're thinking of cellphones and workstations, but Zmmul is about microcontrollers. The temperature controller in your Pinecil soldering iron might not implement division or have a trap mechanism at all, nor enough Flash to waste on unused division trap handlers, and there's no point in trying to link a precompiled copy of OpenSSL into its firmware anyway. Nevertheless, if it's RISC-V, you can compile its firmware with GCC or clang. And hardware multiplication is super useful for real-time control systems with, for example, PID controllers in them.
Of course you don't need an OS and a 32-bit computer in your soldering iron, but they are nice.
An OS is nice because you can have multiple real-time tasks with different priority levels, so that you can, for example, update the screen without worrying that it will screw up the PID control loop maintaining the iron's temperature.
You don't need a screen, but a screen is nice because you can see what temperature the iron is set to and what temperature it's measuring. This makes it easier to set the temperature. Also IronOS lets you set the cutoff voltage so you don't kill your batteries by draining them down to zero, and it has an option to correct inaccurate temperature readings from the tip by calibrating the tip.
PID temperature control is nice, not only because the iron is ready for use much more quickly, consumes much less energy (making battery power practical), and weighs less, but also because it allows you to solder things to big copper pours without overheating your iron all the time, which means you burn components and lift traces a lot less often. This is especially nice if a heat-damaged component might not fail until after the circuit board is on orbit, which makes it impractical to replace, or until it's in use in a life-critical application, which makes replacement the least of your worries. But you don't need it.
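As a sketch of why multiply-only hardware is enough for this: a fixed-point PID step needs nothing but multiply, add, and shift. The names and the Q24.8 scaling below are illustrative assumptions, not taken from IronOS or any real firmware.

```c
#include <stdint.h>

/* Fixed-point PID step: gains are scaled by 2^8 (Q24.8), so the only
 * arithmetic needed is multiply, add, and shift -- no division
 * instruction anywhere. Illustrative sketch, not real firmware. */
typedef struct {
    int32_t kp_q8, ki_q8, kd_q8;  /* gains in Q24.8 fixed point */
    int32_t integral;             /* accumulated error */
    int32_t prev_error;           /* for the derivative term */
} pid_state;

static int32_t pid_step(pid_state *p, int32_t setpoint, int32_t measured)
{
    int32_t e = setpoint - measured;
    p->integral += e;
    int32_t d = e - p->prev_error;
    p->prev_error = e;
    /* >> 8 undoes the Q24.8 gain scaling: an arithmetic shift stands in
     * for the divide-by-256 a floating-point PID would write as /256.0 */
    return (p->kp_q8 * e + p->ki_q8 * p->integral + p->kd_q8 * d) >> 8;
}
```

With the gains chosen as powers-of-two-scaled integers at build time, the control loop compiles down to a handful of MUL/ADD/SRA instructions, which is exactly the workload Zmmul targets.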
A 32-bit processor is nice because it means you can use 32-bit math for everything instead of constantly worrying about overflow. Also, you can address the entire 128 KiB of Flash without shitty memory segmentation headaches when you're writing the firmware. 128 KiB of Flash is nice because it's easy for even a single person to write more than 64 KiB of code, even with RV32C's superior code compactness and without using libraries — though, to be fair, an alternative way to get 128 KiB or 256 KiB of code space is to make your code space word-addressed instead of byte-addressed, and then make your instructions 16 bits or 32 bits long. Generally this requires a Harvard architecture, which is common in microcontrollers but not ideal.
None of these nice things requires or even benefits from a division instruction (though in practice they probably do need interrupt handling and illegal-instruction traps, which could be used to emulate division if it were really necessary). So I always thought it was kind of goofy that the M multiply-instruction extension, crucial to DSP work, was saddled with this white elephant of hardware division support. I'm glad to see that's fixed as far as it can be with Zmmul.
The GD32VF103 used in the Pinecil doesn't have memory protection, so IronOS https://github.com/Ralim/IronOS might not be the kind of operating system you're thinking it is. It doesn't have, for example, a filesystem, a command prompt, a GUI, a task list UI, or a networking stack. I don't even think tasks are created and destroyed at runtime; all the running tasks are compiled into the OS image.
Having an OS and a 32-bit microcontroller in your soldering iron might be a terrible idea: for example, it could include malware that displays advertisements or disables your soldering iron if you haven't paid your monthly subscription. The Pinecil solves this problem by being 100% free software, so that even if the IronOS maintainers try to pull something like that, you have the freedom to fork their code and remove the malware.
You still have the problems that the iron can have more complicated malfunctions and is harder to repair than an iron that is just a resistor that plugs into the wall. But I think the advantages of tighter control more than make up for that.
But if you really need to, you can solder your circuits by heating up a chunk of brass in the flame of your gas stove, then pressing it to the circuit board before it cools off. I've done it. But I'd rather not.
I appreciate the thoughtful reply. Yes, maybe not having to contort yourself to do everything with an 8-bit PIC is a nice upgrade in quality of life. I'm just a grumpy old man and grouse at putting the 32-bit MCU and screen forth as a top reason a soldering iron is good. Agree 100% on PID control, which my WES51 does pretty well and is still a good step above the dumb resistive heat firehoses.
Divide being missing is not totally uncommon in CPU architecture. If I'm not mistaken, the Cray-1 didn't have a divide instruction either, just reciprocal approximation.
SH4 (and SH2, I think?) was an interesting midpoint, which architecturally exposed the single-cycle ops that would back an iterative divide, letting you pick the precision you wanted.
Are there really (enough) RISC V cores having no microcode sequencer at all and would have to add one just to implement a 1 bit per cycle microcoded divide instruction?
I can’t think of many RISC-V cores that are microcode sequenced. There are some toy implementations using ucode - but I can’t think of any that have seen serious use.
If you take a step back and define the concept in question as
'cores that have pipeline stages that can block the pipeline as they work their way through a core local state machine (whether that state machine is encoded in a table form via microcode or via a sea of logic)', then there's quite a few on the low end for multiply support. Remember that RV32EM is competing with Cortex-M0, which can get down to 12k gates when it has an iterative, bit-serial multiply.
Whilst true for multiply, divide is considerably more complex. You have to step up to the M3 to get a divide instruction on the ARM side. This is why the new Zmmul is valuable over the existing M extension.
Agreed, but the point was that state machine, iterative, blocking pipeline stages are pretty common even in RISC cores, including particularly ascetic RISC-V and ARM cores. But I can get how at ~12k gates you don't even want the slowest iterative divide unit.
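For a concrete picture of what such an iterative, blocking stage does, here's a C model of a bit-serial shift-add multiply: one partial-product add per "cycle", 32 cycles per multiply. This is a behavioural sketch, not any particular core's RTL.

```c
#include <stdint.h>

/* Behavioural model of an iterative shift-add multiplier: the kind of
 * state machine a ~12k-gate core uses instead of a single-cycle array
 * multiplier. One conditional add per modelled cycle. */
static uint32_t mul32_iterative(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    for (int cycle = 0; cycle < 32; cycle++) {
        if (b & 1)       /* low multiplier bit selects whether to add */
            acc += a;
        a <<= 1;         /* shift the multiplicand up ... */
        b >>= 1;         /* ... and the multiplier down */
    }
    return acc;          /* low 32 bits of the product, i.e. MUL */
}
```

In hardware this is a shift register, an adder, and a 5-bit counter; divide needs the same loop structure plus a comparator and restore path, which is why it's noticeably more expensive at the very low end.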
While it's great that we've started to standardize across architectures on a firmware for servers and workstations, I really wish it hadn't been UEFI. It's well documented, but in my (admittedly limited) experience, a pain to program for, with one of the ugliest APIs I've ever used.
I too am disappointed, as it feels like we're just bolting on Intel baggage to appease commercial interests who couldn't care less about openness or maintainability.
On one hand I want RISC-V to have a firmware standard for their applications processors so we don't have the vendor specific situation we have with Arm. On the other, I do not want Microsoft approved, overly complex Intel contraptions making a mess of an open architecture.
UEFI is a mess, admittedly: a large attack surface, a bad API surface, DOS-like command syntax. It's all pretty bad.
However, it's not as bad as the total free-for-all that booting most Android or ARM devices is. They're like a box of chocolates - you never know what you're going to get. Standardization, even if the standard is poor, is much better in this case than no standard.
Seems like they could have specified that only the "good parts" of UEFI are required, i.e., the minimal, simplest part actually used. After all, on x86, that is the only part that can actually be counted on to work. The rest is shovelware.
>Standardization, even if the standard is poor, is much better in this case than no standard.
Have you heard of Coreboot? UEFI isn't the only game in town but it happens to be the only game that boots modern Microsoft products. MS is a technological burden that needs to go away.
My Toyota Prius is an insanely complex machine, and yet all it has to do is go forward when I press the accelerator. A UEFI-style over-engineered solution would require me to iterate over AcceleratorProviderProtocols in case someone implemented an accelerator pedal that works differently.
U-Boot is over 2 million lines of code. I don't know offhand exactly what functionality that does and doesn't include, but I think it is fair to argue that neither one is especially simple.
Love or hate UEFI, it's the de facto standard for booting modern hardware, and the added support gets us closer to a viable desktop platform. The success of the M1 has shown the advantages of a consumer RISC platform, but the holy grail for many of us is a totally open platform with similar power/performance advantages to Arm's.
UEFI doesn’t have to be so unwieldy either. You can have UEFI with device tree instead of ACPI, and if that’s sufficient then u-boot will give you a working UEFI environment.
RISC isn't about the number of total instructions, but the complexity of the individual instructions. Consider the ARM "JavaScript instruction" people love to make fun of. FJCVTZS sure sounds like a ridiculous instruction name right out of x86 -- but in reality, it performs a very simple task that greatly speeds up JavaScript code, and it's trivial to implement in hardware but painful in software. If you're going to be doing a task like that frequently it absolutely makes sense to have an instruction for it, and RISC vs. CISC doesn't change that.
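For the curious, FJCVTZS implements roughly ECMAScript's ToInt32: truncate toward zero, wrap modulo 2^32, NaN and infinities to 0. Below is a C sketch of that semantics (written without libm for self-containment); the real instruction, as I understand it, also sets a condition flag describing the conversion, which is omitted here.

```c
#include <stdint.h>

/* Sketch of the JS ToInt32 conversion that ARM's FJCVTZS performs in one
 * instruction. Not a model of the hardware; just the numeric semantics. */

static double trunc_to_integer(double d)   /* trunc() without libm */
{
    union { double f; uint64_t u; } v = { d };
    int exp = (int)((v.u >> 52) & 0x7ff) - 1023;  /* unbiased exponent */
    if (exp < 0)   return 0.0;                    /* |d| < 1 (sign unused) */
    if (exp >= 52) return d;                      /* already an integer */
    v.u &= ~(0x000FFFFFFFFFFFFFULL >> exp);       /* clear fraction bits */
    return v.f;
}

static int32_t js_to_int32(double d)
{
    if (d != d || d - d != 0.0) return 0;         /* NaN or +/-Inf -> 0 */
    double t = trunc_to_integer(d);               /* round toward zero */
    /* reduce modulo 2^32; multiplying by a power of two is exact, so
     * this stays exact even for very large magnitudes */
    double q = trunc_to_integer(t * (1.0 / 4294967296.0));
    t -= q * 4294967296.0;
    if (t < 0) t += 4294967296.0;                 /* now in [0, 2^32) */
    uint32_t u = (uint32_t)t;
    /* reinterpret as signed: relies on the universal two's-complement
     * conversion behaviour (implementation-defined in strict C) */
    return (int32_t)u;
}
```

Doing this in software takes a compare-and-branch dance on every JS number-to-int conversion; doing it in one instruction is cheap in hardware, which is exactly the parent's point about RISC vs. instruction count.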
What makes an architecture CISC is when the instructions start to do too many things at once. For instance, on x86 you can write something like `ADD [rax + 0x1234 + 8*rbx], rcx`. This one instruction involves a bit shift, two additions, a memory load, a third addition, and a memory store (and you can stick on prefix bytes to do even more things). Whereas on a RISC you'd split this operation into four or five separate instructions.
Crucially, the work that the processor has to do doesn't actually change -- the CISC processor is just going to split the instruction apart into RISCy microoperations anyway. So the RISC processor has to do a little more memory access to read all the instructions, but much less decoding work to lower them into the same microcode.
Backing this up, here's an ancient usenet post by John R. Mashey (one of the MIPS designers at SGI) defending HP's use of BCD instructions in their RISC core. https://www.yarchive.net/comp/bcd_instructions.html
HP looked at their actual uses at the time, found a lot of COBOL that required BCD, and included single cycle, register-register instructions to accelerate that use case that was very important to them.
> What makes an architecture CISC is when the instructions start to do too many things at once. For instance, on x86 you can write something like `ADD [rax + 0x1234 + 8*rbx], rcx`.
Amusingly, one of the common criticisms against RISC-V is that it doesn't have more complex addressing modes like "base + index << shift", which other RISC ISAs like ARM or POWER do have, and which is quite widely used in application code (think array indexing). This doesn't mean that a RISC CPU must split it up into internal micro-ops; AFAIK at least higher-end ARM designs implement it all in a single micro-op by having a small adder and shift unit as part of the memory pipe (adders and limited shifters are really small area-wise).
Indeed, and this is where the usenet post linked in a sibling comment is really informative:
> In my opinion, the most common reasonable reason for including some instructions is:
>
> a) There is a datatype deemed worth supporting, i.e., whose use burns an interesting number of cycles, either because:
>    1) It gets used often enough, OR
>    2) It is used less often, but is really expensive.
> b) There is a reasonable hardware cost to support the datatype.
> c) Simulating the use of the datatype with the existing instructions is noticeably expensive.
>
> In practice, b)+c) usually mean that the datatype operations can obtain useful parallelism in hardware, that cannot be gotten at by sequences of the existing instructions. IF it is difficult to beat sequences of existing instructions in speed, then only if those sequences are really frequent would one add new instructions, i.e., for code density.
Adding a little adder and shifter to the memory access unit meets all these conditions -- it's an operation that is effectively free in hardware, but would cost extra cycles & bytes to implement in software.
So the "problem" with x86 isn't that such a complex addressing mode exists. It's that the ADD instruction has such a complex addressing mode, as does nearly every other instruction in the ISA. An 'ADD mem, reg' instruction is really a load, an add, and a store, and there's no hardware magic you can do to accelerate that -- it's just as fast to write "load, add, store" in software, so this fails tests b) and c).
A RISC will typically have, let's say a dozen load/store instructions to support a variety of addressing modes (probably more if you account for atomics and vectors and non-temporals and whatnot). Then you only need one ADD instruction, one SUB instruction, one XOR instruction, maybe one FJCVTZS instruction, etc. But on x86, you get sort of a combinatoric explosion between possible combinations of instructions, addressing modes, and prefix bytes, which means your decoder has to support a dozen ADDs, a dozen SUBs, a dozen XORs, a dozen FJCVTZSes, and whatever.
> which means your decoder has to support a dozen ADDs, a dozen SUBs, a dozen XORs
Does it? The way I've always internalized it is the SIB and MOD/RM are just adding AGU options to the instruction. They're there in the underlying microcode set, and x86 just happens to give me direct access to them through opcode extensions. Is that not how it's actually implemented?
I have no idea how it's implemented under-the-hood, I don't even know what any of those acronyms stand for :)
Regardless of how it's implemented, I don't believe it changes my point that an 'ADD mem, reg' really is at least 3 separate operations, and code density is the only potential benefit to expressing them as a single instruction (on a modern architecture). If you have suggestions as to how I should reword my comment, I'm happy to make edits.
Yes, the address unit handles the shift+add (both on x86 and ARM).
That everything is translated into RISC-like "microcode" is a bit of a meme, it was only really true for the first generation 8086/8088 (from 1978). Those early chips actually had microcode subroutines to compute for example BX+DI and move the result into an internal address register. But every following x86 generation did it in hardware.
Modern CPUs with out-of-order execution do break read/modify/write instructions into separate micro-ops however.
And that's why RISC processors expose more architectural registers than CISC processors.
(A read-modify-write instruction on a CISC processor will still implicitly need a temporary physical register; RISC processors simply expose a larger set of logical registers since a lot of these implicit temporary registers become explicit.)
Nearly everything supports the C extension. The designers initially expected that small cores wouldn't implement it, but quickly discovered that the code size improvements more than paid for the extra gates.
Even really simple MCUs are likely going to use compressed instructions. A minimal RISC-V core apparently takes something like 8k gates; compressed instructions add around 400 gates, or about 5% more area in this case.
SRAM needs 6 transistors per bit, so a reduction of just 67 bits (a bit less than 8.5 bytes of storage) is enough to make this worth the gate cost. If you have just 1 KB of I-cache, a 10% savings, or 100 bytes, would be more than 10x the cost of implementing compressed instructions.
In short, I can't think of any reason you'd target a 32-bit microcontroller RISC-V design and not decide to also implement compressed instructions. If you're so sensitive to this cost, there are probably 8/16-bit MCs that would be a better choice.
Most microcontrollers don't have I-cache, and I don't know of any 8/16-bit CPU cores that are significantly smaller on an FPGA than the SERV implementation of RISC-V, which fits in 200 4-LUTs. Do you?
PicoBlaze in particular is about the same size as SERV (96 6-LUTs, I think) and really limited even compared to other 8-bit cores. It's a lot faster than SERV, though, and has denser code; it has to, since its maximum program size is 1024 instructions. SERV's RISC-V support lets you compile your code with GCC or clang.
I do agree with your broader point that the C extension trades off a bit of decode complexity against a tiny amount of program memory size, even if that's mask ROM or something instead of SRAM I$, so there are not many cases where omitting the C extension is a good tradeoff. Just, I'm not sure the architectural data word width really bears on this.
I-cache was just one example that most people find easy to understand.
The original paper on compressed instructions claims a 25-30% reduction in total instruction size. A typical small M0-class microcontroller will have something like 4-32 KB of RAM and 32 KB to 1 MB of flash storage.
No matter how you slice it, saving 30% of your instruction space is a HUGE deal. Even with flash cells taking just 2 transistors per bit, those extra decode gates cost the equivalent of only about 200 bits, or 25 bytes, of flash: just over SIX 32-bit instructions.
With that optimistic 30% number, any program with more than about 21 instructions sees a net benefit from compressed instructions. Even my very pessimistic 10% number would mean only around 63 instructions before you break even.
While I agree with the principle and overall conclusion I think you're probably confusing gates and transistors. While you can make a 2-input NAND gate on a breadboard using one transistor and some resistors, in a CMOS chip resistors are more expensive than transistors (and suck energy) so actually a 2-input NAND gate will use 4 transistors, a 3-input NAND will use 6 transistors etc.
You might be confusing LUTs and gates also.
Regardless, by the time you have a couple of KB of ITIM or icache or maybe 10 KB of code executed directly from flash, you're better off implementing C.
On anything running a real OS it's just ludicrous not to have C.
RVC is designed such that all compressed instructions expand to a single full length instruction. So it’s really only the decode stage and the PC alignment that change.
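As an illustration of how mechanical that expansion is, here's a sketch of one case, C.ADDI, with field positions taken from the RVC encoding (error checking and the other compressed opcodes omitted):

```c
#include <stdint.h>

/* One-to-one RVC expansion: the 16-bit C.ADDI (funct3=000, op=01)
 * becomes the 32-bit ADDI rd, rd, imm. Sketch of a single decode case,
 * not a full decoder. */
static uint32_t expand_c_addi(uint16_t c)
{
    uint32_t rd  = (c >> 7) & 0x1f;              /* rd/rs1 in bits 11:7 */
    int32_t  imm = ((c >> 2) & 0x1f)             /* imm[4:0] in bits 6:2 */
                 | ((c & 0x1000) ? ~0x1f : 0);   /* bit 12 sign-extends */
    /* ADDI layout: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 */
    return ((uint32_t)imm & 0xfff) << 20 | rd << 15 | rd << 7 | 0x13;
}
```

Because every compressed instruction maps to exactly one full-width instruction like this, the rest of the pipeline never knows RVC exists; only instruction fetch alignment and this expansion stage change.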
>RVC is an ISA extension and not all implementations support it.
I have never seen a commercially-produced RISC-V core that doesn't include the C extension. It's simply too cheap and too useful to leave out. You make back the cost of the C decoder in probably the first 1 or 2 KB of ITIM/icache.
The only things that don't have the C extension are educational examples/student projects and extremely simple cores for FPGA uses where they probably run controller code with just a few dozen instructions.
You'll certainly never see anything with 64 bit, an MMU, or an FPU that doesn't have the C extension. You could do it, but it's stupid.
It was always the goal to support everything from micro to HPC, so more extensions were always expected. However, the R in RISC refers to how many instructions you have to implement to get a job done.
If you write for a microcontroller, RISC-V is just as small now as it was when it started.
> However, the R in RISC refers to how many instructions you have to implement to get a job done.
Right, the way I always understood it was: in many situations there are different RISCs that are appropriate (e.g. I only need 32-bit integer and atomic instructions for job X), so instead of accumulating a larger and larger single spec, you can pick and choose what's needed.
You can still have your RISC cake (fewer, simpler, more efficient instructions) and eat it too (deploy it in various divergent situations and still be standard).
RISC fell by the wayside decades ago. Two of them.
It was really only ever a design touchstone, a reminder that simple operations are easier to fit into a short clock cycle, and complex instructions are not necessarily faster than a series of simple ones. Since the time RISC still was considered a plausible thing, we got cache footprints, and micro-ops in a cache of their own.
What matters, ultimately, is performance. RISC was once a means to get performance. Nowadays we do all kinds of crazy shit to get performance, very little of it compatible with what RISC was supposed to enable.
I want to see every extension be forwards and backwards compatible.
I.e., a binary should run fine on a processor with or without a given extension.
To make that happen, every processor extension should either be implemented in hardware, or be emulated by software.
None of these extensions to the ISA have software implementations. Sure, someone could write one, but I don't think the RISC-V standards group should allow any extensions to make it into silicon without a software implementation being published and freely available.
The RISC-V group should publish an official binary (and source) which can trap and emulate every possible extension.
Requiring a software fallback to be present on every system would IMO kill the platform for small-ish CPUs. For example, why would a CPU with a few kilobytes of RAM need to support vector extensions? Why would it have to support UEFI if it doesn’t have or support disks? For CPUs intended to be used in small embedded systems, supporting either easily could make the necessary amount of ROM ten times as large, and thus significantly increase their cost.
Also, having source code to emulate functionality may be nicer, but publishing both source code and a spec will run into problems when the two aren’t compatible with each other.
Finally, I’m not sure it’s possible to emulate all extensions using only the minimal subset of RISC-V instructions. Examples are support for atomic instructions, hypervisors, instruction fences, transactional memory, weak memory ordering, or user-level interrupts.
You can't allow for forward compatibility without sacrificing performance. If a chip knows that there is a fixed set of opcodes, it can specialize decoding to only decode the opcodes it knows about.
> If a chip knows that there is a fixed set of opcodes, it can specialize decoding to only decode the opcodes it knows about.
What does that mean? The CPU still needs to trap on undefined ones. I don't think the spec even allows non-trapping behavior[a]. So there's still some kind of decode happening. And if each instruction is 32 bits, you don't need any logic to find the instruction boundary on an undefined one.
[a]: By "non-trapping", I mean behavior akin to what old processors would do where every opcode would do something because of gaps in the PLAs (such as 0xAF on the 6502 performing 0xAD (LDA abs) and 0xAE (LDX abs) at the same time)
The RISC-V spec does allow non-trapping behavior, and SERV in particular has non-trapping behavior, which is an important part of how it can fit into 200 4-input LUTs. Trap handling is also not part of the base ISA.
that sounds like a fun project! a suite of minimal m-mode code that implements your selection of “couldn’t bother to have hardware”. want to work on it with me? :)
I have thought that it would be good to have M-mode emulation available of everything except rv32i/rv64i. The AMO operations are a prime candidate. Very annoying for a simple implementation to have to implement read-modify-write in hardware. You can emulate AMOs easily using LR/SC, and in simple single-core systems those are no-ops anyway.
It would be handy to have a special trap vector for illegal instruction, to save a few instructions checking mcause in the normal trap handler. It would also be awesome to have some special CSRs that presented the contents of the rs1, rs2, rs3 encoded in the instruction and another CSR you could write to that put the value into rd. That would save a whole lot of mucking about extracting the register specifier fields from the opcode yourself, and then having to use a switch to read or write each register (or, worse, dump all registers into RAM and then access them as an array).
You might think you'd want another CSR that holds the decoded value of the literal/offset in I, S, SB, U, UJ instruction formats -- but you probably don't want to be emulating any of the basic I ISA. Maybe LB/LBU/SB/LH/LHU/SH? Those can be a pain, and are seldom used.
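For comparison, here's the field extraction a handler has to do by hand today (the CSRs wished for above would make it unnecessary). The shift amounts are the standard base-encoding field positions, which at least makes it cheap, if tedious:

```c
#include <stdint.h>

/* Register-specifier extraction for an illegal-instruction emulation
 * handler. Field positions are the standard RISC-V R-type layout:
 * rd at bits 11:7, rs1 at 19:15, rs2 at 24:20. */
static inline uint32_t insn_rd (uint32_t insn) { return (insn >> 7)  & 0x1f; }
static inline uint32_t insn_rs1(uint32_t insn) { return (insn >> 15) & 0x1f; }
static inline uint32_t insn_rs2(uint32_t insn) { return (insn >> 20) & 0x1f; }
```

The painful part isn't this decode; it's what comes next: turning those 5-bit indices into reads and writes of the interrupted context's actual registers, which is where the switch statement (or the dump-all-registers-to-RAM approach) comes in.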
I'm not sure that slightly more complex microcontroller architectures with DMA engines require AMOs that are coherent with other bus masters. You normally do a song and dance where you just guarantee that you don't need atomic RMW in the first place, handing filled-in descriptors to the peripherals and getting interrupts on completion.
But interrupts themselves complicate AMOs, making them not no-ops and needing a CLI/RMW/STI sequence.
Enough of a pain that SiFive cores take a whole extra cycle for LB/LH (and LW on 64 bit) from L1 cache compared to loading a full register e.g. 3 vs 2.
1. UEFI
2. Supervisor Binary Interface
3. Efficient Trace/Debug
4. Zmmul only. This is the one that confused me.