Hey folks, author here. Appreciate the comments and discussion on the post! Happy to continue the discussion or answer any follow-on questions folks have about our investigation and resolution.
Would just be easier to implement an open-source one from scratch given that most IoT modems are just (poorly written) software with an SDR anyway. CEVAwaves or RivieraWaves certainly is.
If it can be done for BLE, 802.15.4 and WiFi, it can be done for cellular; ORAN is enough for base stations so it's not totally secret. But at the end of the day, aside from the stupid ITU protocols, LTE/5G isn't rocket science. It's bog-standard OFDMA radio -- nothing fancy these days.
I ended up reverse engineering parts of Sony's Altair-derived modems but got bored; maybe I should publish that.
Good read, thanks. To the point of having an open platform: they would probably lose their monopoly. I could then just spin up my own mobile network and provide service like a Wi-Fi network, with any credentials, and do what I want.
I guess there needs to be a big initial effort, with a lot of maintenance afterwards. If everything goes as expected, the AIs could provide that in the future.
Not OP, but I’ve dabbled in nRF91 recently and found that once your application starts doing anything interesting (MCUboot, OTA, softSIM, etc.), the code size explodes. It is particularly difficult to keep TF-M down to a manageable size. 1 MB of flash really doesn’t go that far these days.
Years ago I worked on the nRF52 series with the “old” SDK and felt I had much greater control. I understood the build system. Maybe I’m just grumpy and don’t like change…
That's a Zephyr thing. Same on STM32: add an otherwise trivial driver for some peripheral that has a bunch of dependencies, and your code size explodes by 60 KB.
Nordic is one of the largest contributors to Zephyr though. I get the feeling that they are pushing hard to make it the de-facto RTOS for embedded stuff.
I feel like the whole Zephyr ecosystem is geared towards reducing "time to blinky on every devkit you have" at the expense of mid- to late-stage development efforts. Driver maintainers want their stuff to work out of the box, so they enable _everything_ and code size becomes the end customer's problem.
grumble grumble, I don't like where this is heading.
Great pointer! My sibling post in this thread references a few other blog entries where we have detailed using eDRX and similar low power modes alongside Connection IDs. I agree that many devices don't need to be immediately responsive to cloud to device communication, and checking in for firmware updates on the order of days is acceptable in many cases.
One way to get around this in cases where devices need to be fairly responsive to cloud to device communication (on the order of minutes) but in practice infrequently receive updates is to use something like eDRX with long sleep periods alongside SMS. The cloud service will not be able to talk to the device directly after the NAT entry is evicted (typically a few minutes), but it can use SMS to notify the device that the server has new information for it. On the next eDRX check-in, the SMS message will be present; the device can then ping the server and, if using Connection IDs, pull down the new data without having to establish a new session.
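To make the flow concrete, here is a rough sketch in C. The helper names (modem_sms_pending, dtls_cid_send, enter_edrx_idle, and friends) are hypothetical placeholders rather than any particular SDK's API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers -- placeholders for whatever your modem driver,
 * SMS subscriber, and DTLS stack actually provide. */
extern bool modem_sms_pending(void);
extern void modem_sms_clear(void);
extern void dtls_cid_send(const uint8_t *buf, size_t len);
extern void app_fetch_pending(void);
extern void publish_telemetry(void);
extern void enter_edrx_idle(void);

void connectivity_loop(void)
{
    static const uint8_t ping[] = "ping";

    for (;;) {
        /* The device wakes at the end of an eDRX cycle; an SMS sent while
         * it slept is delivered by the network now. */
        if (modem_sms_pending()) {
            modem_sms_clear();   /* the SMS only says "server has data for you" */

            /* Contact the server from the (likely new) source IP/port. With
             * DTLS Connection IDs the server matches the record by CID, so
             * no new handshake is needed even though the NAT entry is gone. */
            dtls_cid_send(ping, sizeof(ping));
            app_fetch_pending(); /* pull down whatever the cloud queued */
        }

        publish_telemetry();     /* regular uplink goes out as usual */
        enter_edrx_idle();       /* back to sleep until the next eDRX cycle */
    }
}
```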
Is "Non-IP Data Delivery" (basically SMS but for raw data packets, bound to a pre-defined application server) already a thing in practice?
In theory, you get all the power saving that the cellular network stack has to offer without having to maintain a connection. While at the protocol layer NIDD is handled almost like an SMS (paging, connectionless), it is not routed through a telephony core (which is what makes SMS so sloooow). The base station / core will directly forward it to your predefined application server.
It has been heavily advertised, but its support is inconsistent. If you are deploying devices across multiple regions, you likely want them to function the same way everywhere.
An ad for... the IETF? All of the firmware discussed in this post is open source, and we even contributed DTLS Connection ID server side support to a popular open source library [0] so other folks can stand up secure cloud services for low power devices. Sure, we sell a product, but our broader mission is making the increasing number of internet connected devices more secure and reliable. When sharing knowledge and technology is in service of that mission, we do not hesitate to do so.
Author here -- thanks for engaging in the discussion! You won't find any pushback from us on using Zephyr -- we are contributors, the firmware example in the post is using it (or Nordic's NCS distribution of it), and we offer free Zephyr training [0] every month :)
Author of this post here -- thanks for sharing your experience! One thing I'll agree with immediately is that if you can afford to power down hardware, that is almost always going to be your best option (see a previous post on this topic [0]). I believe the NAT post also calls this out, though I could have gone further to disambiguate "sleeping" and "turning off":
> This doesn’t solve the issue of cloud to device traffic being dropped after NAT timeout (check back for another post on that topic), but for many low power use cases, being able to sleep for an extended period of time is more important than being able to immediately push data to devices.
(edit: there was originally an unfortunate typo here where the paragraph read "less important" rather than "more important")
Depending on the device and the server, powering down the modem does not necessarily mean that a session has to be started from scratch when it is powered on again. In fact, this is one of the benefits of the DTLS Connection ID strategy. A cellular device, for example, could wake up the next time in a completely different location, connect to a new base station, be assigned a fresh IP address, and continue communication with the server without having to perform a full handshake.
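For anyone curious what this looks like on the client, here is a minimal sketch using mbedTLS's Connection ID API (mbedtls_ssl_conf_cid / mbedtls_ssl_set_cid, available when MBEDTLS_SSL_DTLS_CONNECTION_ID is enabled). Error handling and the rest of the usual DTLS client setup are omitted:

```c
#include <mbedtls/ssl.h>

/* Sketch: enable DTLS Connection IDs on an existing client config/context.
 * Assumes mbedTLS was built with MBEDTLS_SSL_DTLS_CONNECTION_ID and that
 * conf/ssl have already been set up as a normal DTLS client. */
int enable_dtls_cid(mbedtls_ssl_config *conf, mbedtls_ssl_context *ssl)
{
    int ret;

    /* Ask the stack to negotiate CIDs. The length here is the length of
     * CIDs *we* expect to receive; a battery powered client can use 0 --
     * only the server needs a CID to re-identify us after our IP/port
     * changes. */
    ret = mbedtls_ssl_conf_cid(conf, 0, MBEDTLS_SSL_UNEXPECTED_CID_IGNORE);
    if (ret != 0) {
        return ret;
    }

    /* Offer an empty own CID for this connection before the handshake. */
    return mbedtls_ssl_set_cid(ssl, MBEDTLS_SSL_CID_ENABLED, NULL, 0);
}
```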
In reality, there is a spectrum of low power options with modems. We have written about many of them, including a post [1] that followed this one and describes using extended discontinuous reception (eDRX) [2] with DTLS Connection IDs and analyzing power consumption.
Author here. I’ve got a few more posts on VPR coming in the next couple of weeks. If you have any requests for deep dives into specific aspects of the architecture, feel free to drop them here!
Thank you so much for asking, I have oh so many requests...
Personally, I'm mostly interested in the ARM vs RISC-V compare and contrast.
- I'd be very interested in comparing static memory (flash) and RAM requirements for programs that are as similar as you can make them at the C level, using whatever toolchain Nordic wants you to use.
- Since you're looking to do deep dives, I think looking into differences in the interrupt architecture and any implications for stack memory requirements and/or latency would be interesting, especially as VPR is a “peripheral processor”
- It would be interesting to get cycle counts for similar programs on ARM and RISC-V. This might not be very comparable, though, as the ARM architectures seem to be more complex, so we would expect a lower CPI from them. Anyway, I think CPI numbers would be interesting.
The Raspberry Pi Pico 2 of course also uses the Cortex-M33, along with a RISC-V core (Hazard3, developed by one of Raspberry Pi's engineers in his spare time!) that has very similar performance, other than not having an FPU.
It's pretty easy to compare the same C code on both CPUs on a Pico 2, where you have equal RAM, equal peripherals etc.
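Something like the sketch below, built once per core, gets you a first-order comparison. It uses the pico-sdk's time_us_64(); the PICO_PLATFORM names mentioned afterwards are from memory, so double-check them:

```c
#include <stdio.h>
#include <stdint.h>
#include "pico/stdlib.h"

/* Toy workload: compute-bound, so we mostly measure the core rather than
 * flash or peripherals. */
static uint32_t checksum(uint32_t iterations)
{
    uint32_t acc = 0x12345678u;
    for (uint32_t i = 0; i < iterations; i++) {
        acc = (acc << 5) + acc + i;   /* cheap mix, keeps the ALU busy */
    }
    return acc;
}

int main(void)
{
    stdio_init_all();
    sleep_ms(2000);                   /* give USB serial a moment */

    uint64_t t0 = time_us_64();
    uint32_t result = checksum(10 * 1000 * 1000);
    uint64_t t1 = time_us_64();

    printf("result=%08lx elapsed=%lu us\n",
           (unsigned long)result, (unsigned long)(t1 - t0));
    for (;;) {
        tight_loop_contents();
    }
}
```

Build it twice (the pico-sdk selects the core via the platform, something like PICO_PLATFORM=rp2350-arm-s vs rp2350-riscv) and compare the elapsed times; both cores run at the same clock, so wall-clock time gets you most of the way to a CPI comparison.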
Why did they go with the 64-bit Arm core instead of an RV64 core? (Or an alternative question: why go with the 32-bit RISC-V core instead of an Arm M0?)
Does having mixed architectures cause any issues, for example in developer tools or build systems? (I guess not, since already having 32- vs 64-bit cores means you have effectively a "mixed architecture" even if they were both Arm or RISC-V)
What's the RISC-V core derived from (e.g. Rocket Chip? Pico?), or is it their own design?
They haven't gone with a 64-bit ARM core. The ARMv8*M* isn't 64-bit, unlike ARMv8R and ARMv8A (the nomenclature can get confusing). The differences between ARMv7M (especially with the optional DSP and FPU extensions) and ARMv8M mainline are fairly minor unless you go bigger with an M55 or M85, which (optionally) add the Helium SIMD extension. At the low end, ARMv8M baseline adds a few quality-of-life features over ARMv6M (e.g. the ability to reasonably efficiently load large immediate values without resorting to a constant pool). Also the MPU got cleaned up to make it a little less annoying to configure.
ARMv8A and ARMv8R can both be pure 32 bit as well, incidentally -- e.g. Cortex-A32 and Cortex-R52. v8A added 64 bit, but it didn't take away 32 bit. It's not until v9A that 32 bit at the OS level was removed, and even there it's still allowed to implement 32 bit support for userspace.
M33 has, among other things, TrustZone. So there’s some feature, along with developer familiarity and tools that make ARM desirable for an application processor.
Mixed architecture doesn’t really cause any significant problems.
Will open-source developers unable or unwilling to sign an NDA get access to a toolchain to run their own code on the RISC-V co-processors? Is the bus matrix documented somewhere? Does the fast co-processor have access to DMA engines and interrupts?
FYI Nordic said on their YouTube channel that the RISC-V toolchain that already ships with Zephyr's SDK will support the cores. See around 00:56:32.520 [1]
When more general-purpose hardware (i.e. CPU cores) is added to chips like this, it is to replace the need for single-purpose devices. True nightmarish complexity comes from enormous numbers of highly specific single-purpose devices, which all have their own particular oddities.
There was a chip a while back which took this to a crazy extreme but threw out the whole universe in the process: https://www.greenarraychips.com/
Not wrong, especially for microcontrollers where micro/nanosecond determinism may be important - software running on general-purpose cores is not suitable for that. Dedicated peripherals can also be orders of magnitude more energy efficient than running a full core just to twiddle some pins.
I’ve got a project that uses 4 hardware serial modules, timers, an ADC, an event system, etc., all dedicated-function. Sure, they have their quirks, but once you’ve learnt them you can reuse a lot of the drivers across multiple products, especially for a given vendor.
Of course there is some cost, but it’s finding the balance for your product that is important.
> Dedicated peripherals can also be orders of magnitude more energy efficient than running a full core just to twiddle some pins.
This used to be true, but as fabrication shrinks, you first move to quasi-FSMs (like the PIO blocks) and eventually to mini processors, since those are smaller than the dedicated units of the previous generation. When you get the design a bit wrong you end up with the ESP32, where the lack of general computation in the peripherals radically bumps memory requirements, and with them power usage.
This trend also occurs in GPUs where functionality eventually gets merged into more uniform blocks to make room for newly conceived specialist units that have become viable.
No, still true - you’re never going to beat the determinism, size, and power of a few flip-flops and some logic to drive a common interface directly, compared to a full core with architectural state and memory. E.g., just to enter an interrupt is 10-15-odd cycles, then a memory access or two to set a pin, and then 10-15 cycles again to restore and exit.
Additionally, micros have to be much more robust electrically than a cutting-edge (or even 14 nm) CPU/GPU and available for extended (decade-long) timespans, so the economics driving the shrink are different.
Small, fast cores have eaten the lunch of, e.g., large dedicated DSP blocks for sure, but those are niche cases where volume is low, so eventually the hardware cost and the cost of developing on weird architectures add up to more than running a general-purpose core.
> No, still true - you’re never going to beat the determinism, size, and power of a few flip-flops and some logic to drive a common interface directly, compared to a full core with architectural state and memory.
But you must know what you intend to do when designing the MCU, and history shows (and some of the questioning here also shows) that this isn’t the case. As you point out, expected lifespans are long, so what is a designer to do?
The ESP32 case is interesting because it comes so close, to the point that I believe the RMT peripheral probably partly inspired the PIO, thanks to how widely it has been used for other things and where it breaks down.
The key weakness of the RMT is that it expects the data structures that control it to already be expanded in memory, almost certainly by the CPU. This means that to alter the data being sent out, the main app processor, the DMA, and the peripheral all have to be involved, and we are hammering the memory bus while doing it.
A similar thing occurs with almost any non-trivial SPI usage, where a lot of people end up building “big” (relatively) buffers in memory in advance.
Both of those situations are very common and bad. Assuming the tiny cores can have their own program memory, they will be no less deterministic than any other sort of peripheral while radically freeing up the central part of the system.
One of the main things I have learned over the years is that people wildly overstate the cost of computation and understate the cost of moving data around. If you can reduce the data a lot at the cost of a bit more computation, that is a big win.
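To make the memory cost concrete, here is roughly what feeding a WS2812 strip through the RMT involves. The item struct below only mirrors the general shape of the IDF's layout, so treat it as an illustration rather than a drop-in driver:

```c
#include <stdint.h>
#include <stddef.h>

/* Roughly the shape of the ESP-IDF rmt_item32_t: each 32-bit item encodes
 * one high/low pulse pair. Illustrative only -- check the IDF headers for
 * the real layout. */
typedef struct {
    uint32_t duration0 : 15;
    uint32_t level0    : 1;
    uint32_t duration1 : 15;
    uint32_t level1    : 1;
} rmt_item_t;

/* Expand a WS2812 pixel buffer into RMT items: one item per *bit*, so a
 * 300-LED strip costs 300 * 24 * 4 bytes = ~28 KB of RAM just to describe
 * the waveform the peripheral will clock out. */
static size_t ws2812_to_items(const uint8_t *grb, size_t len, rmt_item_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            int one = (grb[i] >> bit) & 1;
            /* Durations in RMT ticks; values assume a 0.1 us tick and are
             * only approximate WS2812 timings. */
            out[n].level0    = 1;
            out[n].duration0 = one ? 8 : 4;   /* ~0.8 us vs ~0.4 us high */
            out[n].level1    = 0;
            out[n].duration1 = one ? 4 : 8;   /* ~0.45 us vs ~0.85 us low */
            n++;
        }
    }
    return n;
}
```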
> But you must know what you intend to do when designing the MCU, and history shows (and some of the questioning here also shows) that this isn’t the case. As you point out, expected lifespans are long, so what is a designer to do?
Designers do know that UARTs, SPIs, I2C, timers, etc. will be around essentially forever. Anything new has to be so much faster/better, the competition being the status quo and its long tail, that you would lay down a dedicated block anyway.
I think we'll disagree, but I'm not convinced by many of the cases given here (usually DVI on an RP2040...) as you would just buy a slightly higher-spec and better-optimised system that has the interface already built in. Personal opinion: great fun to play with and definitely good to have a couple to handle niche interfaces (e.g. OneWire), but not for the majority of use cases.
> A similar thing occurs with almost any non-trivial SPI usage, where a lot of people end up building “big” (relatively) buffers in memory in advance.
This is neither here nor there for a "PIO" or a fixed function - there has to be state and data somewhere. I would rather allocate just what is needed for e.g. a UART (on my weapon of choice, that amounts to a heady 40 bits local to the peripheral, written once to configure it, overloaded with SPI and I2C functionality) and not trouble the memory bus other than for data (well said on data movement, it burns a lot and is harder to capture).
> Assuming the tiny cores can have their own program memory, they will be no less deterministic than any other sort of peripheral while radically freeing up the central part of the system.
Agreed, only if it's dedicated to a single function of course; otherwise you have access contention. And, of course, we already have radically freed up the central part of the system :P
If you have a programmable state machine that's waiting for a pin transition, it can easily do the thing it's waiting to do in the clock cycle after that transition. It doesn't have to enter an interrupt handler. That's how the GA144 and the RP2350 do their I/O. Padauk chips have a second hardware thread and deterministically context switch every cycle, so the response latency is still less than 10–15 cycles, like 1–2. I think old ARM FIQ state also effectively works this way, switching register banks on interrupt so no time is needed to save registers on interrupt entry, and I think the original Z80 (RIP this year) also has this feature. Some RISC-V cores (CH32V003?) also have it.
An alternate register bank for the main CPU is bigger than a PWM timer peripheral or an SPI peripheral, sure, but you can program it to do things you didn't think of before tapeout.
Making I/O properly programmable actually reduces complexity, because you can put more of the customizability on the other side of the interface, making things much simpler overall. I2C, for example, is a terrible interface in many ways, but one of the biggest is that it creates a very complex and low-latency interface between the hardware and the software, and it's often easier to bitbang than to use the dedicated peripherals, especially the buggier ones. Running it on a small dedicated core means you can deal with it much more sensibly than trying to wrangle a hard peripheral which can't make enough assumptions about your use-case to give a good interface.
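As a concrete example, a bit-banged I2C byte write is only a handful of lines once you have open-drain pin helpers. The helpers below (sda_low/sda_release/scl_low/scl_release/read_sda, plus a half-bit delay) are hypothetical, and clock stretching and arbitration are deliberately ignored:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical open-drain helpers: "low" drives the line to 0, "release"
 * lets the pull-up take it high. */
extern void sda_low(void);
extern void sda_release(void);
extern void scl_low(void);
extern void scl_release(void);
extern bool read_sda(void);
extern void half_bit_delay(void);

void i2c_start(void)
{
    sda_release(); scl_release(); half_bit_delay();
    sda_low();     half_bit_delay();     /* SDA falls while SCL is high */
    scl_low();
}

void i2c_stop(void)
{
    sda_low();     half_bit_delay();
    scl_release(); half_bit_delay();
    sda_release(); half_bit_delay();     /* SDA rises while SCL is high */
}

bool i2c_write_byte(uint8_t b)
{
    for (int i = 7; i >= 0; i--) {
        if (b & (1u << i)) sda_release(); else sda_low();
        half_bit_delay();
        scl_release(); half_bit_delay(); /* target samples SDA while SCL high */
        scl_low();
    }
    sda_release();                       /* hand SDA to the target for ACK */
    half_bit_delay();
    scl_release(); half_bit_delay();
    bool ack = !read_sda();              /* ACK = target pulls SDA low */
    scl_low();
    return ack;
}
```

Run exactly this on a small dedicated core and you get the timing guarantees of a hard peripheral, but the protocol quirks (repeated start, a particular sensor's weird ACK behaviour) stay in software where you can see and change them.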
The article goes into more detail than it strictly needs to because the purpose is educational. However, a lot of what it's presenting is simplified interfaces and relevant details rather than the true complexity of the whole.
Modern hardware is just fundamentally complex, especially if you want to make full use of the particular features of each platform.