Zephyr has extensive built-in support for a multiverse of microcontrollers, development boards, and sensors. This is possible because of an abstraction layer that allows anyone to hook their own devices into the system. However, there are a few bits of core knowledge you need to get everything working just right. Let’s work our way through those and discuss how to write a Zephyr device driver!

Overview of Zephyr Device Drivers

You will want to implement most of these pieces to get your device driver up and running:

  • Devicetree binding to define a “compatible” for your device
  • Kconfig symbol to include or exclude the driver from the build
  • Power management support to move between power modes (on, standby, sleep, etc.)
  • A data structure for per-instance data storage
  • An API so user applications may access the driver

None of this is particularly complex, but as a whole, it can be daunting to figure out where to start and how to troubleshoot when something isn’t working correctly.

In preparation for this post, I converted the Golioth Ostentus library (libostentus) into a proper Zephyr device driver. Ostentus is an open source hardware faceplate that adds a user interface to an embedded project using i2c. While it certainly worked before this change, we get a few nice bonuses by making it a driver:

  • The device is now added as a devicetree node
  • The library is automatically selected when a devicetree node is present
  • The driver will automatically initialize the device before the application begins running
  • Multiple instances of the device may now be included in a single build

Let’s dig in!

Optional Prerequisite: How to Write a Zephyr Module

Zephyr device drivers may be included in your application directory. But in this case, we want to use the Ostentus in numerous Zephyr projects. To accomplish this, we need to make the driver a Zephyr Module. This means it will live in its own git repository and be included in the west manifest of projects that use it.

I’ve previously written about this process. If you need a refresher, check out our post on How to Turn Helper Code into a Zephyr Module.

Tree Overview

.
├── CMakeLists.txt
├── dts
│   └── bindings
│       ├── golioth,ostentus.yaml
│       └── vendor-prefixes.txt
├── include
│   └── libostentus.h
├── Kconfig
├── libostentus.c
└── zephyr
    └── module.yml

We’ll be jumping back and forth through files during this post. For your reference, this tree contains all the files we’ll touch along the way.

1. Create the Binding and add it to the Zephyr Module

Ostentus is an i2c device with no other special considerations. That makes the binding really simple because we just need to include the default i2c binding.

golioth	Golioth

description: "Golioth Ostentus Faceplate"

compatible: "golioth,ostentus"

include: [i2c-device.yaml]

The directory structure includes two files in dts/bindings. Since Golioth isn’t in Zephyr’s existing list of hardware vendors, it’s added to the vendor-prefixes.txt file. (Note that the syntax for this file requires a tab character between the vendor prefix and the vendor name.)

The binding itself uses the <vendor>,<device>.yaml naming convention. That file defines the golioth,ostentus compatible and (as already mentioned) includes the existing i2c device binding.

One place I struggled was in getting Zephyr to properly ingest this binding. Because this is a module, we need to specify a dts_root in zephyr/module.yml so that it will look for our dts directory:

build:
  cmake: .
  kconfig: Kconfig
  settings:
    dts_root: .

2. Set Kconfig to Automatically Enable the Driver

You’ll know your changes made in step 1 are working because a project built with a golioth,ostentus compatible in the devicetree will result in the following Kconfig symbol in build/zephyr/.config:

CONFIG_DT_HAS_ARM_V8M_NVIC_ENABLED=y
CONFIG_DT_HAS_FIXED_PARTITIONS_ENABLED=y
CONFIG_DT_HAS_GOLIOTH_OSTENTUS_ENABLED=y
CONFIG_DT_HAS_GPIO_KEYS_ENABLED=y
CONFIG_DT_HAS_GPIO_LEDS_ENABLED=y

Neat, right? The symbol appears automatically, based on the CONFIG_DT_HAS_<VENDOR>_<DEVICENAME>_ENABLED syntax from the compatible that was defined. This is useful because we can depend upon it to add the library to the build.

menuconfig LIB_OSTENTUS
    bool "Enable the driver library for the Golioth Ostentus faceplate"
    default y
    depends on DT_HAS_GOLIOTH_OSTENTUS_ENABLED
    select I2C
    help
      Helper functions for controlling the Golioth Ostentus faceplate.
      Features include controlling LEDs, adding slides and slide data,
      enabling slideshows, etc.

if LIB_OSTENTUS

config OSTENTUS_INIT_PRIORITY
    int "Ostentus init priority"
    default 90
    help
      Ostentus initialization priority.

config OSTENTUS_LOG_LEVEL
    int "Default log level for libostentus"
    default 4
    help
        The default log level, which is used to filter log messages.

        0: None
        1: Error
        2: Warn
        3: Info
        4: Debug
        5: Verbose

endif #LIB_OSTENTUS

This Kconfig file adds the LIB_OSTENTUS symbol, but only if DT_HAS_GOLIOTH_OSTENTUS_ENABLED is present. In this case, the library symbol was added as a menu with two additional symbols used to set the log level and the initialization priority.

3. Define Typedefs and Custom Device API

Now the real work begins.

This section is all about writing a custom device API. If all you’re after is adding your own sensor to Zephyr, you can pretty much skip this section because all the work has been done for you in include/zephyr/drivers/sensor.h. That’s just one in-tree API you can choose from, so if any of them fit your needs please use one of those.

The Ostentus doesn’t fit into any of the existing APIs so we need to create our own. This happens in a header file named for your driver and placed in the include directory of your driver repository. First, make a typedef that reflects the parameter fingerprint of all the functions you want to call as part of your driver.

typedef int (*ostentus_cmd_t)(const struct device *dev);
typedef int (*ostentus_setval_8_t)(const struct device *dev, uint8_t val);

Now use those typedefs to declare your API.

__subsystem struct ostentus_driver_api {
    ostentus_cmd_t ostentus_clear_memory;
    ostentus_setval_8_t ostentus_led_power_set;
};

This prepares a driver API for use when we define the device instances. In reality there are a couple dozen functions in our actual API that use less than a dozen typedefs. Here’s the relevant code if you’re interested in seeing everything.

4. Define Syscalls and Inline Functions

Now that we have an API, we need inline functions that will call the functions associated with that API.

You have a choice to make these regular functions, or syscall functions. Zephyr offers a User Mode which sandboxes the application. If you want your driver to work for User Mode applications, you need to implement them as syscalls.

__syscall int ostentus_clear_memory(const struct device *dev);

static inline int z_impl_ostentus_clear_memory(const struct device *dev)
{
    const struct ostentus_driver_api *api = (const struct ostentus_driver_api *)dev->api;
    if (api->ostentus_clear_memory == NULL) {
        return -ENOSYS;
    }
    return api->ostentus_clear_memory(dev);
}

__syscall int ostentus_led_power_set(const struct device *dev, uint8_t state);

static inline int z_impl_ostentus_led_power_set(const struct device *dev, uint8_t state)
{
    const struct ostentus_driver_api *api = (const struct ostentus_driver_api *)dev->api;
    if (api->ostentus_led_power_set == NULL) {
        return -ENOSYS;
    }
    return api->ostentus_led_power_set(dev, state);
}

The __syscall directive is used in the function prototype; the function definition then prefixes the name of the API call with z_impl_. Note that the purpose of this inline function is to check that a function was assigned to this API call (we’ll do that in step 5 below) before passing the parameters to that function.
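The dispatch pattern is easy to try outside of Zephyr. Here is a minimal sketch that builds on a host machine. Note that struct demo_device, struct demo_api, and demo_clear_memory() are hypothetical stand-ins for Zephyr’s struct device, the driver API struct, and the z_impl_ wrapper; none of them are part of libostentus or Zephyr.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for Zephyr's struct device (only the api pointer). */
struct demo_device {
    const void *api;
};

typedef int (*demo_cmd_t)(const struct demo_device *dev);

/* Hypothetical API table with a single call. */
struct demo_api {
    demo_cmd_t clear_memory;
};

/* Wrapper: returns -ENOSYS when the driver left the slot unassigned. */
static inline int demo_clear_memory(const struct demo_device *dev)
{
    const struct demo_api *api = (const struct demo_api *)dev->api;

    if (api->clear_memory == NULL) {
        return -ENOSYS;
    }
    return api->clear_memory(dev);
}

/* A trivial implementation a driver might assign to the table. */
static int demo_clear_impl(const struct demo_device *dev)
{
    (void)dev;
    return 0;
}
```

A device whose table has the slot filled in returns whatever the implementation returns, while a device with an empty slot fails cleanly with -ENOSYS instead of jumping through a NULL pointer.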

To finish setting up the syscalls we need to add a special include to the end of this file. That include uses the #include <syscalls/[NameOfThisHeaderFile]> format. We also need to tell CMake that this file uses syscalls.

#include <syscalls/libostentus.h>
zephyr_syscall_header(${ZEPHYR_LIBOSTENTUS_MODULE_DIR}/include/libostentus.h)

Once again, there are far more functions defined in the actual driver, which you can see for yourself by viewing the actual header file.

5. Implement the Driver Functions and Assign to the API

Technically, the header file we created in steps 3 and 4 is a generic API that may be reused by any number of different devices. Now we can implement one such device. We’ll use the libostentus.c file to write the device-specific functions, then assign them to our API calls.

#include <libostentus.h>
#include <libostentus_regmap.h>
#include <zephyr/drivers/i2c.h>

static int ostentus_i2c_write2(const struct device *dev, uint8_t reg, uint8_t *data1,
                   uint8_t data1_len, uint8_t *data2, uint8_t data2_len)
{
    const struct ostentus_config *config = dev->config;

    struct i2c_msg msgs[] = {
        {
            .buf = &reg,
            .len = 1,
            .flags = I2C_MSG_WRITE,
        },
        {
            .buf = data1,
            .len = data1_len,
            .flags = I2C_MSG_WRITE,
        },
        {
            .buf = data2,
            .len = data2_len,
            .flags = I2C_MSG_WRITE | I2C_MSG_STOP,
        },
    };
    uint8_t num_msgs = ARRAY_SIZE(msgs);

    /* Detect how many i2c messages there are and which is the last one */
    for (int i = 1; i < ARRAY_SIZE(msgs); i++) {
        if (!msgs[i].len) {
            msgs[i - 1].flags |= I2C_MSG_STOP;
            num_msgs = i;
            /* Stop at the first empty message so later ones can't overwrite num_msgs */
            break;
        }
    }

    return i2c_transfer_dt(&config->i2c, msgs, num_msgs);
}

static int ostentus_i2c_write1(const struct device *dev, uint8_t reg, uint8_t *data,
                   uint8_t data_len)
{
    return ostentus_i2c_write2(dev, reg, data, data_len, NULL, 0);
}
static int ostentus_i2c_write0(const struct device *dev, uint8_t reg)
{
    return ostentus_i2c_write2(dev, reg, NULL, 0, NULL, 0);
}

static int clear_memory(const struct device *dev)
{
    return ostentus_i2c_write0(dev, OSTENTUS_CLEAR_MEM);
}
static int led_power_set(const struct device *dev, uint8_t state)
{
    uint8_t byte = state ? 1 : 0;
    return ostentus_i2c_write1(dev, OSTENTUS_LED_POW, &byte, 1);
}

This file begins with three functions that handle writing to the device using i2c that aren’t defined in the API. The two functions at the bottom of the file receive device structs (and all other parameters) in a way that matches the typedefs created in step 3. These two functions pass the device struct to the i2c functions to communicate with the device.
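The step in ostentus_i2c_write2 that decides how many messages to send can be exercised on its own. This is a sketch under a few assumptions: struct demo_msg and DEMO_MSG_STOP are hypothetical stand-ins for Zephyr’s struct i2c_msg and I2C_MSG_STOP (the flag value here is made up), and the helper stops at the first empty message.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MSG_STOP 0x02u /* hypothetical flag value, for illustration only */

/* Hypothetical stand-in for Zephyr's struct i2c_msg. */
struct demo_msg {
    const uint8_t *buf;
    uint32_t len;
    uint8_t flags;
};

/*
 * Return how many leading messages carry data. msgs[0] (the register byte)
 * is always sent, so count must be at least 1. The last message that will
 * be sent gets the stop flag.
 */
static size_t demo_trim_msgs(struct demo_msg *msgs, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (msgs[i].len == 0) {
            msgs[i - 1].flags |= DEMO_MSG_STOP;
            return i;
        }
    }
    msgs[count - 1].flags |= DEMO_MSG_STOP;
    return count;
}
```

With a register byte plus one data buffer the helper reports two messages, and with a register byte alone it reports one. That matches what the write1 and write0 wrappers need.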

Now it’s time to associate these functions with the API.

static const struct ostentus_driver_api ostentus_api = {
    .ostentus_clear_memory = &clear_memory,
    .ostentus_led_power_set = &led_power_set,
};

Once again, this is a greatly simplified version of the actual API definition.

6. Define the Device Instances

There’s a lot happening in this step but we’re almost done! To tie everything together we must declare a driver compatible and handle the data, configuration, power management, and initialization. All of these parts are tied together with a bit of “macrobatics”.

Declare a Driver Compat

This is incredibly important and easy to miss. Declare a driver compatible that matches your devicetree binding in your C file:

#define DT_DRV_COMPAT golioth_ostentus

Device Data

This device has no need for persistent data. We could do something like store the firmware version the Ostentus faceplate reports, but that can just be read and printed during initialization with no need for storage.

To learn more about handling data, check out any of the sensor drivers in the Zephyr tree for data struct and data initialization.

Power Management

We have not yet implemented power management for this device. Future work might include sending a command that puts the Ostentus in sleep mode, and another to wake it up again.

Examples of power management are available in the Zephyr tree sensor drivers.

Configuration

Configuration info is basically a context for each device instance. This is where the driver will store the i2c bus and address info for Ostentus. The struct is defined in the driver header file.

struct ostentus_config {
    struct i2c_dt_spec i2c;
};

Initialization

The driver will automatically initialize the device, but we must supply the initialization function.

static int ostentus_init(const struct device *dev)
{
    const struct ostentus_config *config = dev->config;

    if (!device_is_ready(config->i2c.bus)) {
        LOG_ERR("I2C bus device not ready");
        return -ENODEV;
    }

    char buf[32];
    int err = version_get(dev, buf, 32);
    if (err) {
        LOG_ERR("Unable to communicate with Ostentus over i2c: %d", err);
        return err;
    } else {
        LOG_INF("Ostentus firmware version: %s", buf);
    }

    return 0;
}

This function gets the i2c bus from the associated config struct and tests to make sure everything is kosher. It then reads and logs the firmware version from the device.

Macros for Device Instances

Now use macros to tie everything together at the bottom of the C file.

#define OSTENTUS_DEFINE(inst)                                      \
    static const struct ostentus_config ostentus_config_##inst = { \
        .i2c = I2C_DT_SPEC_INST_GET(inst),                         \
    };                                                             \
                                                                   \
    DEVICE_DT_INST_DEFINE(inst,                                    \
                  ostentus_init,                                   \
                  NULL,                                            \
                  NULL,                                            \
                  &ostentus_config_##inst,                         \
                  POST_KERNEL,                                     \
                  CONFIG_OSTENTUS_INIT_PRIORITY,                   \
                  &ostentus_api);

DT_INST_FOREACH_STATUS_OKAY(OSTENTUS_DEFINE)

We define a macro that populates the member of the config struct using devicetree information. (This would also be where you would populate data and power management if you have them.)

The DEVICE_DT_INST_DEFINE macro takes, in order, the instance number, the init function, power management (NULL), the data struct (NULL), the config struct, the initialization level, the initialization priority, and the address of the API struct. The final macro calls our mega-macro once for each enabled instance of the device found in the devicetree.
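If the macro expansion feels opaque, the same trick can be reproduced in plain C. This is only a sketch: DEMO_FOREACH_INST is a hypothetical stand-in for DT_INST_FOREACH_STATUS_OKAY hard-coded to two instances, and DEMO_ADDR invents addresses that Zephyr would really pull from the devicetree.

```c
#include <stdint.h>

/* Hypothetical per-instance config, standing in for struct ostentus_config. */
struct demo_config {
    uint8_t addr;
};

/* Hypothetical address lookup; in Zephyr this comes from devicetree. */
#define DEMO_ADDR(inst) (0x12 + (inst))

/* Stand-in for DT_INST_FOREACH_STATUS_OKAY with two enabled instances. */
#define DEMO_FOREACH_INST(fn) fn(0) fn(1)

/* Token pasting (##) gives every instance its own uniquely named struct. */
#define DEMO_DEFINE(inst)                                   \
    static const struct demo_config demo_config_##inst = { \
        .addr = DEMO_ADDR(inst),                            \
    };

DEMO_FOREACH_INST(DEMO_DEFINE)
```

After preprocessing there are two independent structs, demo_config_0 and demo_config_1. This is the same mechanism that lets multiple Ostentus instances live in one build.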

Using Your Device Driver

So, how do you use this whole thing? It’s very similar to using a sensor in Zephyr. In our case we need to first include the module in west.yml.

manifest:
  projects: 
    - name: libostentus
      path: deps/modules/lib/libostentus
      revision: v2.0.0
      url: https://github.com/golioth/libostentus

Add an instance of Ostentus to the devicetree.

&i2c2 {
    /* Needed for I2C writes used by libostentus */
    zephyr,concat-buf-size = <48>;

    ostentus@12 {
        status = "okay";
        compatible = "golioth,ostentus";
        reg = <0x12>;
    };
};

And then interact with the device in your application:

#include <libostentus.h>

static const struct device *ostentus = DEVICE_DT_GET_ANY(golioth_ostentus);

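static int some_function(void)
{
    int err;

    /* Stop at the first failure so the caller sees the error code */
    err = ostentus_clear_memory(ostentus);
    if (err) {
        return err;
    }
    return ostentus_led_power_set(ostentus, 1);
}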

Going Deeper

There’s a lot here to digest. While this is a nice walkthrough, the full code is worth your review. All Golioth hardware is open source and that includes the libostentus driver library used as the example in this post.

In 2022 I attended a fantastic talk on custom drivers presented by Gerard Marull Paretas at the Zephyr Developer’s Summit. You can watch the talk recording and also peruse the sample code from that talk. I’d like to extend a personal thank you to Gerard for such an excellent presentation!

What are you building? We’d love to hear about the devices for which you’re creating drivers. Start a thread in the Golioth Forum to share the progress of your work!

Golioth’s own Dan Mangum presented a talk at this year’s Embedded Open Source Summit detailing how to use WebAssembly with Zephyr RTOS. For those unfamiliar with WebAssembly, it was conceived as a replacement for JavaScript. So what is it doing in microcontrollers? Dan takes on that question, and covers how to validate whether Wasm on Zephyr is a viable solution for you.

What is WebAssembly?

WebAssembly–aka Wasm–is a portable binary format that can be executed on myriad different systems and architectures. Platforms that support Wasm have a runtime that makes execution possible and this is the case for Zephyr.

The WebAssembly Micro Runtime (wamr for those in the know) already has a Zephyr port that you can try out right now. Wamr delivers a runtime optimized for embedded systems that sandboxes the Wasm code it is running.

Just build the runtime into your firmware, then supply a new Wasm binary whenever you want to change how that part of the application works. You now have a way to update programs in a safe way without a full firmware update and even without rebooting the hardware.

Why Use Wasm with Zephyr (or any embedded system)?

Dan spends the first half of his talk discussing the criteria used to evaluate tradeoffs in play with WebAssembly. You’re always going to use more resources and take a speed hit compared to native code, that’s no surprise. But especially in cases where dynamic code execution is needed, Wasm checks a lot of boxes like portability and security.

The demonstration implements a temperature threshold mechanism that triggers an alert when readings rise above a certain level. This is basically a hello-world example that shows how native firmware can pass a primitive value into the runtime, and the Wasm code can call native functions (a high-temperature alert log message).
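The code from the talk isn’t reproduced here, but the shape of that threshold logic can be sketched in a few lines of C, which is one of the languages that can be compiled to Wasm. Everything below is hypothetical: the threshold value, the function names, and demo_native_alert(), which stands in for a native function the host firmware would expose to the module.

```c
#include <stdint.h>

/* Hypothetical threshold in degrees Celsius; not the value from the talk. */
#define DEMO_TEMP_THRESHOLD_C 30

static int demo_alert_count;

/* Stand-in for a native function the host would expose to the Wasm module. */
static void demo_native_alert(int32_t temp_c)
{
    (void)temp_c;
    demo_alert_count++;
}

/* Logic that would live in the Wasm module: called with each new reading. */
static int demo_on_reading(int32_t temp_c)
{
    if (temp_c > DEMO_TEMP_THRESHOLD_C) {
        demo_native_alert(temp_c);
        return 1;
    }
    return 0;
}
```

The native side only passes in a primitive value and provides the alert hook, so the decision logic can be swapped out without touching the rest of the firmware.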

But the secret sauce is the Wasm binary itself. You could implement complex algorithmic processing and change that algorithm without a full OTA firmware update. In fact, Dan’s just passing the Wasm binary as a base64-encoded string and restarting the runtime without rebooting the microcontroller. This is done using the Golioth Settings service so it’s available to the entire fleet, but targetable by device or groups of devices.
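To make the base64 step concrete, here is a small sketch of what a device might do with such a string before handing it to the runtime: decode it and check for the Wasm magic bytes. This is not the code from the demo. The function names are hypothetical and the decoder only handles standard, padded base64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not valid. */
static int demo_b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Decode standard, padded base64 into out. Returns the number of bytes
 * written, or -1 on malformed input or if out is too small.
 */
static int demo_b64_decode(const char *in, uint8_t *out, size_t out_cap)
{
    size_t len = strlen(in);
    size_t n = 0;

    if (len % 4 != 0) {
        return -1;
    }
    for (size_t i = 0; i < len; i += 4) {
        uint32_t word = 0;
        int pad = 0;

        for (int k = 0; k < 4; k++) {
            char c = in[i + k];
            int v = 0;

            if (c == '=') {
                /* Padding may only fill the tail of the final group. */
                if (k < 2 || i + 4 != len || (k == 2 && in[i + 3] != '=')) {
                    return -1;
                }
                pad++;
            } else {
                v = demo_b64_value(c);
                if (v < 0) {
                    return -1;
                }
            }
            word = (word << 6) | (uint32_t)v;
        }
        /* Each 24-bit group yields up to three bytes, fewer when padded. */
        for (int k = 0; k < 3 - pad; k++) {
            if (n >= out_cap) {
                return -1;
            }
            out[n++] = (uint8_t)(word >> (16 - 8 * k));
        }
    }
    return (int)n;
}

/* A Wasm binary starts with the 4-byte magic "\0asm". */
static int demo_is_wasm(const uint8_t *buf, size_t len)
{
    static const uint8_t magic[4] = { 0x00, 0x61, 0x73, 0x6d };

    return len >= 4 && memcmp(buf, magic, 4) == 0;
}
```

Decoding the string "AGFzbQEAAAA=" yields the eight-byte header of an otherwise empty Wasm module (the magic followed by version 1), which makes it a handy smoke test.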

However, this real time update ability is not the only trick Wasm can pull off.

The Portability of WebAssembly

Sure, it’s very cool to be able to perform a bit of brain surgery on your firmware by loading a new Wasm binary. What boggles the mind is the ability to run that binary on just about any platform imaginable.

A typical IoT installation that uses Golioth has embedded devices in the field, a server with which those devices interact (authentication, data routing, control, etc), and a cloud component to use the data and issue directives to the fleet. Your Wasm binary can be moved and executed on a different part of this system depending on need. While the demo is first run on a Nordic nRF52840 microcontroller, the same binary is shown running on the cloud, and inside of a browser.

Whether during initial development, or to meet changing device constraints or customer needs, sliding the compute from one place to another without major engineering work is a pretty interesting tool to have in your arsenal.

A Wasm Deep Dive

The proof of concept is already there for you to build your own Zephyr-based Wasm experiments. We hope you’ll give Golioth a try for deploying the binary updates to your devices.

For those who want to dive deeper into the world of WebAssembly, Dan’s been busy in that area. Check out his post on Understanding Every Byte in a WASM Module.

Golioth is expanding its Reference Design portfolio by adding an OpenThread Demo, a Reference Design based on our known and well-tested Reference Design Template. The purpose of the OpenThread Demo is to add Thread networking capability to the RD Template so anyone using Thread and Golioth can start development immediately, use it as a basis for their project, and take full advantage of Golioth’s Device Management, Data Routing, and Application Service capabilities.

Thread Recap

Thread is an IPv6-based networking protocol designed for low-power Internet of Things devices. It uses the IEEE 802.15.4 mesh network as the foundation for providing reliable message transmission between individual Thread Devices at the link level. The 6LoWPAN network layer sits on top of 802.15.4, created to apply Internet Protocol (IP) to smaller devices. In almost all cases, it’s used to transmit IPv6 Packets.

If you need a network of devices that can communicate with each other and connect to the Internet securely, Thread might be the solution you’re looking for.

Build it yourself

The follow-along guide shows how to build your own OpenThread Demo using widely available off-the-shelf components from our partners. We call this Follow-Along Hardware, and we think it’s one of the quickest and easiest ways to start building an IoT proof-of-concept with Golioth.

Hardware

Every mesh network needs some hardware, and for the OpenThread Demo, you will need a Thread Border Router and a Thread node. This demo doesn’t need additional sensors or an actuator, as the code in the Reference Design Template generates simulated values. Later you can modify our other Reference Designs and their hardware to get to a prototype or production device that is more specific to a vertical like Air Quality Monitoring or DC Power Monitoring.

Border Router

A Thread Border Router connects a Thread network to other IP-based networks, such as Wi-Fi or Ethernet, and it configures a Thread network for external connectivity. It also forwards information between a Thread network and a non-Thread network (from Thread nodes to the Internet). The Border Router should be completely invisible to Thread Devices, much like a Wi-Fi router is in a home or corporate network.

In this demo, we use a commercially available GL-S200 Thread Border Router designed for users to host and manage low-power and reliable IoT mesh networks.

GL-S200 provides a simple Admin Panel UI to configure the Border Router and a Topology Graph to see all the end node devices and their relationship. As a bonus, it also does NAT64 translation between IPv6 and IPv4, making it a real plug-and-play solution.

 

Thread Node

Now that the centerpiece of our Thread network is sorted, the next part is a Thread node. In the follow-along guide, we built a Thread node based on the nRF52840 DK. The node is built using Zephyr, and the OpenThread stack will be compiled into it. The GitHub repository used in the guide is open source, so you can build the application yourself, or you can use the pre-built images for the nRF52840 DK or Adafruit Feather nRF52840.

Firmware

Thread node firmware is based on the Reference Design Template, a starting point for all our Reference Designs. With all Golioth features implemented in their basic form, you can now use Device Management, Data Routing, and Application Services with Thread network connectivity.

OTA Updates

Adding Thread support to a device is not cheap, memory-wise. The firmware image is larger than 500 kB, and the on-chip flash of the nRF52840 DK has a size of 1 MB. Luckily, both the nRF52840 DK and the Adafruit Feather have an external flash chip, making the OTA updates possible. Any custom hardware you create in the future should also follow this model of having external flash mapped to the nRF52840.

To create a secondary partition for MCUBoot in an external flash, we must first enable it in the nrf52840dk_nrf52840.overlay file:

/ { 
    chosen { 
        nordic,pm-ext-flash = &mx25r64; 
    };
};

The CONFIG_PM_EXTERNAL_FLASH_MCUBOOT_SECONDARY Kconfig option is set by default to place the secondary partition of MCUboot in the external flash instead of the internal flash (this option should only be enabled in the parent image).

To pass the image-specific variables (devicetree overlay file and Kconfig symbols) to the MCUboot child image, we need to create a child_image folder in which we update the CONFIG_BOOT_MAX_IMG_SECTORS Kconfig option. This option defines the maximum number of image sectors MCUboot can handle, as MCUboot typically increases slot sizes when external flash is enabled. Otherwise, it defaults to the value used for internal flash, and the application may not boot if the value is set too low. In our case, we updated it to 256 in the child_image/mcuboot/boards/nrf52840dk_nrf52840.conf file.

CONFIG_BOOT_MAX_IMG_SECTORS=256

Connecting to Golioth Cloud

Thread nodes utilize IPv6 address space, and the question is how to communicate with IPv4 hosts, such as Golioth Cloud.

Golioth Cloud has an IPv4 address, and the Thread node needs to synthesize the server’s IPv6 address in order to connect to it. OpenThread doesn’t use the NAT64 well-known prefix 64:ff9b::/96; instead, Thread Border Routers publish their dynamically generated NAT64 prefix used by the NAT64 translator in the Thread Network Data. Thread nodes must obtain this NAT64 prefix and synthesize the IPv6 addresses.
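The synthesis itself is simple once the prefix is known. With a /96 prefix, RFC 6052 places the prefix in the first 12 bytes of the IPv6 address and the IPv4 address in the last 4. Here is a minimal sketch; the function name is hypothetical, and obtaining the prefix from the Thread Network Data is left out.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build an IPv6 address from a 96-bit NAT64 prefix and an IPv4 address,
 * following the /96 layout in RFC 6052: prefix in bytes 0-11, IPv4 in 12-15.
 */
static void demo_nat64_synthesize(const uint8_t prefix96[12],
                                  const uint8_t ipv4[4],
                                  uint8_t ipv6_out[16])
{
    memcpy(ipv6_out, prefix96, 12);
    memcpy(ipv6_out + 12, ipv4, 4);
}
```

For example, combining a prefix with 8.8.8.8 yields an address whose final four bytes are 08 08 08 08, which the Border Router’s NAT64 translator maps back to the IPv4 host.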

While the process of synthesizing IPv6 addresses is automatically handled in the OpenThread CLI when using the Zephyr shell and pinging an IPv4 address (e.g. ot ping 8.8.8.8), it’s important to note that this process needs to be specifically implemented in applications.

As part of the Firmware SDK, the Golioth IPv6 address is automatically synthesized from the CONFIG_GOLIOTH_COAP_HOST_URI Kconfig symbol using the advertised NAT64 prefix by leveraging the OpenThread DNS. Even if the Golioth host URI changes within the SDK, you won’t need to change your application.

Learn more

For detailed information about the OpenThread Demo, check out the project page! Additionally, you can drop us a note on our Forum if you have questions about this design. If you would like a demo of this reference design, contact [email protected].

 

Embedded systems, like any software system, benefit from modularizing software components, especially as they approach production. In this talk at the Embedded Open Source Summit 2024, Golioth Firmware Lead Sam Friedman talks about how to create “microservices” for microcontrollers. He maps a popular web concept onto existing software characteristics in Zephyr and shows how a real-world example can benefit from truly modular software.

Mapping a web concept to the microcontroller realm

As Sam points out early in this talk, it’s not really about microservices, because that’s a web concept. A microservice on the web is a piece of software, normally deployed onto cloud infrastructure, that can stand alone. It has defined inputs and outputs (APIs) and can operate independent of any other microservice. This helps for scalability and testing, but is a general trend in web software and deploying applications.

Microcontrollers are smaller and traditionally operate more like a “monolith” (another web term) because everything is interconnected. But there are concepts like Inter-Process Communication (IPC), which allows constrained devices to have similar ideas. IPC is a computer science idea that helps to optimize communication inside of operating systems. As it so happens, Zephyr is a (real time) operating system. Let’s look at what these are in practice.

How firmware developers can benefit

Sam describes how the concepts of Tasks, IPC, and Event Tasks are defined and might be used. But it is the Zephyr analogs that highlight familiar features, like the relatively new ZBus methodology. If a user adds a listener on the ZBus, they can listen (subscribe) for a particular value (topic) on the bus and take action based on it. This helps to make the overall system more modular, because the addition or removal of a feature is not deeply integrated between elements of the system. Instead, the new piece of code is reacting to data put on the bus, which reduces interdependency and makes each piece easier to test.

Real-World Example

Sam drives home his point by talking about a Golioth Reference Design like the Cold Chain Asset Tracker and how we can add capabilities like an onboard alarm when we hit a temperature threshold. Previously, this would have required refactoring to also send data from the sensor process to a new module that contains the alarm code. But with something like ZBus, the alarm can simply listen for a topic on ZBus and when the temperature module publishes to that topic, all relevant parties are updated.
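The decoupling is easier to see in code. The following is a plain C sketch of the publish/subscribe idea, and it is deliberately not the Zephyr ZBus API: struct demo_channel, demo_subscribe(), demo_publish(), and the alarm limit are all hypothetical.

```c
#include <stddef.h>

#define DEMO_MAX_LISTENERS 4

typedef void (*demo_listener_t)(int temp_c);

/* Hypothetical single-topic channel: a fixed list of listener callbacks. */
struct demo_channel {
    demo_listener_t listeners[DEMO_MAX_LISTENERS];
    size_t count;
};

/* Register a listener; returns 0 on success, -1 if the channel is full. */
static int demo_subscribe(struct demo_channel *ch, demo_listener_t fn)
{
    if (ch->count >= DEMO_MAX_LISTENERS) {
        return -1;
    }
    ch->listeners[ch->count++] = fn;
    return 0;
}

/* Deliver a reading to every listener; the publisher knows none of them. */
static void demo_publish(const struct demo_channel *ch, int temp_c)
{
    for (size_t i = 0; i < ch->count; i++) {
        ch->listeners[i](temp_c);
    }
}

static int demo_alarm_active;

/* Alarm module: added without touching the code that publishes readings. */
static void demo_alarm_listener(int temp_c)
{
    demo_alarm_active = (temp_c > 8); /* hypothetical cold-chain limit, deg C */
}
```

The temperature module only ever calls demo_publish(). Adding the alarm is a single demo_subscribe() call, and removing it again leaves the publisher untouched.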

This works in the opposite direction as well. Code written with this in mind would not break any future builds if a hardware cost-down removed an element like a front panel display. Instead, the user chooses not to build in that portion of the code (memory savings, yay!) and other parts of the code are not negatively impacted.

Bringing together the Cloud and Embedded Developers

Sam’s talk showcases what Golioth does well: match up the capabilities of the Cloud with the capabilities of an embedded system. Often many of the key ideas from computer science are more onerous to implement on a constrained system like a microcontroller, but Zephyr’s growing software toolbox makes it easier than ever to build a modular, testable system. Check out Sam’s talk above and his slides below for more context into how to build such a system.

 

There are 512 supported boards (according to find -name board.yml | wc -l) already in the Zephyr tree. Most of them are real hardware platforms and the remaining ones are virtual. Why would you bother with a virtual platform? Zephyr can probably build for the SoC or development board of your choice, right? In this post, I’m going to talk about the reasons you want to try out Native Simulator.

Spoiler: Your Zephyr application development time will drop through the floor.

Zephyr support for virtual platforms

Zephyr comes with support for various virtual platforms, but two of them are most widely used:

  • QEMU
  • Native Simulator

Both are extensively used in Zephyr Continuous Integration pipelines as well as during development by Zephyr users.

QEMU

QEMU is a generic machine emulator. It emulates CPUs by interpreting architecture-specific instructions as well as some peripherals like UART, flash, and networking adapters. Its main advantage is that the binary (compiled code) running on QEMU is very similar to the binary that runs on real hardware. All the low-level instructions, memory-mapped peripheral access, constrained RAM, thread context switching, thread stack sizes, interrupt handling, step-debugging with GDB, and many other mechanisms behave almost the same as on a real microcontroller.

Networking with QEMU can be achieved by setting up a TUN/TAP interface on a Linux host system. Once set up, you attach to the emulated network adapter that is handled by Zephyr drivers. The application is built with Zephyr and has access to the same network as the host machine (like a Linux laptop). After correctly configuring the TUN/TAP interface it is possible to access the internet without additional hardware.

Native Simulator

Native Simulator is a POSIX architecture based “board” (Zephyr target) that runs as a standalone Linux executable. It is based on native_simulator and the Zephyr POSIX architecture. As opposed to QEMU, it does not need any middle layer that emulates instructions or peripheral access. Instead, Zephyr (under Native Simulator) runs natively on Linux with very little overhead. Most of the time, it’s as fast as any regular Linux application.

However, Native Simulator does not emulate microcontroller peripherals the same way as QEMU does. It has special modules and functions called trampolines. As an example, instead of using memory-mapped I/O to handle UART drivers (and the logging and shell modules that utilize a UART backend), there are trampolines that translate UART access APIs to pseudo-terminal I/O on the Linux host.

Networking with Native Simulator was already possible using a TUN/TAP interface, so the development experience for IoT applications was similar to QEMU.

The need for offloaded sockets

Issues with TUN/TAP

Networking with QEMU and/or Native Simulator requires root privileges on the host computer in order to create the TUN/TAP network interface, which routes the traffic between Zephyr and the internet. This is a bit of an inconvenience for hackers that have the Zephyr SDK installed directly on their Linux workstation. Setting up proper privileges in Docker is also possible when such a container is used for development purposes. But what about networking in CI pipelines with GitHub Actions or GitLab CI? The only option to get that working is self-hosted runners.

Using a TUN/TAP interface allows us to test almost the entire Zephyr networking stack, down to the Ethernet layer. However, there is no platform-specific driver that talks to an Ethernet PHY. Instead, there is a driver that sends Ethernet frames to a virtual TUN/TAP interface that requires setup on the host (e.g. Linux) system. This has advantages like higher code coverage when testing IoT applications.

Unfortunately, there are many disadvantages as well. Setting up a TUN/TAP interface requires running as a privileged user on the host system. This might not be an issue on a personal PC or laptop. However, root access inside Docker might not always be possible. This is especially true when using existing infrastructure, like GitHub Codespaces, GitHub-hosted runners in GitHub Actions, or hosted GitLab Runners in GitLab.

Offloaded sockets as an alternative

Zephyr has quite a unique feature called socket offloading. This is a mechanism that allows us to utilize (offload to) an external networking stack. Such a stack can be implemented as a 3rd-party library with proprietary drivers that come with a modem. Alternatively, we could use this with an external modem, commonly controlled with AT commands. In both cases, the contract between the Zephyr application and the offloaded networking stack is the socket-level API. One example platform that uses socket offloading is the Nordic nRF9160.

Native Simulator is just a Linux executable. There are no special permissions required to access the internet when writing regular Linux programs in C.

What if the BSD-compatible sockets API (socket(), connect(), recv(), send(), …) could be exposed to Zephyr when running under Native Simulator? This should be possible with a bunch of trampolines between the Zephyr world and the Linux world.
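To make the “no special permissions” point concrete, here is a minimal sketch of a regular Linux C function that uses the same socket(), connect(), send(), and recv() calls as an unprivileged user. It is plain POSIX code, not the Native Simulator implementation; the function name udp_loopback_roundtrip is made up for illustration, and it talks to the loopback address so that it does not depend on any remote server.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: send a UDP datagram to ourselves over loopback and
 * read it back. Returns 0 on success, -1 on any failure. */
int udp_loopback_roundtrip(const char *msg)
{
    char buf[64];
    int ret = -1;
    struct sockaddr_in addr;
    socklen_t addr_len = sizeof(addr);
    struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };
    size_t msg_len = strlen(msg);

    if (msg_len > sizeof(buf)) {
        return -1;
    }

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* Bind, learn the chosen port, then "connect" the socket to itself. */
    if (setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)) == 0 &&
        bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(sock, (struct sockaddr *)&addr, &addr_len) == 0 &&
        connect(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        send(sock, msg, msg_len, 0) == (ssize_t)msg_len &&
        recv(sock, buf, sizeof(buf), 0) == (ssize_t)msg_len &&
        memcmp(buf, msg, msg_len) == 0) {
        ret = 0;
    }

    close(sock);
    return ret;
}
```

Calling udp_loopback_roundtrip("hello") from any ordinary program should succeed without root, which is exactly the property offloaded sockets aim to bring to Zephyr under Native Simulator.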

Native Simulator Offloaded Sockets

Implementation of socket offloading for Native Simulator was part of a recent hackday project I worked on at Golioth. At the end of the day, UDP communication was working, without any setup. This confirmed the idea about networking in Zephyr without root privileges. The next step in the following months was contributing the work to Zephyr with many followup improvements, so that the community can use it.

Development speed

Why should Native Simulator be used for IoT firmware development instead of real hardware? Flashing firmware on a device, connecting to the internet, and then executing the application takes a considerable amount of time. This is where Native Simulator with offloaded sockets shines.

Flashing is not part of the testing process when using Native Simulator. Connecting to the internet (e.g. using WiFi or Cellular) is not needed, since the host machine is connected all the time. And lastly, executing application code is much faster on the beefy host machine compared to a very constrained microcontroller.

This is just theory, so let’s look at some timing measurements for those not convinced yet. We’ll use http_get with TLS, with the minimal modifications required to get connected to a WiFi Access Point. Modified code is available at https://github.com/mniestroj/zephyr/tree/native-sim-http-get-benchmark.

In this example we’ll use an nRF52840DK with an ESP32 running ESP-AT firmware. This is what the “flash + execute” process looks like:

Zephyr's http_get on native_sim vs nrf52840dk

This is how much time it took for each platform to run the http_get sample (once it was already built):

  • Native Simulator: 0.42 s
  • nRF52840-DK: 16.80 s (flash 10.90 s, run 5.90 s)

Wouldn’t you like to go 40 times faster in your development?
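The “40 times” figure follows directly from the measurements above. As a small sketch (the function name speedup is just for illustration), the ratio of the hardware turnaround time (flash plus run) to the Native Simulator turnaround time can be computed like this:

```c
/* Ratio of hardware turnaround (flash + run) to native_sim turnaround.
 * All arguments are in seconds. */
double speedup(double flash_s, double run_s, double native_s)
{
    return (flash_s + run_s) / native_s;
}
```

With the numbers from this post, speedup(10.90, 5.90, 0.42) works out to 16.80 / 0.42, or roughly 40.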

Next steps

Many improvements to Native Simulator Offloaded Sockets were contributed to Zephyr upstream last month. Those will be part of the upcoming Zephyr 3.7.0 (planned for release on 2024/07/26). When the Golioth Firmware SDK includes those changes, it will be much faster to develop and test IoT applications.

Recently I was working on upgrading a Zephyr-based project and encountered the worst of debug situations: the device was completely unresponsive after flashing the firmware. Opening a debug session didn’t yield any help, program flow never reached main, and I wasn’t even able to break on the Zephyr kernel initialization functions. What is there to do in this case? If your problems all start before user code, it’s time to check on what the bootloader is doing. Today we’ll take a look at how to debug MCUboot when all else has failed.

Debugging User Code

Debuggers usually help zero-in on bugs pretty quickly. For this project I was targeting a Thingy91 (based on the Nordic nRF9160) using a J-Link programmer, so west attach is all it takes to start the debugger. However, I was unable to get much useful output when starting a debugging session.

Using GDB to debug user code

As you can see, the debugger doesn’t recognize any symbols at the current memory addresses. This matches up with the device being unresponsive: the app hasn’t started running yet. Let’s go deeper and look at the bootloader.

Loading Bootloader Symbols Into the Debugger

The Zephyr build system already built MCUboot as part of the normal compilation process. To debug the bootloader, simply use the file command to load the .elf file from the MCUboot directory.

Loading the MCUboot elf file in GDB

When building a project for the nRF9160 under NCS, the build/mcuboot/zephyr folder contains the bootloader files. By loading the symbols from the .elf file, we have changed from debugging the user app to debugging the bootloader.

Getting a Useful Backtrace

Resetting and running program flow doesn’t lead to a crash, but we can halt after a second and check the backtrace.

MCUboot backtrace shows a panic

From this output it’s much easier to tell why our device is unresponsive: MCUboot is in a panic state. That’s helpful but we really need to know why. The next step is to set a breakpoint and walk through the code.

Stepping through MCUboot with GDB

The backtrace shows that the panic happened in main. Let’s debug by setting a breakpoint there and stepping through to find more info.

MCUboot reports that it is unable to find a bootable image

After setting the breakpoint the device is reset and the continue command starts program flow. The next command is then used to run each successive call and it doesn’t take long to get to a very useful log message.

687             BOOT_LOG_ERR("Unable to find bootable image");

MCUboot needs to validate the images it is about to run, so this message indicates the image in the slot is invalid. Upon closer inspection (not shown here), some bug in the build system has allowed the image to be built too large when it should have caused the build to fail. MCUboot is aware of the partition table, and validates the signature cutting off at the hard stop of that partition size. This of course makes the signature check fail.
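To illustrate why an oversized image fails validation, here is a toy sketch. It is not MCUboot’s code: the byte-sum toy_digest stands in for the real hash and signature check, and image_validates is a hypothetical helper. The idea is only that the build computes a digest over the whole image, while the bootloader can read no further than the end of the slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a real image hash: a plain byte sum. */
static uint32_t toy_digest(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Hypothetical validation: only the bytes that fit in the slot are hashed,
 * then compared with the digest computed over the full image at build time. */
bool image_validates(const uint8_t *image, size_t image_len,
                     size_t slot_len, uint32_t build_digest)
{
    size_t readable = image_len < slot_len ? image_len : slot_len;

    return toy_digest(image, readable) == build_digest;
}
```

When the image fits in the slot, both digests cover the same bytes and match. When the image is larger than the slot, the bootloader’s digest covers fewer bytes, so the comparison fails even though the image itself is intact.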

On some boards, this error message would have been printed out. However, it seems that the default configuration for the Thingy91 doesn’t enable terminal output for MCUboot, so instead of seeing the message we see nothing. With a little know-how, the debugger revealed the reason why.

View the Debugging Process

Sometimes a text overview is a bit hard to follow. You can see the full debugging process in the terminal capture below.

We got an early look at Nordic’s new cellular modem, the nRF9151, and it already works with Golioth!

With any new board, we ask ourselves “can we connect it to Golioth?”. You may remember a similar post when the nRF7002-DK first came out. Of course the answer for these two boards, and pretty much all other network-enabled embedded systems, is: yes, you can use them with Golioth. So today we’ll walk through the experience of connecting the nRF9151 to Golioth for the first time.

What’s new with the nRF9151?

We love the nRF9160 cellular modem and have support for it in all of the Golioth Firmware SDK samples, as well as using it in the Hardware-in-the-Loop (HIL) testing that is connected to our continuous integration infrastructure. So what’s the deal with the new part?

Finger pointing at a small rectangular chip (SOC)

Most obviously, it’s really really small. The 9151 is about a 20% size reduction from the 9160 (new dimensions are approximately 11×12 mm). Here you can see it’s smaller than the fingernail on my pointer finger. The smaller size also delivers lower peak current consumption. As with the recently announced nRF9161, the nRF9151 supports DECT NR+. And Nordic indicates the new design is fully compatible with the existing nRF91 family of chips.

This is also the first Nordic dev board I’ve seen that uses a USB-C connector. While you can’t get your hands on one of these just yet, since Golioth is partners with Nordic they were kind enough to send us one of these nRF9151 Development Kits to take for a test drive.

Building Golioth examples with the nRF9151

This board is not yet available to order, but support has already been added to Zephyr. To get it working with Golioth, we needed a fix that Nordic merged after their v2.6.1 release of the nRF Connect SDK (NCS). So today I’ll be checking out a commit in between releases. When Nordic releases v2.7.0 everything will work without this extra step.

0. Install the Golioth Firmware SDK

You will need an NCS build environment along with the Golioth SDK. You can follow the Golioth Docs to install an NCS workspace, or add Golioth to your existing NCS workspace.

1. Update NCS version (if needed)

If you are using NCS v2.7.0 (not yet released at the time of writing) or newer, you can skip this step. Otherwise, edit your west manifest and update the NCS version. Below is the west-nrf.yml file from the Golioth SDK with the changed line highlighted.

manifest:
  projects:
    - name: nrf
      revision: 85097eb933d93374fe270ce4c004bea10ee80e97
      url: http://github.com/nrfconnect/sdk-nrf
      import: true

  self:
    path: modules/lib/golioth-firmware-sdk

This happened to be the commit at the tip of main when writing this post. We usually recommend against targeting commits in between releases, so consider this experimental.

2. Add a board Kconfig file for the nRF9151

Add the board-specific configuration to the boards directory. For today’s post, I’m building the Golioth stream sample so I’ve added this nrf9151dk_nrf9151_ns.conf board file to that sample directory.

# General config
CONFIG_HEAP_MEM_POOL_SIZE=4096
CONFIG_NEWLIB_LIBC=y

# Networking
CONFIG_NET_SOCKETS_OFFLOAD=y
CONFIG_NET_IPV6=y
CONFIG_NET_IPV6_NBR_CACHE=n
CONFIG_NET_IPV6_MLD=n

# Raise the priority of the native TLS socket implementation, so that it is chosen instead of
# offloaded nRF91 sockets
CONFIG_NET_SOCKETS_TLS_PRIORITY=35

# Modem library
CONFIG_NRF_MODEM_LIB=y
CONFIG_NRF_MODEM_LIB_ON_FAULT_APPLICATION_SPECIFIC=y

# LTE connectivity with network connection manager
CONFIG_NRF_MODEM_LIB_NET_IF=y
CONFIG_NRF_MODEM_LIB_NET_IF_AUTO_START=y
CONFIG_NRF_MODEM_LIB_NET_IF_AUTO_CONNECT=y
CONFIG_NRF_MODEM_LIB_NET_IF_AUTO_DOWN=y

CONFIG_NET_CONNECTION_MANAGER=y
CONFIG_NET_CONNECTION_MANAGER_MONITOR_STACK_SIZE=1024

# Increased sysworkq size, due to LTE connectivity
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048

# Disable options y-selected by NCS for no good reason
CONFIG_MBEDTLS_KEY_EXCHANGE_DHE_PSK_ENABLED=n
CONFIG_MBEDTLS_KEY_EXCHANGE_DHE_RSA_ENABLED=n

# Generate MCUboot compatible images
CONFIG_BOOTLOADER_MCUBOOT=y
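The CONFIG_NET_SOCKETS_TLS_PRIORITY line in the configuration above relies on priority-based selection between socket implementations. The sketch below shows the general idea only. It assumes that a lower priority value is tried first, and the sock_impl structure and pick_impl function are hypothetical names, not Zephyr’s actual API. Real selection also considers whether an implementation supports the requested socket family and protocol.

```c
#include <stddef.h>

/* Hypothetical description of one registered socket implementation. */
struct sock_impl {
    const char *name;
    int priority; /* assumption: lower value is tried first */
};

/* Return the implementation with the lowest priority value,
 * or NULL if the list is empty. */
const struct sock_impl *pick_impl(const struct sock_impl *impls, size_t count)
{
    const struct sock_impl *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (best == NULL || impls[i].priority < best->priority) {
            best = &impls[i];
        }
    }
    return best;
}
```

For example, if the offloaded nRF91 sockets were registered at priority 40 (an assumed value for illustration), a native TLS entry at 35 would be picked ahead of them, while one at 45 would not.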

3. Build the Golioth stream sample

Building and running this sample is now quite simple. I have included the option to use runtime credentials in this build so that we can provision the device from the Zephyr shell.

$ cd examples/zephyr/stream
$ west build -b nrf9151dk/nrf9151/ns -- -DEXTRA_CONF_FILE=../common/runtime_settings.conf
$ west flash

4. Provision and run the sample

Golioth is free for individual use so sign up for an account if you have not already done so. After creating a project and device we can provision the PSK-ID/PSK by opening a serial connection to the device.

uart:~$ settings set golioth/psk-id <your-psk-id>
uart:~$ settings set golioth/psk <your-psk>

Here’s the terminal output during my tests:

*** Booting nRF Connect SDK v2.6.99-85097eb933d9 ***
*** Using Zephyr OS v3.6.99-18285a0ea4b9 ***
[00:00:00.538,452] <inf> fs_nvs: 2 Sectors of 4096 bytes
[00:00:00.538,482] <inf> fs_nvs: alloc wra: 0, fb8
[00:00:00.538,482] <inf> fs_nvs: data wra: 0, 68
[00:00:00.538,879] <dbg> golioth_stream: main: Start Golioth stream sample
[00:00:00.539,001] <inf> golioth_samples: Bringing up network interface
[00:00:00.539,001] <inf> golioth_samples: Waiting to obtain IP address
[00:00:01.691,894] <inf> lte_monitor: Network: Searching
uart:~$ settings set golioth/psk-id 20240603190757-nrf9151dk@nrf9151-demo
Setting golioth/psk-id to 20240603190757-nrf9151dk@nrf9151-demo
Setting golioth/psk-id saved as 20240603190757-nrf9151dk@nrf9151-demo
uart:~$ settings set golioth/psk e487ea809e5fa705c2af4050150f822c
Setting golioth/psk to e487ea809e5fa705c2af4050150f822c
Setting golioth/psk saved as e487ea809e5fa705c2af4050150f822c
[00:01:10.748,168] <inf> lte_monitor: Network: Registered (roaming)
[00:01:10.748,901] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:01:12.994,964] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:01:12.995,025] <inf> golioth_stream: Sending temperature 20.000000 (sync)
[00:01:12.995,269] <inf> golioth_stream: Golioth client connected
[00:01:12.995,269] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:01:13.543,975] <dbg> golioth_stream: temperature_push_cbor: Temperature successfully pushed
[00:01:18.544,067] <inf> golioth_stream: Sending temperature 20.500000 (async)
[00:01:20.953,582] <wrn> golioth_coap_client: Resending request 0x2001e2c0 (reply 0x2001e308) (retries 2)
[00:01:23.544,311] <inf> golioth_stream: Sending temperature 21.000000 (sync)
[00:01:25.544,677] <wrn> golioth_stream: Failed to push temperature: 9
[00:01:25.772,186] <wrn> golioth_coap_client: Resending request 0x2001e2c0 (reply 0x2001e308) (retries 1)
[00:01:25.947,631] <wrn> golioth_coap_client: Resending request 0x2001e440 (reply 0x2001e488) (retries 2)
[00:01:30.544,738] <inf> golioth_stream: Sending temperature 21.500000 (async)
[00:01:30.581,359] <dbg> golioth_stream: temperature_async_push_handler: Temperature successfully pushed
[00:01:30.949,401] <dbg> golioth_stream: temperature_async_push_handler: Temperature successfully pushed
[00:01:35.544,952] <inf> golioth_stream: Sending temperature 22.000000 (sync)
[00:01:36.326,812] <dbg> golioth_stream: temperature_push_cbor: Temperature successfully pushed
[00:01:41.326,873] <inf> golioth_stream: Sending temperature 22.500000 (async)
[00:01:42.582,946] <dbg> golioth_stream: temperature_async_push_handler: Temperature successfully pushed
[00:01:46.327,117] <inf> golioth_stream: Sending temperature 23.000000 (sync)
[00:01:46.947,204] <dbg> golioth_stream: temperature_push_cbor: Temperature successfully pushed
[00:01:51.947,296] <inf> golioth_stream: Sending temperature 23.500000 (async)
[00:01:52.718,261] <dbg> golioth_stream: temperature_async_push_handler: Temperature successfully pushed
[00:01:56.947,540] <inf> golioth_stream: Sending temperature 24.000000 (sync)
[00:01:57.663,665] <dbg> golioth_stream: temperature_push_cbor: Temperature successfully pushed
[00:02:02.663,726] <inf> golioth_stream: Sending temperature 24.500000 (async)
[00:02:03.725,708] <dbg> golioth_stream: temperature_async_push_handler: Temperature successfully pushed
[00:02:07.663,970] <inf> golioth_stream: Sending temperature 25.000000 (sync)
[00:02:08.589,111] <dbg> golioth_stream: temperature_push_cbor: Temperature successfully pushed
[00:02:13.589,172] <inf> golioth_stream: Sending temperature 25.500000 (async)

5. View the data sent from the device

In the Golioth web console I can navigate to the LightDB Stream tab for the device and see the data as it arrives on the cloud. Try out Pipelines to transform and send that data to a destination.

A table of temperature data displayed on the Golioth web console

What will you do with the nRF9151?

We see a lot of IoT deployments using the nRF9160 to provide a cellular connection. They’re versatile parts with plenty of peripherals. The new nRF9151 part number is nice for your board footprint, and your power budget. And of course, every fleet needs management and data handling. Golioth already works with this SoC and so many more!

Golioth will be joining our friends at Digikey on June 13th to talk about “Leveraging Zephyr to enable super-flexible IoT designs”.

Digikey is where we source many of the parts for our custom hardware and where we often order development boards for putting together demos. If you’ve seen our “Follow Along Hardware”, the product SKUs revolve around Digikey stock.

So we thought it would be a great opportunity to showcase just how many different boards, chips, and sensors we can control using the same base Zephyr code, while sending fleet data back to the Golioth cloud.

What we will cover

In the upcoming webinar, we’ll cover:

  • The basics of Zephyr RTOS and how to get started designing quickly
  • How one code base can serve designs from 3 different microcontroller vendors with 3 different types of connectivity and two different sensor vendors!
    • NXP, Espressif, Nordic processors
    • Ethernet, Wi-Fi, Cellular Communications
    • Sensors from Infineon and Bosch
  • How to utilize Cloud services to deliver interesting features to a product with a single SDK install
  • How Golioth’s end-to-end Reference Designs can jumpstart your own IoT designs
  • How the recently announced Pipelines feature will enable even more flexibility in designs

How to register

Register for the event using this link. You will also be able to get access to the recording if you can’t make the live event, though Golioth staff will be available for live Q&A directly after our presentation.

Yesterday I was upgrading a Golioth Reference Design to the newest version of the Golioth Firmware SDK and I encountered a network error I had never seen before. I was able to track it down fairly quickly using the debugging tools built into Zephyr. This process is quite handy, so today I’ll walk through how to debug a network error in Zephyr using GDB to help others hone their embedded debugging skills.

Encountering an Error

There are two errors shown below. The first is expected: the cell modem is not yet connected to the network so sending data will fail. But soon after the connection is established there is a second error highlighted below.

*** Booting nRF Connect SDK v2.5.2 ***
[00:00:00.465,942] <inf> fs_nvs: 2 Sectors of 4096 bytes
[00:00:00.465,942] <inf> fs_nvs: alloc wra: 0, fb8
[00:00:00.465,972] <inf> fs_nvs: data wra: 0, 68
[00:00:00.466,308] <dbg> golioth_powermonitor: main: Start Power Monitor Reference Design
[00:00:00.466,339] <inf> golioth_powermonitor: Firmware version: 1.2.0
[00:00:00.472,991] <inf> golioth_powermonitor: Modem firmware version: mfw_nrf9160_1.3.1
[00:00:00.474,456] <inf> golioth_powermonitor: Connecting to LTE, this may take some time...
[00:00:00.531,127] <inf> app_sensors: Device: ina260@40, 4.980000 V, 0.335000 A, 1.659999 W
[00:00:00.531,219] <inf> app_sensors: Device: ina260@41, 5.117499 V, 0.000000 A, 0.000000 W
[00:00:00.531,250] <dbg> app_sensors: app_sensors_read_and_stream: Ontime:      (ch0): 1        (ch1): 0
[00:00:00.531,463] <err> app_sensors: Failed to send sensor data to Golioth: 5
[00:00:02.397,918] <inf> lte_monitor: Network: Searching
[00:00:03.772,033] <inf> lte_monitor: Network: Registered (roaming)
[00:00:03.772,521] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:00:03.773,223] <inf> golioth_fw_update: Current firmware version: main - 1.2.0
[00:00:06.010,528] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:06.010,803] <inf> golioth_powermonitor: Golioth client connected
[00:00:06.010,833] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:00:06.421,752] <dbg> app_state: async_handler: State successfully set
[00:00:06.443,542] <err> net_coap: 16 is > sizeof(coap_option->value)(12)!
[00:00:06.443,572] <dbg> app_state: app_state_desired_handler: desired
                                    66 61 6c 73 65                                   |false
[00:00:06.533,081] <inf> app_sensors: Device: ina260@40, 4.982499 V, 0.331250 A, 1.649999 W
[00:00:06.533,203] <inf> app_sensors: Device: ina260@41, 5.115000 V, 0.001250 A, 0.000000 W
[00:00:06.533,233] <dbg> app_sensors: app_sensors_read_and_stream: Ontime:      (ch0): 6003     (ch1): 1
[00:00:06.536,865] <inf> app_settings: Set loop delay to 10 seconds
[00:00:06.538,330] <inf> app_sensors: Device: ina260@40, 4.980000 V, 0.335000 A, 1.679999 W
[00:00:06.538,482] <inf> app_sensors: Device: ina260@41, 5.113749 V, 0.001250 A, 0.000000 W
[00:00:06.538,513] <dbg> app_sensors: app_sensors_read_and_stream: Ontime:      (ch0): 6008     (ch1): 6
[00:00:06.942,932] <dbg> app_settings: on_loop_delay_setting: Received LOOP_DELAY_S already matches local value.
[00:00:06.946,136] <dbg> app_settings: on_loop_delay_setting: Received LOOP_DELAY_S already matches local value.
[00:00:06.949,096] <dbg> app_settings: on_loop_delay_setting: Received LOOP_DELAY_S already matches local value.
[00:00:06.992,187] <inf> golioth_rpc: RPC observation established
[00:00:06.993,041] <inf> golioth_fw_update: Waiting to receive OTA manifest
[00:00:07.365,142] <dbg> app_sensors: get_cumulative_handler: Decoded: ch0: 1579017, ch1: 790405
[00:00:07.365,722] <dbg> app_state: async_handler: State successfully set
[00:00:07.367,553] <dbg> app_sensors: get_cumulative_handler: Decoded: ch0: 1579017, ch1: 790405
[00:00:07.368,133] <dbg> app_state: async_handler: State successfully set
[00:00:07.765,747] <inf> golioth_fw_update: Received OTA manifest
[00:00:07.765,777] <inf> golioth_fw_update: Manifest does not contain different firmware version. Nothing to do.
[00:00:07.765,808] <inf> golioth_fw_update: Waiting to receive OTA manifest

Hmmm, I wonder where this error came from?

<err> net_coap: 16 is > sizeof(coap_option->value)(12)!

I’ve never seen an error of this type before. Zephyr logging puts a logging tag at the beginning of each message and net_coap isn’t one that comes to mind. I started troubleshooting by “grepping”, or searching all files in a directory tree, for this message.

➜ rg net_coap
zephyr/subsys/net/lib/coap/coap.c
8:LOG_MODULE_REGISTER(net_coap, CONFIG_COAP_LOG_LEVEL);
1903:void net_coap_init(void)

Seven different files were returned by rg (that’s ripgrep, which is just a different flavor of grep), but the first one is obviously what we want. You can see the exact net_coap name registered as the logging module tag.

Looking inside that file, I searched for the error message. I only searched for the > sizeof part of the error message since the rest is likely being added to the log using string substitution.

if (option) {
    /*
     * Make sure the option data will fit into the value field of
     * coap_option.
     * NOTE: To expand the size of the value field set:
     * CONFIG_COAP_EXTENDED_OPTIONS_LEN=y
     * CONFIG_COAP_EXTENDED_OPTIONS_LEN_VALUE=<size>
     */
    if (len > sizeof(option->value)) {
        NET_ERR("%u is > sizeof(coap_option->value)(%zu)!",
            len, sizeof(option->value));
        return -EINVAL;
    }

Now we’re getting somewhere. The Zephyr contributor who worked on this code was even kind enough to leave comments on how to fix the error. However, I want to know what is causing the issue in the first place since I’m unfamiliar with this failure. Let’s use the debugger!

Using GDB to Debug Zephyr

Many boards that work with Zephyr have debugging support built right into the ecosystem. From the same directory where the build was run, I can run west attach to start GDB.

In GDB, first type mon reset to prepare the device to start from the beginning of the program. I know from the excerpt above that the error message is printed out from line 590 in the coap.c file. We can use the command b coap.c:590 to set a breakpoint, and start running using c for continue.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/mike/golioth-compile/reference-design-dc-power-monitor/app/build/zephyr/zephyr.elf...
Remote debugging using :2331
arch_cpu_idle () at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/arch/arm/core/aarch32/cpu_idle.S:143
143             cpsie   i
(gdb) mon reset
Resetting target
(gdb) b coap.c:590
Breakpoint 1 at 0x2139e: file /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/subsys/net/lib/coap/coap.c, line 590.
(gdb) c
Continuing.

Breakpoint 1, parse_option (data=0x20013bd1 <rx_buffer> "hE\212\361\206)\035渠.>a\002R.d\adesired\r\003reset_cumulative\377false", offset=<optimized out>, pos=pos@entry=0x2001a54c <golioth_thread_stacks+5708>, max_len=<optimized out>, opt_delta=opt_delta@entry=0x2001a54e <golioth_thread_stacks+5710>, opt_len=opt_len@entry=0x2001a54a <golioth_thread_stacks+5706>, option=option@entry=0x2001a578 <golioth_thread_stacks+5752>) at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/subsys/net/lib/coap/coap.c:590
590                             NET_ERR("%u is > sizeof(coap_option->value)(%zu)!",
(gdb)

Great, we stopped where the error message is printed. At this point I want to know what my program was doing leading up to this moment. For this we can view the backtrace by typing bt.

(gdb) bt
#0  parse_option (data=0x20013bd1 <rx_buffer> "hE\212\361\206)\035渠.>a\002R.d\adesired\r\003reset_cumulative\377false", offset=<optimized out>,
    pos=pos@entry=0x2001a54c <golioth_thread_stacks+5708>, max_len=<optimized out>, opt_delta=opt_delta@entry=0x2001a54e <golioth_thread_stacks+5710>,
    opt_len=opt_len@entry=0x2001a54a <golioth_thread_stacks+5706>, option=option@entry=0x2001a578 <golioth_thread_stacks+5752>)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/subsys/net/lib/coap/coap.c:590
#1  0x0004f8bc in coap_find_options (cpkt=cpkt@entry=0x20020950, code=code@entry=23, options=options@entry=0x2001a578 <golioth_thread_stacks+5752>, veclen=veclen@entry=1)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/subsys/net/lib/coap/coap.c:907
#2  0x0004fa30 in coap_get_option_int (cpkt=cpkt@entry=0x20020950, code=code@entry=23)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/subsys/net/lib/coap/coap.c:1282
#3  0x00031308 in golioth_coap_req_reply_handler (req=req@entry=0x20021558, response=response@entry=0x20020950)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/modules/lib/golioth-firmware-sdk/src/zephyr_coap_req.c:180
#4  0x00055e38 in golioth_coap_req_process_rx (client=client@entry=0x20020518, rx=rx@entry=0x20020950)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/modules/lib/golioth-firmware-sdk/src/zephyr_coap_req.c:362
#5  0x000326be in golioth_process_rx_data (len=<optimized out>, data=<optimized out>, client=0x20020518)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/modules/lib/golioth-firmware-sdk/src/coap_client_zephyr.c:866
#6  golioth_process_rx (client=0x20020518) at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/modules/lib/golioth-firmware-sdk/src/coap_client_zephyr.c:949
#7  golioth_coap_client_thread (arg=0x20020518) at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/modules/lib/golioth-firmware-sdk/src/coap_client_zephyr.c:1092
#8  0x0004d8a8 in z_thread_entry (entry=0x566ad <golioth_thread_main>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>)
    at /home/mike/golioth-compile/reference-design-dc-power-monitor/deps/zephyr/lib/os/thread_entry.c:48
#9  0xaaaaaaaa in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

The backtrace places the most recent function call at the top in position #0. Looking down the list I can see that starting at frame #3 the Golioth SDK is calling a Zephyr CoAP function. Walking back through those function calls I established that we received a CoAP packet and are trying to decode the options stored in that packet.

I don’t really need to know how all of that packet handling is done… what is more important to me is to see the packet itself to help illuminate why there’s an option in it that is too big for the configured space. Luckily, GDB lets us look at what’s stored in memory.

Using GDB to Inspect Data in Memory

If we look at the coap.c source code, we find the breakpoint we set is inside the parse_option function.

static int parse_option(uint8_t *data, uint16_t offset, uint16_t *pos,
            uint16_t max_len, uint16_t *opt_delta, uint16_t *opt_len,
            struct coap_option *option)

This has a data array as a parameter that likely has our CoAP packet in it. We can print this out to see the data. It’s as simple as p data, with data being the name of the variable.

(gdb) p data
$1 = (uint8_t *) 0x20013bd1 <rx_buffer> "hE\212\361\206)\035渠.>a\002R.d\adesired\r\003reset_cumulative\377false"

(Note: yes, that 渠 is what GDB actually outputs. Binary data sometimes has weird consequences, especially when there are Unicode characters for symbols that match.)

We’re getting somewhere, but this is not all that useful since it was printed as ASCII values instead of showing the actual hexadecimal data. Let’s print that out.

(gdb) p/x data@max_len
value has been optimized out

The p/x data@max_len command tells GDB to print hexadecimal data from an array called data and to use the max_len variable to determine how many bytes to print. But it looks like we’re stymied by the optimization of the program.

The max_len of the data array has already been optimized out and is unavailable to us. The next thing to do is to print out an arbitrary number of bytes by guessing at the length of the data array. Since we were already able to print it, I’m guessing it’s about 64 bytes, and then I can use the ASCII values of the final parts of that string to figure out where the data actually ends:

(gdb) x/64xb data
0x20013bd1 <rx_buffer>: 0x68    0x45    0x8a    0xf1    0x86    0x29    0x1d    0xe6
0x20013bd9 <rx_buffer+8>:       0xb8    0xa0    0x2e    0x3e    0x61    0x02    0x52    0x2e
0x20013be1 <rx_buffer+16>:      0x64    0x07    0x64    0x65    0x73    0x69    0x72    0x65
0x20013be9 <rx_buffer+24>:      0x64    0x0d    0x03    0x72    0x65    0x73    0x65    0x74
0x20013bf1 <rx_buffer+32>:      0x5f    0x63    0x75    0x6d    0x75    0x6c    0x61    0x74
0x20013bf9 <rx_buffer+40>:      0x69    0x76    0x65    0xff    0x66    0x61    0x6c    0x73
0x20013c01 <rx_buffer+48>:      0x65    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x20013c09 <rx_buffer+56>:      0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00

The x/64xb data command prints out exactly what we’re after. In GDB the x command prints out memory contents (I always remember this as “examine”). The slash (/) adds the additional commands to print 64 hexadecimal (x) bytes (b) starting from the pointer address named data.
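Finding where the packet ends inside that 64-byte dump can also be done mechanically. The sketch below uses a hypothetical helper, data_end, that strips the trailing zero bytes from a dumped buffer. This is only a heuristic that happens to suit this dump: a packet whose payload legitimately ends in zero bytes would be cut short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the number of bytes left after dropping
 * trailing 0x00 padding from a dumped buffer. */
size_t data_end(const uint8_t *buf, size_t len)
{
    while (len > 0 && buf[len - 1] == 0x00) {
        len--;
    }
    return len;
}
```

Applied to the dump above, the last non-zero byte is the 0x65 at rx_buffer+48, so the packet is 49 bytes long.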

Decoding the CoAP Packet

After just a bit of cleanup, I have the data I’m after but it’s certainly not human readable. I like to use a site called Koap Online CoAP Decoder to take care of this for me:

When we go back to the original error message, the option that is too long is 16 characters. From the decoding above we see that the third option is a path called reset_cumulative that is 16 characters long. This is too long for the 12 character buffer we have configured in the Zephyr CoAP library!
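If you would rather check the option lengths by hand, the framing is simple enough to walk in a few lines of C. The following is a minimal sketch, assuming the standard CoAP layout from RFC 7252 (4-byte header, token, then delta-encoded options up to the 0xff payload marker). It is not the decoder used above and not the Zephyr CoAP library; the names dump, coap_opt, read_ext, and coap_walk_options are hypothetical and exist only for this illustration. The byte array is copied from the GDB output above with the trailing zero padding dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 49 bytes of rx_buffer from the GDB dump (zero padding omitted). */
const uint8_t dump[] = {
	0x68, 0x45, 0x8a, 0xf1, 0x86, 0x29, 0x1d, 0xe6,
	0xb8, 0xa0, 0x2e, 0x3e, 0x61, 0x02, 0x52, 0x2e,
	0x64, 0x07, 0x64, 0x65, 0x73, 0x69, 0x72, 0x65,
	0x64, 0x0d, 0x03, 0x72, 0x65, 0x73, 0x65, 0x74,
	0x5f, 0x63, 0x75, 0x6d, 0x75, 0x6c, 0x61, 0x74,
	0x69, 0x76, 0x65, 0xff, 0x66, 0x61, 0x6c, 0x73,
	0x65,
};

/* Hypothetical record describing one decoded option. */
struct coap_opt {
	uint16_t number;      /* option number (6 = Observe, 11 = Uri-Path) */
	uint16_t len;         /* length of the option value in bytes */
	const uint8_t *value; /* points into the original buffer */
};

/* Expand a 4-bit delta/length nibble: 13 means one extension byte follows,
 * 14 means two extension bytes follow, 15 is reserved.
 * Returns 0 on success, -1 on a truncated or reserved field. */
static int read_ext(const uint8_t *buf, size_t len, size_t *pos, uint32_t *val)
{
	if (*val == 13) {
		if (*pos + 1 > len) {
			return -1;
		}
		*val = 13u + buf[(*pos)++];
	} else if (*val == 14) {
		if (*pos + 2 > len) {
			return -1;
		}
		*val = 269u + (((uint32_t)buf[*pos] << 8) | buf[*pos + 1]);
		*pos += 2;
	} else if (*val == 15) {
		return -1;
	}
	return 0;
}

/* Walk the options of a CoAP message. Returns the number of options stored
 * in out[], or -1 if the message is malformed or out[] is too small. */
int coap_walk_options(const uint8_t *buf, size_t len,
		      struct coap_opt *out, int max)
{
	assert(buf != NULL && out != NULL);

	if (len < 4) {
		return -1;
	}

	/* Skip the fixed header plus the token (length is the low nibble). */
	size_t pos = 4u + (buf[0] & 0x0f);
	uint32_t number = 0;
	int count = 0;

	if (pos > len) {
		return -1;
	}

	while (pos < len && buf[pos] != 0xff) {
		uint32_t delta = buf[pos] >> 4;
		uint32_t optlen = buf[pos] & 0x0f;

		pos++;
		/* Extended delta bytes come before extended length bytes. */
		if (read_ext(buf, len, &pos, &delta) != 0 ||
		    read_ext(buf, len, &pos, &optlen) != 0) {
			return -1;
		}
		if (optlen > len - pos || count == max) {
			return -1;
		}

		number += delta; /* option numbers are delta-encoded */
		out[count].number = (uint16_t)number;
		out[count].len = (uint16_t)optlen;
		out[count].value = &buf[pos];
		count++;
		pos += optlen;
	}
	return count;
}
```

Calling coap_walk_options(dump, sizeof(dump), opts, 8) on this buffer finds four options: one Observe option (number 6) followed by three Uri-Path options (number 11) with lengths 2, 7, and 16. The last one is reset_cumulative. Its length nibble is 13 (the 0x0d byte), so the real length is 13 plus the following extension byte 0x03, which is where the 16 comes from.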

I did this to myself! The application I’m working on is observing a Golioth LightDB State path and I chose a long name:

I followed the advice from the code comments in the Zephyr file and that fixed things right up.

# Adjust coap setting for a long (16-char) LightDB State sub-path
CONFIG_COAP_EXTENDED_OPTIONS_LEN=y
CONFIG_COAP_EXTENDED_OPTIONS_LEN_VALUE=16

Make the Debugger Your Go-To

The worst part about using a debugger is usually setting things up. But in many cases, that work has already been done for you in the Zephyr ecosystem. Try out these skills the next time an unfamiliar error pops up in your embedded development work!

Manufacturing is a marathon, not a sprint. Zephyr RTOS includes numerous features to help you at every step along the way, from initial prototype to maintaining your hardware fleet in the field. Golioth’s Developer Relations lead, Chris Gammell, spoke on this topic at the 2024 Embedded Open Source Summit.

Chris’ approach boils down to breaking manufacturing into five distinct phases:

  1. Early prototype
  2. Custom hardware
  3. First device in production
  4. Scaling production
  5. Maintaining a scaled fleet

The challenges of each phase exist whether or not you’re using Zephyr. But this RTOS has good tools you should utilize to smooth out many wrinkles. Let’s walk through each phase to see what is involved. The full set of talk slides is available at the bottom of this post.

Early Prototyping on Dev Boards

Chris always starts his prototyping out with commercially available development boards when possible. This means the hardware is in a known working state. Even if you haven’t finalized all of your hardware choices, Zephyr offers great portability so you can relatively easily change to a different part without the need to scrap your early work.

Zephyr also offers a number of tools for early tinkering. The menuconfig system is excellent to explore the configuration options available for the peripherals you have chosen. And the Zephyr shell is fantastic when validating new parts. For instance, the sensor and i2c shells facilitate live interaction with your sensors before getting down to the business of writing C code. Read about Golioth community member Timon switching over to Zephyr for prototyping.

Custom HW, First Pilot

As you move into your first pilot, this will likely be the first time you stand up custom hardware to ensure the system design works. Take time here to validate all of the parts in the design. Now is when you should confirm that you have the feature coverage necessary to meet your needs. Check that the parts you have on the board are all needed, and ditch the ones that aren’t.

This is also a great time to begin planning for how you will test and provision each device. What kind of test points do you need? Ensure you’ve correctly routed the programming header and test placement for quick work during manufacturing.

Zephyr’s debugging features come into play here. Consider the best setup for Zephyr’s logging system, whether that’s just turning it on and off, or changing up backends like the Golioth logging backend that sends logs to the cloud. Give thread-aware debugging suites like Ozone and Systemview a try before you need them. You’ll get a ton of insight into how your system is performing before a showstopper forces you to!

First Devices in Production

Pick a number of devices, maybe 100, for your first manufacturing run. This will be the first glimpse you have into some of the issues that will surface when you scale your production.

At this point, Chris likes to reach for the Zephyr board definitions and make use of the support for board revisions. When peculiar behavior happens, the ease of compiling the same code for two different board revisions will help you discover if it’s something that’s always been there, or just arrived at the party.

commands to build firmware for different revisions of a board

Now is the time to set up your hardware-in-the-loop testing. You need to move fast and manual testing is the opposite of that. It’s not too late to adapt hardware for automated testing and you’ll thank yourself later. Once you have a programming and serial interface to the boards, Zephyr will swoop in with Twister and pytest that can be run on every PR and merge to catch problems early and run cycle tests far more frequently than you would otherwise.

Finally, don’t forget to plan for how you will perform firmware updates. Sure, you can plug USB cables into the 100 units you have in front of you, but that’s going to get really old when you do two patch releases in the same week. Don’t wait until you start to scale, set up your OTA updates now so you can begin testing automatic updates. With Golioth, you can do OTA from day one!

Scaling Production

This is it, time to turn the process up to 11 and start churning out boards. Smart decisions now will have a huge impact on your bottom line, so firm up those decisions on whether or not you need the top chip version in the family or can take it down a notch or two.

Consider how your choices affect cost after production. For instance, power budget is often a very large consideration. Zephyr includes a Power Management subsystem that should be used for battery-operated devices.

Network bandwidth is more directly related to monetary cost; optimizing your data usage leads to lower cellular and data usage bills. Even small savings scale! Consider configuring log levels to disable debug and info messages during normal operation. The Golioth settings service or an RPC can be used to remotely configure this. The same is true for what data is being streamed back to the servers and how frequently. We also recommend implementing a reboot RPC as a simple version of the “have you tried turning it off and back on again?” adage.

Chris touches on the topic of developing test stands for use during manufacturing. These interface with your hardware, and may work in conjunction with custom Zephyr shell commands to control the device during tests.

Maintaining a Scaled Fleet

You haven’t really crossed the finish line until your deployed devices reach the end of their usable lifetime in the field. This means maintenance, as myriad different operating conditions are sure to turn up unknown behavior.

If you followed Chris’ guidance in previous steps, your OTA update system is already in place and can be utilized to push out updates to address problems. Be sure to take advantage of simple things like Zephyr’s watchdog subsystem for automatic reboot when all else fails. But ultimately you want to fix the problems in place, so leveraging core dumps, and perhaps pushing fixes outside of full updates using Zephyr’s LLEXT feature, is worth a look.

Slides

Give Chris’ talk a shot. There’s a ton of useful information there, whether this is your first rodeo or you’ve been rolling boards off of the production line since Chris was still in diapers. Manufacturing is defined by change. Embrace that concept and you’ll never be left behind.

Slides are below, the video is embedded at the top of the post.