Almost two years ago I wrote a guide on how to interface with USB devices from WSL2 because many of our users were developing on Windows but wanted to use Linux-native tools for projects like Zephyr. I’ve heard from countless devs thanking me for the guide but with one wish – a Graphical User Interface (GUI) to simplify the process of managing USB devices. The most common alternative to running Linux tools on Windows is usually a full blown VM like VirtualBox, which has a nice UI. usbipd-win is certainly no GUI.

USB Settings in VirtualBox

I recently had to set up a new Windows machine and decided to see if there’s been any enhancements to the USB support in WSL2. I was pleasantly surprised to find not one but two GUIs! The rest of this post will walk you through the two options but will assume you have read the original blog.

WSL USB Manager

The first tool I found is a nice Python-based desktop GUI hosted over on GitLab. Installation is a breeze thanks to the MSI installer included in official releases. It delivers on the fundamentals of being a USB manager – list, view state, attach/detach – but it also adds some niceties.

screenshot of wsl-usb-gui

Source: https://gitlab.com/alelec/wsl-usb-gui

For example, usbipd-win added an auto-attach feature since my original article, which can be very handy when your embedded device reboots and needs to be re-attached. However, it can be tricky to set up from the commandline. WSL USB Manager allows you to define a profile right from the UI that handles auto-attach more smoothly. Another neat thing it does is automatically create the correct UDEV rule for you so you have one less permission issue to deal with.

USBIP Connect for VS Code

Given the popularity of Visual Studio Code, it’s no surprise someone created USBIP Connect. Setup is simple but there’s a few important details to get it working correctly.

First, you need to install the official WSL extension and connect to WSL. Install USBIP Connect from the marketplace only after connected, otherwise things won’t work. You’ll know if you’ve done the setup correctly if you see USBIP Connect listed under the “WSL: UBUNTU” list of extensions, not “LOCAL.”

USBIP Connect for VS Code installed in WSL

The extension adds a simple “Attach” and “Detach” button to the status bar and corresponding commands to the Command Palette. Simplicity is the name of the game, which is a plus for a VS Code extension.

The only issue is that permissions are not automated like in WSL USB Manager, so you’ll get an error the first time you try to Attach a device.

USBIP Connect permission error

You’ll either need to use a terminal with Administrator privileges or WSL USB Manager, which is what I did while writing this post.

USB all the things, again!

So there you have it, two different GUIs for managing USB devices with WSL2. Hopefully they help the remaining Windows devs who prefer something graphical over using the commandline to get work done with a little less frustration. But if you run into any issues, please post in our forum.

Golioth is an IoT company that supports as much custom hardware as possible: a multitude of microcontrollers and many different connection types. This presents a challenge when testing on real hardware.

We developed tooling that tests the Golioth Firmware SDK on actual boards. Known as Hardware-in-the-Loop (HIL) testing, it’s an important part of our CI strategy. Our latest development is automatically detecting boards connected to our CI machines; these can recognize what type of hardware is attached to each self-hosted test runner. In a nutshell, you plug in the board to USB and run a workflow to figure out what you just plugged it.

Let’s take a look at how we did that, and why it’s necessary.

Goals and Challenges of Hardware Testing

So the fundamental goals here are:

  1. Run tests on actual hardware
  2. Make the number of tests scalable
  3. Make the number of hardware types scalable

If we were running all of our firmware tests manually, we’d be fine on goal #1. But when scaling and repeatability comes into play, we start to look at automation. In particular, we look at making it easy to create more of the self-hosted runners (the computers running the automated test programs).

Adding a new board to one of those runners is no small task. If you build a self-hosted runner today, then next week add two new hardware variations that need testing, how will the runner know how to interface with those boards?

Frequent readers of the Golioth blog will remember that we already set up self-hosted runners, so go back and look the first post and second post in this series. Those articles detail how we are using GitHub’s self-hosted runner infrastructure to use workflows to perform Hardware-in-the-Loop tests. I also did a Zephyr tech talk on self-hosted runners.

Github workflow output from board recognition processTo make things more scalable, we developed a workflow that queries all of the hardware connected to a runner. It runs some tests to determine the type of hardware, the port, and the programmer serial number (or other unique id) of that hardware. This information is written to a yaml file for that runner and made available to all other workflows that need to run on real hardware.

Recognizing Microcontroller Boards Connected Via USB

Self-hosted runners are nothing more than a single-board computer running Linux and connected to the internet (and in our case connected to GitHub). These runners are in the home offices of my colleagues around the world, but it doesn’t matter where they are. Our goal is that we don’t need to touch the runners after initial set up. If we add new hardware, just plug in a USB cable and the hardware part of the setup is done.

The first step in recognizing what new hardware has been plugged in is listing all the serial devices. Linux provides a neat little way of doing this by the device ID:

golioth@orangepi3-lts:~$ ls /dev/serial/by-id -1a
.
..
usb-SEGGER_J-Link_000729255509-if00
usb-SEGGER_J-Link_001050266122-if00
usb-SEGGER_J-Link_001050266122-if02
usb-Silicon_Labs_CP2102N_USB_to_UART_Bridge_Controller_4ea3e6704b5fec1187ca2c5f25bfaa52-if00-port0

This represents three devices: Nordic, NXP, and Espressif. For now we assume the SiLabs USB-to-UART is an Espressif device because we don’t have any other devices that use that chip. However, there are multiple SEGGER entries, so we need to sort those out. We also need to know which type of ESP32 board is connected.

We use regular expressions to pull out the unique serial numbers from each of these and perform the board identification in the next step.

Using J-Link to Identify Board Type

Boards that use J-Link include a serial number in their by-id listing that can be used to gather more information. This works great for Nordic development boards. It works less great for NXP boards.

from pynrfjprog import LowLevel
from board_classes import Board

chip2board = {
    # Identifying characteristic from chip: Board name used by Golioth
    LowLevel.DeviceName.NRF9160: "nrf9160dk",
    LowLevel.DeviceName.NRF52840: "nrf52840dk",
    LowLevel.DeviceName.NRF5340: "nrf7002dk"
    }

def get_nrf_device_info(snr):
    with LowLevel.API() as api:
        api.connect_to_emu_with_snr(snr)
        api.connect_to_device()
        api.halt()
        return api.read_device_info()

def nrf_get_device_name(board: Board):
    with LowLevel.API() as api:
        snr_list = api.enum_emu_snr()

    if board.snr not in snr_list:
        return None

    try:
        device_info = get_nrf_device_info(board.snr)
        if not device_info:
            autorecognized = False
        else:
            for i in device_info:
                if i in chip2board:
                    board.name = chip2board[i]
                    autorecognized = True
    except Exception as e:
        print("Exception in find_nrf_device():", e)
        autorecognized = False

    return autorecognized

Nordic provides a Python package for their nrfjprog tool. Its low-level API can read the target device type from the J-Link programmer. Since we already have the serial number included in the port listing, we can use the script above to determine which type of chip is on each board. For now, we are only HIL testing three Nordic boards and they all have different processors which makes it easy to automatically recognize these boards.

An NXP mimxrt1024_evk board is included in our USB output from above. This will be recognized by the script but the low-level API will not be able to determine what target chip is connected to the debugger. For now, we take the port and serial number information and manually add the board name to this device. We will continue to refine this method to increase automation.

Using esptool.py to Guess Board Type

Espressif boards are slightly trickier to quantify. We can easily assume which devices are ESP32 based on the by-id path. And esptool.pyEspressif’s software tool for working with these chips – can be used in a roundabout way to guess which board is connected.

import esptool
from board_classes import Board

chip2board = {
    # Identifying characteristic from chip: Board name used by Golioth
    'ESP32': 'esp32_devkitc_wroom',
    'ESP32-C3': 'esp32c3_devkitm',
    'ESP32-D0WD-V3': 'esp32_devkitc_wrover',
    'ESP32-S2': 'esp32s2_saola',
    'ESP32-S3': 'esp32s3_devkitc'
    }

## caputure output: https://stackoverflow.com/a/16571630/922013
from io import StringIO 
import sys

class Capturing(list):
    def __enter__(self):
        self._stdout = sys.stdout
        sys.stdout = self._stringio = StringIO()
        return self
    def __exit__(self, *args):
        self.extend(self._stringio.getvalue().splitlines())
        del self._stringio    # free up some memory
        sys.stdout = self._stdout
## end snippet

def detect_esp(port):
    try:
        with Capturing() as output:
            esptool.main(["--port", port, "read_mac"])

        for line in output:
            if line.startswith('Chip is'):
                chip = line.split('Chip is ')[1].split(' ')[0]
                if chip in chip2board:
                    return chip2board[chip]
                else:
                    return chip
    except Exception as e:
        print("Error", e)
        return None

def esp_get_device_name(board: Board):
    board.name = detect_esp(board.port)

One of the challenges here is that esptool.py doesn’t include an API to simply return the chip designation. However, part of the standard output for any command includes that information. So we listen to the output when running the read_mac command and use simple string operations to get the chip designation.

golioth@orangepi3-lts:~$ esptool.py read_mac
esptool.py v4.6.2
Found 4 serial ports
Serial port /dev/ttyUSB0
Connecting....
Detecting chip type... Unsupported detection protocol, switching and trying again...
Connecting.....
Detecting chip type... ESP32
Chip is ESP32-D0WD-V3 (revision v3.0)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: 90:38:0c:eb:0a:28
Uploading stub...
Running stub...
Stub running...
MAC: 90:38:0c:eb:0a:28
Hard resetting via RTS pin...

While this doesn’t tell us what board is connected, a lookup table of the ESP32 dev boards Golioth commonly tests makes it easy to identify the board. I feel like there should be a way to simply use the mac address to determine the chip type (after all, esptool.py is doing this somehow) but I haven’t yet found documentation about that.

Using Board Configuration

Now we know the board name, serial port, and serial number for each connected board. What should be done with that information? This is the topic of a future post. Generally speaking, we write it to a yaml file in the repository that is running the recognition workflow.

Yaml file showing board name, serial port, and serial number

In a future workflow, that information is written back to the self-hosted runner where it can be used by other workflows. And as part of that process, we assign labels such as has_nrf52840dk so GitHub knows which runners to use for any given test.

Take Golioth for a Test Drive

We do all this testing so that you know the Golioth platform is a rock solid way to manage your IoT fleet. Don’t reinvent the wheel, give Golioth’s free Dev Tier a try and get your fleet up and running fast.

IoT devices are usually battery-operated and, more often than not, need to run on a single battery charge for multiple years. Before we know it, MCU power consumption becomes a huge deal when developing a product. Measuring power consumption of an MCU can be challenging since it does not depend on just one thing. It depends on multiple factors like clock frequency, what is connected to the outputs, which peripherals are enabled, and also the different MCU-specific power modes in use.

With known average current consumption and battery voltage, we can easily calculate the lifetime of a battery charge. Notice the word average; we are interested in the average current consumption over some fixed time period. To maximize battery life, developers must minimize power consumption over the life of the product.

For this blog post, we are going to measure current consumption with the Power Profiler Kit II. We’ll utilize a well-known friend as the target, the nRF9160 DK.

Device Operating Modes

Most battery-powered devices spend much of their time asleep, waking up to perform their functions, and then going back to sleep. For these applications, battery life depends, at least, on three major aspects of the microcontroller:

  • Run/Active mode: device is using the most power (fetching sensor data, communication with the Cloud, running algorithms, etc.)
  • Standby/Idle mode: the “do-nothing” state (an idle thread, sleeping in a while loop, etc.)
  • Sleep Mode: deepest internal power saving mode the system can enter (System OFF mode in the case of nRF9160 DK)

The “do-nothing” state of the microcontroller often has many different forms.
Start-up time, that being how long it takes the microcontroller to go from the do-nothing state to running the application state and the application runtime. The current consumption also varies depending on the operating temperature and schedule of the tasks.

Let’s consider an example where a device needs to measure an ambient temperature from a BME280 sensor, send the reading over cellular to the cloud, and wait for a fixed time period. This is the gist of what we do with our Cold Chain Asset Tracker. Since ambient temperature is not changing rapidly, we can safely say that the fixed time period (task schedule), can be 5 minutes and that starting the measurement and sending data to the cloud takes 5 seconds. This means the duration of the Idle mode is 60 times more than the Active state (application duty cycle).

Current, Power and Energy Consumption

It is common for developers to start their processor power analysis by considering active processing power. Though it may seem counterintuitive, the power the microcontroller consumes when not operating is often more important than active processing power. Referring back to the remote sensing application, the system typically wakes up from standby mode once every 5 minutes, so it remains in standby mode more than 98% of the time.

Power is defined as:

P = I • V [W]

 

where I is the current drawn from the battery, and V is the battery voltage.

Energy is defined as:

Energy = I • V • Time = P • Time [Wh]

So energy is nothing more than Power used over a period of time. This can be expressed as [V • Ah], where, Ah is the charge stored in the battery (check the back of your power bank 🙂 ).

Lithium Ion Batteries are usually 3.7 V, with different charge stored in the battery (ranging from 100 mAh to 10 Ah or more).

Let’s consider an example. A battery has a rating of 2 Ah at 1.5V (typical AA/AAA battery); the energy stored in the battery is:

E = 2 • 1.5 = 3 [Wh]

If we connect a 1 Ω resistor to the battery (neglecting the internal battery resistance), and use Ohm’s law, it will draw 1.5 A of current. That’s a load of 2.25 W, meaning we’ll run out of battery charge after 1.3 hours.

Now that we know our device has to take advantage of its low-power mode and other techniques to lower power consumption, how do we measure the current consumption? That’s where the Power Profiles Kit 2 from Nordic comes in.

Power Profiler Kit II

The Power Profiler Kit II (PPK2) is a standalone unit, which can measure and optionally supply currents all the way from sub-μA and as high as 1 A on all Nordic DKs, in addition to external hardware. The PPK2 is powered via a standard 5 V USB cable, which can supply up to 500 mA of current. In order to supply up to 1 A of current, two USB cables are required.

An ampere meter-only mode, as well as a source mode (shown as AMP and source measure unit (SMU) respectively on the PCB) are supported. For the ampere meter mode, an external power supply must source VCC levels between 0.8 and 5 V to the device under test (DUT).

For the source mode, the PPK2 supplies VCC levels between 0.8 and 5 V and the on-board regulator supplies up to 1 A of current to external applications. It is possible to measure low sleep currents, higher active currents, as well as short current peaks on all Nordic DKs, in addition to external hardware.

Connecting the PPK2 to the nRF9160 DK is straightforward and explained on Nordic’s website.

Power Consumption of Golioth’s hello example

For the first example, we are going to measure the current consumption of Golioth’s hello sample without modifications and use the PPK2 in ampere meter mode.

We focus on three areas:

  1. Device connects to the cellular tower, Golioth Cloud, and sends a hello message
  2. Device sends a hello message to Golioth Cloud
  3. Device is in an Idle mode

From the image, the most power-intensive is the first area, where the nRF9160 DK is connecting to the cellular tower, then to Golioth Cloud, and finally, it sends a hello message. Once connected to Golioth, the device is provisioned and does not need to go through the process again (unless the IP address of the device changes or it’s using Connection ID). That’s the reason the power consumption in the second area is lower compared to the first area. Current spikes of ~100 mA come from the modem using its RF circuitry to transmit and receive data from the cellular tower. The third area is when the application is in Idle mode (there is a k_sleep call in the while loop in the sample), and the modem is not in use, with a sustained current consumption of ~35 mA.

For the second example, the hello sample is a bit altered; after five hello messages are sent with a five second delay between them, the Golioth system client is stopped, and lte_lc_offline API call is made, which sets the device to flight mode, disabling both transmit and receive RF circuits and deactivating LTE and GNSS services.

There are still current spikes when the modem is in use, but the current drawn in Idle mode (marked with 4) is 10 times smaller (~4 mA), which will prolong the battery life substantially.

Please note: the 4 mA number above had no optimizations in place and do not represent the low current capabilities of the nRF9160. We will be doing a followup article where we showcase just how low things can go!

 

Conclusion

While we were successful in lowering current consumption by a factor of 10 by disabling the modem, there are still gains to be found! In the next blog post, we’ll write more about Zephyr’s Power Mode Subsystem and how to utilize it to reduce current consumption even more!

As embedded developers, we’re consistently seeking ways to make our processes more efficient and our teams more collaborative. The magic ingredient? DevOps.Its origins stem from the need to break down silos between development (Dev) and operations (Ops) teams. DevOps fosters greater collaboration and introduces innovative processes and tools to deliver high-quality software and products more efficiently.

In this talk, we will explore how we can bring the benefits of DevOps into the world of IoT. We will focus on using GitHub Actions for continuous integration and delivery (CI/CD) while also touching on how physical device operations like shipping & logistics which can be streamlined using a DevOps approach.

Understanding the importance of DevOps in IoT is crucial to unlocking efficiencies and streamlining processes across any organization that manages connected devices. This talk, originally given at the 2023 Embedded Online Conference (EOC), serves as one of the many specialized talks freely accessible on the EOC site.

GitHub Actions for IoT

To illustrate how to put these concepts into practice, we’re going to look at a demo using an ESP32 with a feather board and Grove sensors for air quality monitoring. It’s important to note that while we utilize GitHub Actions in this instance, other CI tools like Jenkins or CircleCI can also be effectively used in similar contexts based on your team’s needs and preferences.

For this example we use GitHub Actions to automate the build and deployment process.

The two main components of our GitHub Actions workflow are ‘build’ and ‘deploy’ jobs. The ‘build’ job uses the pre-built GitHub Action for ESP-IDF to compile our code, and is triggered when a new tag is pushed or when a pull request is made. The ‘deploy’ job installs the Golioth CLI, authenticates with Golioth, uploads our firmware artifact, and releases that artifact to a set of devices over-the-air (OTA).

Imagine an organization that manages a fleet of remote air quality monitors across multiple cities. This GitHub Actions workflow triggers the build and deployment process automatically when the development team integrates new features or bug fixes into the main branch and tags the version. The updated firmware is then released and deployed to all connected air quality monitors, regardless of their location, with no additional logistics or manual intervention required. This continuous integration and deployment allows the organization to respond rapidly to changes and ensures that the monitors always operate with the latest updates.

Let’s delve into the GitHub Actions workflow and walk through each stage:

  1. Trigger: The workflow is activated when a new tag is pushed or a pull request is created.
    on:
      push:
        # Publish semver tags as releases.
        tags: [ 'v*.*.*' ]
      pull_request:
        branches: [ main ]
  2. Build: The workflow checks out the repository, builds the firmware using the ESP-IDF GitHub Action, and stores the built firmware artifact.
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
        - name: Checkout repo
          uses: actions/checkout@v3
          with:
            submodules: 'recursive'
        - name: esp-idf build
          uses: espressif/esp-idf-ci-action@v1
          with:
            esp_idf_version: v4.4.4
            target: esp32
            path: './'
          env:
            WIFI_SSID: ${{ secrets.WIFI_SSID }}
            WIFI_PASS: ${{ secrets.WIFI_PASS }}
            PSK_ID: ${{ secrets.PSK_ID }}
            PSK: ${{ secrets.PSK }}
        - name: store built artifact
          uses: actions/upload-artifact@v3
          with:
            name: firmware.bin
            path: build/esp-air-quality-monitor.bin
  3. Deploy: The workflow installs the Golioth CLI, authenticates with Golioth, downloads the built firmware artifact, and uploads it to Golioth for OTA updates.
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
        - name: Checkout repo
          uses: actions/checkout@v3
          with:
            submodules: 'recursive'
        - name: esp-idf build
          uses: espressif/esp-idf-ci-action@v1
          with:
            esp_idf_version: v4.4.4
            target: esp32
            path: './'
          env:
            WIFI_SSID: ${{ secrets.WIFI_SSID }}
            WIFI_PASS: ${{ secrets.WIFI_PASS }}
            PSK_ID: ${{ secrets.PSK_ID }}
            PSK: ${{ secrets.PSK }}
        - name: store built artifact
          uses: actions/upload-artifact@v3
          with:
            name: firmware.bin
            path: build/esp-air-quality-monitor.bin

For those eager to dive in and start implementing DevOps into their own IoT development process, we’ve provided an example Github Actions workflow file on GitHub. Feel free to fork this repository and use it as a starting point for streamlining your own IoT firmware development process. Remember, the best way to learn is by doing. So, get your hands dirty, experiment, iterate, and innovate. If you ever need help or want to share your experiences, please reach out in our our community forum.

Golioth collects data from your entire IoT sensor fleet and makes it easy to access from the cloud. Data visualization is a common use case and we love using Grafana to make dashboards for our fleets. It can start to get tricky if your devices are sending back mountains of different data. We’ll demonstrate how you can use Golioth REST API Queries to capture specific data endpoints from devices. By doing so, you can focus on the most important data points and utilize them to create powerful visual dashboards in Grafana.

Don’t Mix Your Endpoints

By far the easiest way to query streaming sensor data is just to ask for all of the events. But this presents an interesting challenge in Grafana. If your query response includes records that don’t have your desired endpoint, Grafana will see that as null and most often this will break your graph:

Grafana error screenshot

Grafana showing the “Fields have different lengths” error

This is because the graph is targetting the sensor endpoint, but if we look at Golioth we can see that the device is sending data to the battery endpoint in between sensor readings. The sensor endpoint doesn’t exist for those readings.

Battery and sensor data being received on Golioth's LightDB Stream service

Let’s diagnose, and fix this issue using the Query Builder button in the upper right of the LightDB Stream page of the Golioth Console.

Using Query Builder to Filter Data

As it stands now, the query is asking for all data from all events by using the * wildcard operator:

Query Builder getting all endpoints

What we really want is just one endpoint that we can graph (along with the time and deviceId). So let’s change the query builder to target the current measurement on channel 0 of the incoming sensor data. I’ll also use an alias to make it easy to access the returned value.

Targeting the sensor.cur.ch0 endpoint

This is a good start. But if you look at the data, we still haven’t solved the problem:

Golioth LightDB endpoints sometimes null

The dashes are the null values returned when trying to query our sensor data when it was a battery endpoint that was received. The final part of the puzzle is to add a filter that checks that our desired endpoint is not null.

Filtering out null values on an endpoint Now Golioth will only send data that includes the value we want to graph:

Correctly filtered sensor value

Now that we are receiving exactly the data we need, we can copy the query from the Query Builder by clicking the JSON button and then the Copy button.

Using the Query in Grafana

We previously set up a REST API data source for Golioth in our Grafana Dashboard. Use the Body tab to enter your query using the new values we created with the Query Builder.

Grafana JSON query

In the Fields tab we’re using the alias that we specified. You could use the full endpoint but as your panels become more intricate you’ll thank yourself for using aliases!

Fields for graphing values in Grafana

For completeness, we’ll leave you with the verbose body object from this Grafana panel. You can see the parts we crafted in the Query Builder make up the query field.

{
  "start": "${__from:date}",
  "end": "${__to:date}",
  "perPage": 99999,
  "query": {
    "fields": [
      {
        "path": "time"
      },
      {
        "path": "deviceId"
      },
      {
        "path": "sensor.cur.ch0",
        "alias": "cur0"
      }
    ],
    "filters": [
      {
        "path": "sensor.cur.ch0",
        "op": "<>",
        "value": "null"
      }
    ]
  }
}

Give Golioth a Try Today!

The proof of concept for your IoT fleet can be up and running in days, not weeks if you use Golioth as your instant IoT cloud. Combined with visualization tools like Grafana, you’ll have no problem getting to “yes” with your customers (or your boss). Your first 50 devices are free with our Dev Tier, so give Golioth a try today!

When I started at Golioth, I wanted to understand the service offerings from different companies in the space. What resulted, was a chart (below), slideshow (below), and a presentation I have now given a couple different times (video, above). All of these things showcase how I started to investigate and understand the inter-operating pieces of any modern Internet of Things (IoT) system.

Only 12 Layers?

Let’s start with the obvious: I could have chosen just about any number of layers to explain IoT systems.

I chose to highlight the pieces of IoT systems as someone working on building Reference Designs, which have a Device component and a Cloud component. I framed many of these capabilities from the perspective of my time at Golioth and the aspects of the system that I found to be crucial to a deployment. I wanted to understand the competitive landscape as a starting point (finding other companies providing IoT platforms) and expanded from there.

I stopped at 12, because that is roughly the amount I thought I could describe without audience members falling asleep. 12 layers also captures a good amount of functionality. However, it misses many other pieces of IoT implementation and very easily could have been the 100 layers of IoT.

The Chart

The true beginning of this talk was a diagram I created. It shows how I understood offerings from different IoT providers, with varying levels of “vertical integration” (how much of the solution they provide to their users). Some providers go all the way from the hardware up to the Cloud. Some provide key infrastructure pieces (like Golioth) and point you at external providers in order to give you more flexibility. Some are hyper specialized in one area.

The key idea was to point out that these providers exist and that there are different reasons to choose one over another. That is to say: each has a valid business proposition and fits a customer profile. One provider might serve an operations group that is looking to add connectivity to an existing business and simply spit data out onto the internet—many businesses want that. Other times, a technology group inside a larger product development company might be looking to supplement their product design capabilities and not do everything in-house. It’s important to understand the landscape as a design engineer evaluating what they should buy and what they should DIY. Each layer in the presentation has the pro’s and con’s of “Buy vs DIY”.

Beyond IoT Platform Providers

Unstated and unshown on the chart is the “DIY” of IoT subsystems. As mentioned in the video and slides, it’s possible that companies want to develop everything from the ground up and maintain all of their own IP. The downside is the high cost in terms of people and time, and often the “full DIY” method of developing is relegated to the largest companies looking to develop an IoT product.

Others are utilizing building blocks from hyper-scalers like AWS and Azure. Sometimes this takes the form of using hyper scaler IoT platforms (AWS IoT, Azure IoT), and other times, they use even more basic computing elements in the cloud (EC2s, S3 buckets, Lambdas). In all these cases, there is a significant amount of “Cloud Software” that is written and maintained by the company looking to develop an IoT product.

How do the 12 Layers of IoT impact you?

These resources exist to help engineers and business people new to the IoT industry to understand how to create successful IoT deployments. Most importantly, this talk sought to remedy the problem of “you don’t know what you don’t know” (reference). If you don’t know about potential pitfalls in deploying an IoT solution and what you might need 2 or 3 years down the line, you won’t be able to take steps up front to prevent those problems.

A tangible example is in “layer 10”, which is listed as “deployment flexibility”. If you want to hire an IoT platform to get your deployment off the ground, but later will want to run your own cloud infrastructure, you need to choose different options when creating your system. Platforms like Golioth allow you to run your own cloud infrastructure as part of our Enterprise plan. Companies that don’t choose this path at the beginning of their deployment find themselves re-architecting their entire solution (all the way down to the hardware) at a later time in order to implement a bespoke cloud solution that fits their needs.

The Presentation

Below is a refined version of the talk that I gave at All Things Open in Raleigh NC in November 2022. Unfortunately that version of the talk wasn’t recorded, but the slides below are the most up-to-date version I have.

Did I miss a layer?

I am continually refining my concept of what comprises IoT deployments and the required pieces. It’s possible I missed out on something important. Maybe there are critical pieces of infrastructure that you think I glossed over. We’d love to hear your thoughts on our forum or on social media (Twitter, LinkedIn, Facebook).

The Golioth Zephyr SDK is now 100% easier to use. The new v0.4.0 release was tagged on Wednesday and delivers all of the Golioth features with less coding work needed to use them in your application. Highlights include:

  • Asynchronous function/callback declaration is greatly simplified
    • User-defined data can now be passed to callbacks
  • Synchronous versions of each function were added
  • API is now CoAP agnostic (reply handling happens transparently)
  • User code no longer needs to register an on_message() callback
  • Verified with the latest Zephyr v3.2.0 and NCS v2.1.0

The release brings with it many other improvements that make your applications better, even without knowing about them. For the curious, check out the changelog for full details.

Update your code for the new API (or not)

The new API includes some breaking changes and to deal with this you have two main options:

  1. Update legacy code to use the new API
  2. Lock your legacy code to a previous Golioth version

1. Updating legacy code

The Golioth Docs have been updated for the new API, and reading through the Firmware section will give you a great handle on how everything works. The source of truth continues to be the Golioth Zephyr SDK reference (Doxygen).

Updating to the new API is not difficult. I’ve just finished working on that for a number of our hardware demos, including the magtag-demo repository we use for our Developer Training sessions. The structure of your program will remain largely the same, with Golioth function names and parameters being the most noticeable change.

Consider the following code that uses the new API to perform an asynchronous get operation for LightDB State data:

/* The callback function */
static int counter_handler(struct golioth_req_rsp *rsp)
{
    if (rsp->err) {
        LOG_ERR("Failed to receive counter value: %d", rsp->err);
        return rsp->err;
    }

    LOG_HEXDUMP_INF(rsp->data, rsp->len, "Counter (async)");

    return 0;
}

/* Register the LightDB Get callback from somewhere in your code */
static int my_function(void)
{
    int err;
    err = golioth_lightdb_get_cb(client, "counter",
                                 GOLIOTH_CONTENT_FORMAT_APP_JSON,
                                 counter_handler, NULL);
}

Previously, the application code would have needed to allocate a coap_reply, pass it as a parameter in the get function call, use the on_message callback to process the reply, then unpack the payload in the reply callback before acting on it. All of that busy work is gone now!

With the move to the v0.4.0 API, we don’t need to worry about anything other than:

  • Registering the callback function
  • Working with the data (or an error message) when we hear back from Golioth.

You can see the response struct makes the data itself, the length of the data, and the error message available in a very straightforward way.

A keen eye already noticed the NULL as the final parameter. This is a void * type that lets you pass your user-defined data to the callback. Any value that’s 4-bytes or less can be passed directly, or you can pass a pointer to a struct packed with information. Just be sure to be mindful of the memory allocation lifespan of what you pass.

All of the asynchronous API function calls follow this same pattern for callbacks and requests. The synchronous calls are equally simple to understand. I found the Golioth sample applications to be a great blueprint for updating the API calls in my application code. The changelog also mentions each API-altering commit which you may find useful for additional migration info.

The Golioth Forum is a great place to ask questions and share your tips and tricks when getting to know the new syntax.

2. Locking older projects to an earlier Golioth

While we recommend updating your applications, if you do have the option to continue using an older version of Golioth instead. For that, we recommend using a west manifest to lock your project to a specific version of Golioth.

Manifest files specify the repository URL and tag/hash/branch that should be checked out. That version is used when running west update, which then imports a version of Zephyr and all supporting modules specified in the Golioth SDK manifest to be sure they can build the project in peace and harmony.

By adding a manifest to your project that references the Golioth Zephyr SDK v0.3.1 (the latest stable version before the API change) you can ensure your application will build in the future without the need to change your code. Please see our forum thread on using a west manifest to set up a “standalone” repository for more information.

A friendlier interface improves your Zephyr experience

Version 0.4.0 of the Golioth Zephyr SDK greatly improves the ease-of-use when adding device management features to your IoT applications. Device credentials and a few easy-to-use APIs are all it takes to build in data handling, command and control, and Over-the-Air updates into your device firmware. With Golioth’s Dev Tier your first 50 devices are free so you can start today, and as always, get in touch with us if you have any questions along the way!

Controlling 10 devices is easy, controlling 10,000 is a different story. The trick is to plan for scale, which is what we specialize in here at Golioth.

A few weeks ago we announced the Golioth Device Settings Service that enables you to change settings for your entire fleet at the click of a button. Of course, you can drill down to groups of devices and single devices too. While the announcement post showed the Settings Service using our ESP-IDF SDK, today we’ll look at this setting service from the device point of view with Zephyr RTOS.

Device-side building blocks

The good news is that the Golioth Zephyr SDK has done all the hard work for you. We just need to do three things to start using the settings service in any Zephyr firmware project:

  1. Enable the settings service in KConfig
  2. Register a callback function when the device connects to Golioth
  3. Validate the data and do something useful when a new setting arrives

Code walk-through

Golioth’s settings sample code is a great starting point. It watches for a settings key named LOOP_DELAY_S to remotely adjust the delay between sending messages back to the server. Let’s walk through the important parts of that code for a better understanding of what is involved.

1. Enable settings in KConfig

CONFIG_GOLIOTH_SETTINGS=y

To turn the settings service on, the CONFIG_GOLIOTH_SETTINGS symbol needs to be selected. Add the code above to your prj.conf file.

2. Register a callback for settings updates

static void golioth_on_connect(struct golioth_client *client)
{
    if (IS_ENABLED(CONFIG_GOLIOTH_SETTINGS)) {
        int err = golioth_settings_register_callback(client, on_setting);

        if (err) {
            LOG_ERR("Failed to register settings callback: %d", err);
        }
    }
}

To do anything useful with the settings service, we need to be able to execute a function when new settings are received. Register the callback function each time the device connects to Golioth. Your app may already have a golioth_on_connect() function declared, look for the following code at the beginning of main and add it if it’s not already there.

	client->on_connect = golioth_on_connect;
	client->on_message = golioth_on_message;
	golioth_system_client_start();

Each time the Golioth Client detects a new connection, the golioth_on_connect() function will run, which in turn registers your on_setting() function to run.

3. Validate and process the received settings

enum golioth_settings_status on_setting(
        const char *key,
        const struct golioth_settings_value *value)
{
    LOG_DBG("Received setting: key = %s, type = %d", key, value->type);
    if (strcmp(key, "LOOP_DELAY_S") == 0) {
        /* This setting is expected to be numeric, return an error if it's not */
        if (value->type != GOLIOTH_SETTINGS_VALUE_TYPE_INT64) {
            return GOLIOTH_SETTINGS_VALUE_FORMAT_NOT_VALID;
        }

        /* This setting must be in range [1, 100], return an error if it's not */
        if (value->i64 < 1 || value->i64 > 100) {
            return GOLIOTH_SETTINGS_VALUE_OUTSIDE_RANGE;
        }

        /* Setting has passed all checks, so apply it to the loop delay */
        _loop_delay_s = (int32_t)value->i64;
        LOG_INF("Set loop delay to %d seconds", _loop_delay_s);

        return GOLIOTH_SETTINGS_SUCCESS;
    }

    /* If the setting is not recognized, we should return an error */
    return GOLIOTH_SETTINGS_KEY_NOT_RECOGNIZED;
}

The callback function will provide a key (the name of the settings) that is a string type, and a golioth_settings_value that includes the value, as well as an indicate of the value type (bool, float, string, or int64). The callback function should validate that the expected key was received, and test that the received value is the proper type and within your expected bounds before acting upon the data.

The final piece of the device settings puzzle is to return an enum value indicating the device status after receiving a new setting. This is used by the Golioth Cloud to indicate sync status. Our example demonstrates the GOLIOTH_SETTINGS_SUCCESS and GOLIOTH_SETTINGS_KEY_NOT_RECOGNIZED status values, but it’s worth checking out the doxygen reference for golioth_settings_status to see how different invalid key and value messages can be sent to Golioth for display in the web console.

Coordinating with the Golioth Cloud

The device settings key format is tightly specified by Golioth, allowing only “A-Z (upper case), 0-9 and underscore (_) characters”. We recommend that you add a key to the Golioth web console while writing your firmware to ensure the device knows what to expect.

Each device page includes a status indicator to show if the settings are in sync with the Golioth cloud.

Additional information is available when a device is out of sync. Here you can see that hovering over the information icon indicates the key is not valid. For this example I created a VERY_IMPORTANT_BOOL_7 but didn’t add support for that to the firmware. This is the result of our callback function returning GOLIOTH_SETTINGS_KEY_NOT_RECOGNIZED. In cases like this, Golioth’s Over-the-Air firmwmare update (OTA) service can help you to push a new firmware to your devices in the field that will recognize the newly created Settings variable.

IoT Fleet control on day one

The pain of IoT often comes when you need to scale. Golioth’s mission is to make sure you avoid that pain and are ready to grow starting from day one.

Our device settings service gives you the power to remotely change settings on your entire fleet, on a group of devices, and all the way down to a single device. You are able to verify your devices are in sync, and receive helpful information when they are not. Take Golioth for a test drive today, our dev tier is free for the first 50 devices.

We use Grafana at Golioth to help us showcase data for internal and external demos. As in any IoT deployment, the hardware does not exist just in the physical realm; it also sends data back to the internet and we need some way to show that data. Grafana is an open source solution that can run locally on your computer, on your own cloud, or on the Grafana Cloud. We have written about our past use of Grafana, including our custom WebSockets plugin that makes it easier to see changes to your data as it happens.

When you’re ready to highlight changes from a device, graphics can make it easier to understand what’s happening. Let’s take a look at how to add dynamic graphics to your Grafana Dashboard.

Why you should use graphics

These graphics convey immediate understanding for magnet sensor state

Humans are a visual species. Our eyes naturally track movement in nature, but that also translates to computer screens. Adding some dynamic graphics will not only grab attention, but allow users to quickly assess if there are problems. A quick scan over a graphical interface allows you to take fast action on critical changes to your deployment.

In the “Red Demo” video below, you can see the device detecting a change in proximity sensor state, which could indicate something like a break-in at a home or business. A dedicated graphic on the dashboard clearly indicates this change to the user.

Another reason to look at simplified graphics is the changing nature of your IoT Fleet Deployment. As you continue to add people to a project, the knowledge around the technical details often diffuses; the people that designed and deployed the firmware and hardware aren’t always the ones maintaining the deployment. You want to design a dashboard that anyone can walk up to and understand the current state of the device, regardless of their technical background. Most importantly, you want to indicate whether any action needs to be taken. This could include adding callouts to your graphics or other indicators that the device is nominal or in an error state. It’s much beyond this article, but there are books on proper design of Human Machine Interfaces and case studies of how they have gone wrong.

Finally, let’s compare graphics to the common alternative in dashboards: graphs. While it’s useful to see data changing over time, a new user won’t have context for the changes. If they walk up to a dashboard and glance at a moving chart of temperature, there’s no way to understand why it’s being shown and required actions (“the chart just went up, was that good or bad?”). Even if you start with stateful based indicators like dials and readouts, these can be enhanced by adding contextual information directly to the graphics.

How to add dynamic graphics

First, you will need a Grafana instance, configured to access the datasource of your Golioth deployment. We have found it makes the most sense to highlight the current state of the device by using a WebSockets source, instead of polling the REST API on a regular interval. Using WebSockets will ensure you are immediately notified of the changes as they propagate through your device deployment.

Plugin

To start, you need to install the “Dynamic Image Panel” Plugin from the Grafana plugin marketplace. This is a signed plugin, and should be safe for all installs.  In the example, they showcase a weather application, talking to an API that fetches different weather condition graphics. We will be creating a slightly different implementation, based on hosted images on a cloud server without an API. This is not as robust generally, but is much simpler as an implementation.

Installing the plugin from the Grafana marketplace (available locally or on the cloud)

Hosted Graphics

Next you need to put the associated graphics somewhere that the Grafana instance will be able to access them. In our case, we used Firebase hosting to create a static file location. This can be any service that allows public access to the images, but we do caution you to test that you can access the images when you’re not logged in, or your dashboard may not be able to pull in those images you created.

Assigning graphics to device states

We created the “Red Demo” graphics to correspond to proximity sensors based on sensing a magnetic field. The device is reporting the state of that mag sensor over LightDB Stream, along with temperature and accelerometer data on a regular basis (or on-demand when the button is pushed, as shown in the demo video inset above). The logic is reverse for this sensor (‘0’ means there is a magnet present, ‘1’ means it is not), which we maintained throughout the system. We then created graphics that append that reading to a “base” address to access each individual image. When there is no reading available, it defaults to the NULL case, which means nothing is appended to the “base” address, and we show that it is in an unknown state (far left image).On the plugin, you can customize all of this and make it match your needs. Instead of ‘1’ and ‘0’, a device might be sending linear values from 0-10, or it might be sending “true” or “false” (boolean), or even sending back general strings. The main thing to consider when designing your graphics is that you want your device and the Golioth Cloud to categorize physical readings into “buckets” that are easy to display.

Creating “movement” in graphics

Sometimes you’ll want to make changes in state a bit more subtle, especially if your device is reporting more than a binary state (‘on’/’off’). In Golioth’s recently released “Green Demo”, you’ll see there are 6 states for a lamp used in our Greenhouse example. The graphics change along with the intensity of the light coming out of the lamp, while also overlaying a numeric value.

Enhance your next dashboard

Using graphics in your next IoT dashboard can help you add interesting informational content, alert your users, or just spice up what is normally a set of boring charts. If you want help getting your next dashboard started, reach out on our forum, our discord, or send us a note.

This is the second part of How Golioth uses Hardware-in-the-Loop (HIL) Testing. In the first part, I covered what HIL testing is, why we use it at Golioth, and provided a high-level view of the system.

In this post, I’ll talk about each step we took to set up our HIL. It’s my belief that HIL testing should be a part of any firmware testing strategy, and my goal with this post is to show you that it’s not so difficult to get a minimum viable HIL set up within one week.

The Big Pieces

To recap, this is the system block diagram presented at the end of part 1:

The Raspberry Pi acts as a GitHub Actions Self-Hosted Runner which communicates with microcontroller devices via USB-to-serial adapters.

The major software pieces required are:

  • Firmware with Built-In-Test Capabilities – Runs tests on the device (e.g. ESP32-S3-DevKitC), and reports pass/fail for each test over serial.
  • Python Test Runner Script – Initiates tests and checks serial output from device. Runs on the self-hosted runner (e.g. Raspberry Pi).
  • GitHub Actions Workflow – Initiates the python test runner and reports results to GitHub. Runs on the self-hosted runner.

Firmware Built-In-Test

To test our SDK, we use a special build of firmware which exposes commands over a shell/CLI interface. There is a command named built_in_test which runs a suite of tests in firmware:

esp32> built_in_test
...
12 Tests 0 Failures 0 Ignored 
../main/app_main.c:371:test_connects_to_wifi:PASS
../main/app_main.c:378:test_golioth_client_create:PASS
../main/app_main.c:379:test_connects_to_golioth:PASS
../main/app_main.c:381:test_lightdb_set_get_sync:PASS
../main/app_main.c:382:test_lightdb_set_get_async:PASS
../main/app_main.c:383:test_lightdb_observation:PASS
../main/app_main.c:384:test_golioth_client_heap_usage:PASS
../main/app_main.c:385:test_request_dropped_if_client_not_running:PASS
../main/app_main.c:386:test_lightdb_error_if_path_not_found:PASS
../main/app_main.c:387:test_request_timeout_if_packets_dropped:PASS
../main/app_main.c:388:test_client_task_stack_min_remaining:PASS
../main/app_main.c:389:test_client_destroy_and_no_memory_leaks:PASS

These tests cover major functionality of the SDK, including connecting to Golioth servers, and sending/receiving CoAP messages. There are also a couple of tests to verify that stack and heap usage are within allowed limits, and that there are no detected memory leaks.

Here’s an example of one of the tests that runs on the device, test_connects_to_golioth:

#include "unity.h"
#include "golioth.h"

static golioth_client_t _client;
static SemaphoreHandle_t _connected_sem;

static void on_client_event(golioth_client_t client, 
                            golioth_client_event_t event, 
                            void* arg) {
    if (event == GOLIOTH_CLIENT_EVENT_CONNECTED) {
        xSemaphoreGive(_connected_sem);
    }
}

static void test_connects_to_golioth(void) {
    TEST_ASSERT_NOT_NULL(_client);
    TEST_ASSERT_EQUAL(GOLIOTH_OK, golioth_client_start(_client));
    TEST_ASSERT_EQUAL(pdTRUE, xSemaphoreTake(_connected_sem, 10000 / portTICK_PERIOD_MS));
}

This test attempts to connect to Golioth servers. If it connects, a semaphore is given which the test waits on for up to 10 seconds (or else the test fails). For such a simple test, it covers a large portion of our SDK, including the network stack (WiFi/IP/UDP), security (DTLS), and CoAP Rx/Tx.

We use the Unity test framework for basic test execution and assertion macros.

If you’re curious about the details, you can see the full test code here: https://github.com/golioth/golioth-firmware-sdk/blob/main/examples/esp_idf/test/main/app_main.c

Python Test Runner

Another key piece of software is a Python script that runs on the self-hosted runner and communicates with the device over serial. We call this script verify.py. In this script, we provision the device with credentials, issue the built_in_test command, and verify the serial output to make sure all tests pass.

Here’s the main function of the script:

def main():
    port = sys.argv[1]

    # Connect to the device over serial and use the shell CLI to interact and run tests
    ser = serial.Serial(port, 115200, timeout=1)

    # Set WiFi and Golioth credentials over device shell CLI
    set_credentials(ser)
    reset(ser)

    # Run built in tests on the device and check output
    num_test_failures = run_built_in_tests(ser)
    reset(ser)

    if num_test_failures == 0:
        run_ota_test(ser)

    sys.exit(num_test_failures)

This opens the serial port and calls various helper functions to set credentials, reset the board, run built-in-tests, and run an OTA test. The exit code is the number of failed tests, so 0 means all passed.

The script spends most of its time in this helper function which looks for a string in a line of serial text:

def wait_for_str_in_line(ser, str, timeout_s=20, log=True):
    start_time = time()
    while True:
        line = ser.readline().decode('utf-8', errors='replace').replace("\r\n", "")
        if line != "" and log:
            print(line)
        if "CPU halted" in line:
            raise RuntimeError(line)
        if time() - start_time > timeout_s:
            raise RuntimeError('Timeout')
        if str in line:
            return line

The script will continually read lines from serial for up to timeout_s seconds until it finds what it’s looking for. And if “CPU halted” ever appears in the output, that means the board crashed, so a runtime error is raised.

This verify.py script is what gets invoked by the GitHub Actions Workflow file.

GitHub Actions Workflow

The workflow file is a YAML file that defines what will be executed by GitHub Actions. If one of the steps in the workflow returns a non-zero exit code, then the workflow will fail, which results in a red ❌ in CI.

Our workflow is divided into two jobs:

  • Build the firmware and upload build artifacts. This is executed on cloud servers, since building the firmware on a Raspberry Pi would take a prohibitively long time (adding precious minutes to CI time).
  • Download firmware build, flash the device, and run verify.py. This is executed on the self-hosted runner (my Raspberry Pi).

The workflow file is created in .github/workflows/test_esp32s3.yml. Here are the important parts of the file:

name: Test on Hardware

on:
  push:
    branches: [ main ]
  pull_request:

jobs:
  build_for_hw_test:
    runs-on: ubuntu-latest
    steps:
    ...
    - name: Build test project
      uses: espressif/esp-idf-ci-action@v1
      with:
        esp_idf_version: v4.4.1
        target: esp32s3
        path: 'examples/test'
    ...

  hw_flash_and_test:
    needs: build_for_hw_test
    runs-on: [self-hosted, has_esp32s3]
    ...
    - name: Flash and Verify Serial Output
      run: |
        cd examples/test
        python flash.py $CI_ESP32_PORT && python verify.py $CI_ESP32_PORT

The first job runs in a normal GitHub runner based on Ubuntu. For building the test firmware, Espressif provides a handy CI Action to do the heavy lifting.

The second job has a needs: build_for_hw_test requirement, meaning it must wait until the first job completes (i.e. firmware is built) before it can run.

The second job has two further requirements – it must run on self-hosted and there must be a label has_esp32s3. GitHub will dispatch the job to the first available self-hosted runner that meets the requirements.

If the workflow completes all jobs, and all steps return non-zero exit code, then everything has passed, and you get the green CI checkmark ✅.

Self-Hosted Runner Setup

With all of the major software pieces in place (FW test framework, Python runner script, and Actions workflow), the next step is to set up a machine as a self-hosted runner. Any machine capable of running one of the big 3 operating systems will do (Linux, Windows, MacOS).

It helps if the machine is always on and connected to the Internet, though that is not a requirement. GitHub will simply not dispatch jobs to any self-hosted runners that are not online.

Install System Dependencies

There are a few system dependencies that need to be installed first. On my Raspberry Pi, I had to install these:

sudo apt install \
    python3 python3-pip python3-dev python-dev \
    libffi-dev cmake direnv wget gcovr clang-format \
    libssl-dev git
pip3 install pyserial

Next, log into GitHub and create a new self-hosted runner for your repo (Settings -> Actions -> Runners -> New self-hosted runner):

When you add a new self-hosted runner, GitHub will provide some very simple instructions to install the Actions service on your machine:

When I ran the config.sh line, I added --name and --unattended arguments:

./config.sh --url <https://github.com/golioth/golioth-firmware-sdk> --token XXXXXXXXXXXXXXXXXXXXXXXXXXXXX --name nicks-raspberry-pi --unattended

Next, install the service, which ensures GitHub can communicate with the self-hosted runner even if it reboots:

cd actions-runner
sudo ./svc.sh install

If there are any special environment variables needed when executing workflows, these can be added to runsvc.sh. For instance, I defined a variable for the serial port of the ESP32:

# insert anything to setup env when running as a service
export CI_ESP32_PORT=/dev/ttyUSB0

Finally, you can start the service:

sudo ./svc.sh start

And now you should see your self-hosted runner as “Idle” in GitHub Actions, meaning it is ready to accept jobs:

I added two custom labels to my runner: has_esp32s3 and has_nrf52840dk, to indicate that I have those two dev boards attached to my self-hosted runner. The workflow file uses these labels so that jobs get dispatched to self-hosted runners only if they have the required board.

And that’s it! With those steps, a basic HIL is up and running, integrated into CI, and covering a large amount of firmware code on real hardware.

Security

There’s an important note from GitHub regarding security of self-hosted runners:

Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

I also stand by that recommendation. But it’s not always feasible if you’re maintaining an open-source project. Our SDKs are public, so this warning certainly applies to us.

If you’re not careful, you could give everyone on the Internet root access to your machine! It wouldn’t be very hard for someone to submit a pull request that modifies the workflow file to run whatever arbitrary Linux commands they want.

We mitigate most of the security risks with three actions.

1. Require approval to run workflows for all outside contributors.

This means someone from the Golioth organization must click a button to run the workflow. This allows the repo maintainer time to review any workflow changes before allowing the workflow to run.

This option can be enabled in Settings -> Actions -> General:

Once enabled, pull requests will have a new “Approve and Run” button:

This is not ideal if you’re running a large open-source project with lots of outside contributors, but it is a reasonable compromise to make during early stages of a project.

2. Avoid logging sensitive information

The log output of any GitHub Actions runner is visible publicly. If there is any sensitive information printed out in the workflow steps or from the serial output of the device, then this information could easily fall into the wrong hands.

For our purposes, we disabled logging WiFi and Golioth credentials.

3. Utilize GitHub Actions Secrets

You can use GitHub Actions Secrets to store sensitive information that is required by the runner. In our repo, we define a secret named `GOLIOTH_API_KEY` that allows us access to the Golioth API, which is used to deploy OTA images of built firmware during the workflow run.

If you want to know more about self-hosted runner security, these resources were helpful to me:

Tips and Tricks

Here are a few extra ideas to utilize the full power of GitHub Actions Self-Hosted Runners (SHRs):

  • Use the same self-hosted runner in multiple repos. You can define SHRs at the organization level and allow their use in more than one repo. We do this with our SHRs so they can be used to run tests in the Golioth Zephyr SDK repo and the Golioth ESP-IDF SDK repo.
  • Use labels to control workflow dispatch. Custom labels can be applied to a runner to control when and how a workflow gets dispatched. As mentioned previously, we use these to enforce that the SHR has the dev board required by the test. We have a label for each dev board, and if the admin of the SHR machine has those dev boards attached via USB, then they’d add a label for each one. This means that machine will only receive workflows that are compatible with that machine. If a board is acting up or requires maintenance, the admin can simply disconnect the board and remove the label to prevent further workflows from executing.
  • Distributed self-hosted runners to load balance and minimize CI bottlenecks. If there is only one self-hosted runner, it can become a bottleneck if there are lots of jobs to run. Also, if that runner loses Internet, then CI is blocked indefinitely. To balance the load and create more robust HIL infrastructure, it’s recommended to utilize multiple self-hosted runners. Convince your co-workers to take their unused Intel NUC off the shelf and follow the steps in the “Self-Hosted Runner Setup” section to add another runner to the pool. This allows for HIL tests to run in a distributed way, no matter where in the world you are.