NAT is the Enemy of Low Power Devices

If you have ever tried communicating with a device on a private network, you may have encountered Network Address Translation (NAT). Fundamentally, when one device needs to send data to another device, it needs to know how to address it. On IP-based networks, devices are addressed using an IP address. Unfortunately, the number of connected devices has long outpaced the number of unique addresses in the IPv4 address space. Because of this, public IP addresses have to be shared between devices, which causes a few problems.

How to Share an IP Address

You probably accessed this blog post from a machine that does not have a public IP address. Rather, it has been assigned a private IP address on a network, perhaps via the Dynamic Host Configuration Protocol (DHCP), and it talks to a router that is responsible for sending data to and from the device. To access this post, your device first had to use the Domain Name System (DNS) to resolve blog.golioth.io to a public IP address, then had to send a request to that IP address for the content of this page.

NAT - Device to Cloud

When that request arrives at a router or some other intermediary, it knows where to deliver the request because the IP address of the server hosting blog.golioth.io is specified. It forwards the request along, and the server responds with the requested content. However, the server does not know that your device sent the request. The router has replaced the private IP address and port from your device with its own public IP address and port, and it has made an entry in a translation table noting that incoming data for that port should be directed to your device. The server sends the content back to the router, which replaces its own public IP address and port with your device’s IP address and port, then forwards it along. The content arrives at your device, appearing as though the server sent it directly to you. Meanwhile, the router is doing the same song and dance for many other devices, maintaining all of the mappings from its own IP address and ports to internal IP addresses and ports. This is known as Network Address Translation (NAT).

NAT - Cloud to Device
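The translation table the router maintains can be sketched as a small lookup structure. This is an illustrative model in plain C, not real router code (real NAT implementations also track the protocol, remote endpoint, and timestamps): outgoing packets create mappings, and incoming packets are delivered only if a mapping exists.

```c
#include <assert.h>
#include <stdint.h>

struct nat_entry {
    uint32_t private_ip;
    uint16_t private_port;
    uint16_t public_port;   /* port the router uses on its public IP */
    int in_use;
};

#define NAT_TABLE_SIZE 16
static struct nat_entry nat_table[NAT_TABLE_SIZE];

/* Outgoing packet: create (or reuse) a mapping, return the public port. */
static int nat_outgoing(uint32_t private_ip, uint16_t private_port)
{
    for (int i = 0; i < NAT_TABLE_SIZE; i++) {
        struct nat_entry *e = &nat_table[i];
        if (e->in_use && e->private_ip == private_ip &&
            e->private_port == private_port) {
            return e->public_port;
        }
    }
    for (int i = 0; i < NAT_TABLE_SIZE; i++) {
        struct nat_entry *e = &nat_table[i];
        if (!e->in_use) {
            e->in_use = 1;
            e->private_ip = private_ip;
            e->private_port = private_port;
            e->public_port = (uint16_t)(40000 + i); /* arbitrary ephemeral port */
            return e->public_port;
        }
    }
    return -1; /* table full */
}

/* Incoming packet: look up which device the public port maps to. */
static int nat_incoming(uint16_t public_port,
                        uint32_t *private_ip, uint16_t *private_port)
{
    for (int i = 0; i < NAT_TABLE_SIZE; i++) {
        struct nat_entry *e = &nat_table[i];
        if (e->in_use && e->public_port == public_port) {
            *private_ip = e->private_ip;
            *private_port = e->private_port;
            return 0;
        }
    }
    return -1; /* no mapping: the packet is dropped */
}
```

Note that `nat_incoming()` can only succeed after `nat_outgoing()` has run for that device, which is exactly the cloud-to-device problem discussed next.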

What Could Go Wrong?

This works great in simple request-response scenarios like fetching a blog post from a server with a public IP address. However, what if the server wants to say something to the device before the device talks to it? The server may know the public IP address of the router, but the router has no way of knowing which device the message is actually intended for. There is no entry in the NAT translation table until an outgoing message creates one. This also becomes a problem in peer-to-peer scenarios, where both devices are on a private network, making it such that neither device can talk to the other (this is solved using a public rendezvous point, such as a STUN server, but that’s a story for another post).

NAT - Cloud to Device, Cloud Initiates

Another problem is that routers don’t want to maintain mappings forever. At some point, if no outgoing messages have been observed, the entry will be removed from the translation table and any subsequent incoming traffic will be dropped. In many cases, this timeout is quite aggressive (e.g. 5 minutes or less). Typically this is resolved by sending “keep alive” messages, ensuring that entries are not removed and data can flow freely in both directions. For your laptop or a server in a data center, that might work fine for the most part. For highly constrained devices, it can quickly drain the battery or consume precious limited bandwidth.

NAT - Cloud to Device Timeout

Maybe you decide that it’s okay for incoming traffic to be dropped after some period of time, as long as you are able to re-establish a mapping and fetch any data you need the next time you contact the server. Unfortunately, there is no guarantee that the router, or any other layer in the hierarchy of intermediaries performing NAT (it’s actually much more complicated, with Carrier-Grade NAT adding even more translation steps), will assign you the same public IP address and port. Therefore, when you try to continue talking to the server over a previously established session, it will not recognize you. This means you’ll have to re-establish the session, which typically involves expensive cryptographic operations and sending a handful of messages back and forth before actually delivering the data you were interested in sending originally.

NAT - Device to Cloud, New Session

The worst case scenario is that your device needs to send data somewhat frequently, but not frequently enough that NAT mappings are maintained. For example, if a device needs to send a tiny sensor reading every 30 minutes, and the NAT timeout is 5 minutes, it will either need to send a keep alive message every 5 minutes (that’s 5x the messages you actually need to send!), or it will need to re-establish the session every time it delivers a reading. In both cases, you are going to be using much more power than if you were just able to send your sensor reading alone every 30 minutes.
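To put numbers on that example: with a 5 minute NAT timeout, *something* has to leave the radio every 5 minutes, so the device sends six packets (five keep-alives plus the reading) for every reading it actually cares about. A quick sanity check in plain C:

```c
#include <assert.h>

/* Packets per day if one packet must be sent every `interval_s` seconds. */
static int packets_per_day(int interval_s)
{
    return (24 * 60 * 60) / interval_s;
}
```

A reading every 30 minutes is 48 packets per day; a packet every 5 minutes to hold the NAT mapping open is 288 per day, a 6x increase in radio activity for the same amount of useful data.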

Solving the Problem

Unfortunately, the distributed nature of the internet means that we aren’t going to be able to address the issue by nicely asking carriers and ISPs to extend their NAT timeouts. However, we can make it such that being issued a new IP address and port doesn’t force us to re-establish a session.

More than a year ago, we announced support for DTLS 1.2 Connection IDs. DTLS provides a secure transport over UDP, which many devices, especially those that are power constrained, use to communicate with Golioth’s CoAP device APIs. Typically, DTLS sessions are established based on a “five tuple”: source address, source port, transport protocol, destination address, destination port. If any of these change, a handshake must be performed to establish a new session. To mitigate this overhead, a Connection ID can be negotiated during the initial handshake, and can be used in subsequent records to continue to associate messages even after changes in source IP or port.

NAT - DTLS Connection ID
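The difference between five-tuple and Connection ID session lookup can be sketched in a few lines of C. Types and names here are illustrative, not the DTLS wire format: the point is that a five-tuple match breaks as soon as NAT rebinds the source address or port, while a CID match does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct five_tuple {
    uint32_t src_ip;
    uint16_t src_port;
    uint8_t  protocol;   /* e.g. 17 for UDP */
    uint32_t dst_ip;
    uint16_t dst_port;
};

struct session {
    struct five_tuple tuple;
    uint8_t cid[8];      /* negotiated during the initial handshake */
    size_t cid_len;
};

/* Five-tuple lookup: any change in source IP or port is a miss. */
static int session_matches_tuple(const struct session *s,
                                 const struct five_tuple *t)
{
    return s->tuple.src_ip == t->src_ip && s->tuple.src_port == t->src_port &&
           s->tuple.protocol == t->protocol && s->tuple.dst_ip == t->dst_ip &&
           s->tuple.dst_port == t->dst_port;
}

/* CID lookup: the record itself carries the ID, so the source address
 * no longer matters. */
static int session_matches_cid(const struct session *s,
                               const uint8_t *cid, size_t cid_len)
{
    return s->cid_len == cid_len && memcmp(s->cid, cid, cid_len) == 0;
}
```

After a NAT rebind, `session_matches_tuple()` fails and would force a new handshake; `session_matches_cid()` still finds the session.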

Going back to our previous example of a device that sends a single sensor reading every 30 minutes, enabling Connection ID means that a new handshake does not have to be performed after NAT timeout: that single message can be sent, and then the device can go back to sleep. In fact, depending on how long the server is willing to store connection state, the device could sleep for much longer, sending once a day or even less frequently. This doesn’t solve the issue of cloud to device traffic being dropped after NAT timeout (check back for another post on that topic), but for many low power use cases, being able to sleep for an extended period of time is more important than being able to immediately push data to devices.

Configuring the Golioth Firmware SDK for Sleepy Devices

By default, the Golioth Firmware SDK will send keep alive messages to ensure that an entry is preserved in the NAT translation table. However, this functionality can be disabled by setting CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S to 0, or effectively disabled by setting it to a value larger than your sending interval.

CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=0

If using Zephyr, we’ll also need to set the receive timeout to a value greater than the interval at which we will be sending data. Otherwise, the client will attempt to reconnect after 30 seconds by default if it has not received any messages. In this example we’ll send data every 130 seconds, so setting the receive timeout to 200 seconds ensures that we won’t attempt to reconnect between sends.

CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=200

To demonstrate the impact of NAT timeouts, we’ll initially build the hello example without enabling Connection IDs. To ensure that we wait long enough for a NAT timeout, we need to update the loop to send every 130 seconds instead of every 5 seconds.

This example is using a Hologram SIM and connecting via the AT&T network. NAT timeouts may vary from one carrier to another. AT&T currently documents UDP inactivity timeouts as 30 seconds.

while (true)
{
    LOG_INF("Sending hello! %d", counter);

    ++counter;
    k_sleep(K_SECONDS(130));
}

Building and flashing the hello sample on a Nordic Thingy91 results in the following behavior.

*** Booting nRF Connect SDK v2.7.0-5cb85570ca43 ***
*** Using Zephyr OS v3.6.99-100befc70c74 ***
[00:00:00.506,378] <dbg> hello_zephyr: main: start hello sample
[00:00:00.506,378] <inf> golioth_samples: Bringing up network interface
[00:00:00.506,408] <inf> golioth_samples: Waiting to obtain IP address
[00:00:13.236,877] <inf> lte_monitor: Network: Searching
[00:00:17.593,994] <inf> lte_monitor: Network: Registered (roaming)
[00:00:17.594,696] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:00:18.839,904] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:18.840,118] <inf> hello_zephyr: Sending hello! 0
[00:00:18.840,179] <inf> hello_zephyr: Golioth client connected
[00:00:18.840,270] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:02:28.840,209] <inf> hello_zephyr: Sending hello! 1
[00:02:32.194,396] <wrn> golioth_coap_client: 1 resends in last 10 seconds
[00:02:46.252,868] <wrn> golioth_coap_client: 4 resends in last 10 seconds
[00:03:03.419,219] <wrn> golioth_coap_client: 3 resends in last 10 seconds
[00:03:04.986,389] <wrn> golioth_coap_client: Packet 0x2001e848 (reply 0x2001e890) was not replied to
[00:03:06.045,715] <wrn> golioth_coap_client: Packet 0x2001e638 (reply 0x2001e680) was not replied to
[00:03:15.213,592] <wrn> golioth_coap_client: 6 resends in last 10 seconds
[00:03:21.874,298] <wrn> golioth_coap_client: Packet 0x2001ec90 (reply 0x2001ecd8) was not replied to
[00:03:25.419,921] <wrn> golioth_coap_client: 5 resends in last 10 seconds
[00:03:36.565,765] <wrn> golioth_coap_client: 5 resends in last 10 seconds
[00:03:40.356,933] <wrn> golioth_coap_client_zephyr: Receive timeout
[00:03:40.356,964] <inf> golioth_coap_client_zephyr: Ending session
[00:03:40.356,994] <inf> hello_zephyr: Golioth client disconnected
[00:03:47.035,675] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:03:47.035,705] <inf> hello_zephyr: Golioth client connected
[00:03:47.035,827] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop

After initially connecting and successfully sending Sending hello! 0, we are inactive for 130 seconds (00:18 to 02:28). When we then attempt to send Sending hello! 1, the server never responds, eventually causing us to reach the Receive timeout and reconnect. This is because by the time we send Sending hello! 1, our entry has been removed from the NAT translation table, and when we are assigned a new public IP address and port, the server is unable to associate our messages with the existing DTLS session.

Because using Connection IDs involves sending extra data in every message, they are disabled in the Golioth Firmware SDK by default. In scenarios such as this one, where avoiding frequent handshakes clearly outweighs the cost of a few extra bytes per message, Connection IDs can be enabled with CONFIG_GOLIOTH_USE_CONNECTION_ID.

CONFIG_GOLIOTH_USE_CONNECTION_ID=y
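Pulling the options from this post together, a minimal sleepy-device configuration fragment might look like the following. The 200 second receive timeout assumes the 130 second send interval used in this example; tune both to your own reporting period.

```kconfig
# Disable CoAP keep-alives; rely on DTLS 1.2 Connection IDs instead
CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=0
CONFIG_GOLIOTH_USE_CONNECTION_ID=y
# Zephyr only: receive timeout must exceed the send interval (130 s here)
CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=200
```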

Now when we build and flash the hello example on a Thingy91, we can see our 130 second delay, but then the successful delivery of Sending hello! 1. 130 seconds later, we see another successful delivery of Sending hello! 2.

*** Booting nRF Connect SDK v2.7.0-5cb85570ca43 ***
*** Using Zephyr OS v3.6.99-100befc70c74 ***
[00:00:00.508,636] <dbg> hello_zephyr: main: start hello sample
[00:00:00.508,666] <inf> golioth_samples: Bringing up network interface
[00:00:00.508,666] <inf> golioth_samples: Waiting to obtain IP address
[00:00:13.220,001] <inf> lte_monitor: Network: Searching
[00:00:16.318,908] <inf> lte_monitor: Network: Registered (roaming)
[00:00:16.319,641] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:00:21.435,180] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:21.435,394] <inf> hello_zephyr: Sending hello! 0
[00:00:21.435,424] <inf> hello_zephyr: Golioth client connected
[00:00:21.435,546] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:02:31.435,455] <inf> hello_zephyr: Sending hello! 1
[00:04:41.435,546] <inf> hello_zephyr: Sending hello! 2

Next Steps

To see how often your devices are being forced to reconnect to Golioth after periods of inactivity, check out our documentation on device connectivity metrics. Devices that effectively maintain long-lasting connections will see a significant difference between their Session Established and Last Report timestamps. If you have any questions about optimizing your devices for low power, reach out to us on the forum!

Zephyr has all of the bells and whistles. Your project only needs a handful of them. But which handful? To be fair, you can build with every possible module in your local tree and only the necessary bits will be pulled in. But wouldn’t it be nice to know exactly which modules need to be added to a manifest allow list? Answer that question and your users won’t be stuck cloning tons of unnecessary files. That could save time on each build, which really adds up over the course of a project’s life.

The west meta-tool used by Zephyr includes a package management system based on manifest files, often called west.yml. Part of the power of this system is that manifest files may inherit other manifest files. The downside to this is that you may be cloning a large number of packages your project will never use. Limit this by using an allow-list in your manifest. But what packages do you need to add to your allow list?

There is no answer to this question

Let’s be up-front about this: there is no definitive answer to this question.

Your project needs to allow all of the modules it uses. Sometimes that means modules that are enabled for some builds and disabled for others. For instance, the Golioth Firmware SDK includes example apps that will build for Espressif, Nordic, and NXP processors. Each has its own HAL, but only one of them is used in any given build. You can’t really programmatically generate a modules list in a case like this; you just need to know these packages are needed, even if currently not in the build.

Even without an automated tool, I’ve had to answer this question for myself and I have some pointers on how to approach the problem.

The low-hanging fruit: check your build directory

The first thing you need to do is make sure your project builds without an allow list. That way, all modules inherited from Zephyr or from NCS (Nordic’s Zephyr-based nRF Connect SDK) will be included in the build.

manifest:
  projects:
    - name: zephyr
      revision: v3.7.0
      url: https://github.com/zephyrproject-rtos/zephyr  
      west-commands: scripts/west-commands.yml
      import: true

This manifest will include dozens of modules available from the upstream Zephyr repository. There isn’t actually anything wrong with that. You clone the modules once and they live on your hard drive. But it does take a long time to clone all of them, and they will occupy several gigabytes of space. And it’s good practice to know exactly which packages are actually in use. So let’s try to limit what is cloned in the future.

Directory listing with a few dozen Zephyr module names shown

The build/modules directory from a Zephyr app

Above is a listing of the build/modules directory from a Zephyr application. All of these modules were scanned during the build process, but almost none of them have any object files that will be used in the build.

├── hal_rpi_pico 
│   ├── CMakeFiles 
│   └── cmake_install.cmake 
├── hal_silabs 
│   ├── CMakeFiles 
│   └── cmake_install.cmake 
├── hal_st 
│   ├── CMakeFiles 
│   └── cmake_install.cmake 
├── hal_telink 
│   ├── CMakeFiles 
│   └── cmake_install.cmake

In fact, we can use this to help us find the modules that are actually at work in a project. Here’s a one-liner you can run from the build/modules directory to get a list of modules we know are needed for this build:

➜ find . -type f -not -name "cmake_install.cmake" | cut -d/ -f2 | uniq
mbedtls
golioth-firmware-sdk
zcbor
hal_nxp

Let’s add these modules to an allow-list and move on to the next step.

manifest:
  projects:
    - name: zephyr
      revision: v3.7.0
      url: https://github.com/zephyrproject-rtos/zephyr
      west-commands: scripts/west-commands.yml
      import:
        name-allowlist:
          - mbedtls
          - zcbor
          - hal_nxp

The trial-and-error step

Okay, the easy part is behind us. Now it’s time to figure things out the hard way. Begin by removing your module sources. These are usually in a modules directory that is a sibling of the zephyr directory where the Zephyr tree is stored. Check carefully that you do not have any uncommitted changes in these modules before removing them from your local storage. (I’ve learned this the hard way.)

Next, add an allow-list with the modules we found in the previous section. Run west update to clone the modules. This should happen rather quickly as we’ve greatly narrowed down what will be checked out. Try to build your application. If it fails, we need to divine which module was missing from the build and add that to the allow-list.

warning: HAS_CMSIS_CORE (defined at modules/cmsis/Kconfig:7) has direct dependencies 0 with value n, but is currently being y-selected by the following symbols:
 - CPU_CORTEX_M (defined at arch/arm/core/Kconfig:6), with value y, direct dependencies ARM (value: y), and select condition ARM (value: y)

Part of the build error points to modules/cmsis. If you look in the west.yml from the Zephyr tree, you’ll see there is indeed a module named cmsis. We can add it to our allow-list, run `west update`, and then rebuild.

Guess what? That was it… the project now builds! Here’s what my entire manifest looks like:

manifest:
  projects:
    - name: zephyr
      revision: v3.7.0
      url: https://github.com/zephyrproject-rtos/zephyr
      west-commands: scripts/west-commands.yml
      import:
        name-allowlist:
          - mbedtls
          - zcbor
          - hal_nxp
          - cmsis

  self:
    path: modules/lib/golioth-firmware-sdk
    west-commands: scripts/west-commands.yml
    userdata:
      patches_dirs:
        - patches/west-zephyr

Note that the golioth-firmware-sdk was one of the modules our search of the build directory turned up. But since that module is being added explicitly in this manifest file, it doesn’t need to be on the allow-list for the inherited Zephyr manifest.

Take control of your manifest with allow lists

Knowing exactly what libraries are being used in your build is part of good project management. Since manifest files let you target libraries and modules with version tags or commit hashes, this locks your project to a known-working state. I’m a huge advocate of this and gave an entire talk about Zephyr manifest files at the Embedded Open Source Summit.

Limiting your manifest files to libraries you are explicitly using helps you understand when upstream dependencies change. It may be a bit of a hassle to go through this process the first time, but doing so is a basic form of vetting your build and your product will be better for it.

In the most common case, Golioth Over-the-Air (OTA) Updates are used for single-image firmware upgrades. In that scenario, a device is notified about a new release. The notification includes a release manifest, which contains information about the new firmware. The most important metadata the device receives is the firmware version, hash, and URL (used to download the firmware).

Firmware is the only artifact tied to an OTA release in that scenario. But a Golioth OTA release may also include multiple artifacts. This allows you to implement multi-image upgrades, e.g. when there are multiple MCUs on a single device. Golioth OTA even supports artifacts that are not firmware at all, but large blobs of data of any kind. Examples include AI models, images, and arbitrary binary blobs.

Device with a display

This article shows an example application running on a device with a display. The implemented functionality is simple: it just displays an arbitrary image. We would like to add more capabilities in the future, so firmware upgrade is implemented as well. Additionally, we would like to be able to change the displayed image without upgrading the whole firmware.

Multi-component OTA

The Golioth SDK exposes a high-level API to easily set up single-image firmware upgrades (golioth_fw_update_init()). This automatically creates a thread that observes the newest firmware release, performs the upgrade when notified, and reboots to run the new version.

In the case of multi-component releases we will handle the manifest in application code. Let’s first implement a callback that gets executed when a new release is available:

struct ota_observe_data
{
    struct golioth_ota_manifest manifest;
    struct k_sem manifest_received;
};

static void on_ota_manifest(struct golioth_client *client,
                            const struct golioth_response *response,
                            const char *path,
                            const uint8_t *payload,
                            size_t payload_size,
                            void *arg)
{
    struct ota_observe_data *data = arg;

    LOG_INF("Manifest received");

    if (response->status != GOLIOTH_OK)
    {
        return;
    }

    LOG_HEXDUMP_INF(payload, payload_size, "Received OTA manifest");

    enum golioth_ota_state state = golioth_ota_get_state();
    if (state == GOLIOTH_OTA_STATE_DOWNLOADING)
    {
        LOG_WRN("Ignoring manifest while download in progress");
        return;
    }

    enum golioth_status status =
        golioth_ota_payload_as_manifest(payload, payload_size, &data->manifest);
    if (status != GOLIOTH_OK)
    {
        LOG_ERR("Failed to parse manifest: %s", golioth_status_to_str(status));
        return;
    }

    if (data->manifest.num_components > 0)
    {
        k_sem_give(&data->manifest_received);
    }
}

The above code checks whether the release manifest was received correctly and OTA is not already in progress. Then the CBOR encoded manifest is decoded with golioth_ota_payload_as_manifest(). If the manifest is valid and it contains at least one component, the main application thread is notified by releasing a semaphore with k_sem_give(&data->manifest_received).

Now it is time to start manifest observation in main() and wait until a release manifest is received:

int main(void)
{
    struct ota_observe_data ota_observe_data = {};

    /* ... */

    golioth_ota_observe_manifest_async(client, on_ota_manifest, &ota_observe_data);

    k_sem_take(&ota_observe_data.manifest_received, K_FOREVER);

    /* ... */
}

At this point the application continues execution after the manifest is successfully received and parsed. The next step is handling the received components:

int main(void)
{
    /* ... */

    for (size_t i = 0; i < ota_observe_data.manifest.num_components; i++) {
        struct golioth_ota_component *component = &ota_observe_data.manifest.components[i];
        uint8_t hash_bin[32];

        hex2bin(component->hash, strlen(component->hash), hash_bin, sizeof(hash_bin));

        struct component_desc *desc = component_by_name(component->package);
        if (!desc) {
            LOG_WRN("Unknown '%s' artifact package", component->package);
            continue;
        }

        if (desc->version ?
            (component_version_cmp(desc, component->version) == 0) :
            (component_hash_cmp(desc, hash_bin) == 0)) {
            continue;
        }

        LOG_INF("Updating %s package", component->package);

        status = golioth_ota_download_component(client, component, desc->write_block, NULL);
        if (status == GOLIOTH_OK) {
            reboot = true;
        }
    }

    /* ... */
}

Information about each component is stored in the ota_observe_data.manifest.components[] array. Either the version or the hash is compared with the received component’s. When it differs, the new component is downloaded with the golioth_ota_download_component() API.
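The comparison in the loop above can be isolated as a small predicate: components that carry a version string (firmware) are compared by version, while versionless artifacts (data blobs) are compared by content hash. A sketch mirroring the snippet above, not SDK code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns non-zero if the component from the manifest differs from what
 * the device currently has. `current_version` is NULL for artifacts that
 * are tracked by hash only (e.g. the background image). */
static int needs_update(const char *current_version,
                        const uint8_t current_hash[32],
                        const char *new_version,
                        const uint8_t new_hash[32])
{
    if (current_version) {
        return strcmp(current_version, new_version) != 0;
    }
    return memcmp(current_hash, new_hash, 32) != 0;
}
```

This is why the `main` entry in `component_descs[]` sets `.version` while the `background` entry does not: firmware has a semantic version to compare, but a data blob only has its content.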

Firmware and background components require different handling. This is achieved with the component_descs[] array and helper functions:

struct component_desc
{
    const char *name;
    const char *version;
    uint8_t hash[32];
    ota_component_block_write_cb write_block;
};

static struct component_desc component_descs[] = {
    { .name = "background", .write_block = write_to_storage },
    { .name = "main", .write_block = write_fw, .version = _current_version },
};

static int component_hash_update(struct component_desc *desc, uint8_t hash[32])
{
    memcpy(desc->hash, hash, 32);

    return 0;
}

static int component_hash_cmp(struct component_desc *desc, const uint8_t hash[32])
{
    return memcmp(desc->hash, hash, 32);
}

static int component_version_cmp(struct component_desc *desc, const char *version)
{
    return strcmp(desc->version, version);
}

static struct component_desc *component_by_name(const char *name)
{
    for (size_t i = 0; i < ARRAY_SIZE(component_descs); i++) {
        struct component_desc *desc = &component_descs[i];

        if (strcmp(desc->name, name) == 0) {
            return desc;
        }
    }

    return NULL;
}

Downloaded firmware is written directly to flash, into the second application slot:

static struct flash_img_context flash;

enum golioth_status write_fw(const struct golioth_ota_component *component,
                             uint32_t block_idx,
                             uint8_t *block_buffer,
                             size_t block_size,
                             bool is_last,
                             void *arg)
{
    const char *filename = component->package;
    int err;

    LOG_INF("Writing %s block idx %u", filename, (unsigned int) block_idx);

    if (block_idx == 0) {
        err = flash_img_prepare(&flash);
        if (err) {
            return GOLIOTH_ERR_FAIL;
        }
    }

    err = flash_img_buffered_write(&flash, block_buffer, block_size, is_last);
    if (err) {
        LOG_ERR("Failed to write to flash: %d", err);
        return GOLIOTH_ERR_FAIL;
    }

    if (is_last) {
        LOG_INF("Requesting upgrade");

        err = boot_request_upgrade(BOOT_UPGRADE_TEST);
        if (err) {
            LOG_ERR("Failed to request upgrade: %d", err);
            return GOLIOTH_ERR_FAIL;
        }
    }

    return GOLIOTH_OK;
}

The background image is written to the file system using the write_to_storage() callback:

enum golioth_status write_to_storage(const struct golioth_ota_component *component,
                                     uint32_t block_idx,
                                     uint8_t *block_buffer,
                                     size_t block_size,
                                     bool is_last,
                                     void *arg)
{
    const char *filename = component->package;
    struct fs_file_t fp = {};
    fs_mode_t flags = FS_O_CREATE | FS_O_WRITE;
    char path[32];
    int err;
    ssize_t ret;

    LOG_INF("Writing %s block idx %u", filename, (unsigned int) block_idx);

    if (block_idx == 0) {
        flags |= FS_O_TRUNC;
    }

    sprintf(path, "/storage/%s", filename);

    err = fs_open(&fp, path, flags);
    if (err) {
        LOG_ERR("Failed to open %s: %d", filename, err);

        return GOLIOTH_ERR_FAIL;
    }

    err = fs_seek(&fp, block_idx * CONFIG_GOLIOTH_BLOCKWISE_DOWNLOAD_BUFFER_SIZE, FS_SEEK_SET);
    if (err) {
        goto fp_close;
    }

    ret = fs_write(&fp, block_buffer, block_size);
    if (ret < 0) {
        err = ret;
        goto fp_close;
    }

fp_close:
    fs_close(&fp);

    if (err) {
        return GOLIOTH_ERR_FAIL;
    }

    return GOLIOTH_OK;
}

Displaying (updated) background

Firmware is updated automatically during the next boot, so nothing more is needed to start using it. The background image, on the other hand, needs to be loaded from the file system in application code:

static lv_img_dsc_t img_background;

static int background_show(void)
{
    uint8_t hash[32] = {};
    struct fs_dirent dirent;
    struct fs_file_t background_fp = {};
    lv_img_header_t *img_header;
    uint8_t *buffer;
    int err;
    ssize_t ret;

    err = fs_stat("/storage/background", &dirent);
    if (err) {
        if (err == -ENOENT) {
            LOG_WRN("No background image found on FS");
        } else {
            LOG_ERR("Failed to check/stat background image: %d", err);
        }

        return err;
    }

    LOG_INF("Background image file size: %zu", dirent.size);

    buffer = malloc(dirent.size);
    if (!buffer) {
        LOG_ERR("Failed to allocate memory");
        return -ENOMEM;
    }

    err = fs_open(&background_fp, "/storage/background", FS_O_READ);
    if (err) {
        LOG_WRN("Failed to load background: %d", err);
        goto buffer_free;
    }

    ret = fs_read(&background_fp, buffer, dirent.size);
    if (ret < 0) {
        LOG_ERR("Failed to read: %zd", ret);
        err = ret;
        goto background_close;
    }

    if (ret != dirent.size) {
        LOG_ERR("ret (%d) != dirent.size (%d)", (int) ret, (int) dirent.size);
        err = -EIO;
        goto background_close;
    }

    err = mbedtls_sha256(buffer, dirent.size, hash, 0);
    if (err) {
        LOG_ERR("Failed to get update sha256: %d", err);
        goto background_close;
    }

    LOG_HEXDUMP_INF(hash, sizeof(hash), "hash");

    component_hash_update(&component_descs[0], hash); /* the "background" entry */

    img_header = (void *)buffer;
    img_background.header = *img_header;
    img_background.data_size = dirent.size - sizeof(*img_header);
    img_background.data = &buffer[sizeof(*img_header)];

    lv_obj_t * background = lv_img_create(lv_scr_act());
    lv_img_set_src(background, &img_background);
    lv_obj_align(background, LV_ALIGN_CENTER, 0, 0);

background_close:
    fs_close(&background_fp);

buffer_free:
    free(buffer);

    return err;
}

Note that besides loading the background image, there is also a SHA256 calculation using mbedtls_sha256(). This is needed for comparison with the SHA256 hash received from the OTA service, in order to decide whether the background image needs to be updated.
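One wrinkle in that comparison: the manifest delivers the component hash as a hex string (as seen in the `hash=6b4d…` log line below), while mbedtls_sha256() produces raw bytes, so a hex decode is needed before memcmp(). Zephyr provides hex2bin() for this, as used in the manifest-handling loop earlier; a portable equivalent looks like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a hex string into `bin`; returns the number of bytes written,
 * or -1 on malformed input or insufficient space. */
static int hex_to_bin(const char *hex, uint8_t *bin, size_t bin_len)
{
    size_t n = strlen(hex);

    if (n % 2 != 0 || n / 2 > bin_len) {
        return -1;
    }
    for (size_t i = 0; i < n / 2; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);

        if (hi < 0 || lo < 0) {
            return -1;
        }
        bin[i] = (uint8_t)((hi << 4) | lo);
    }
    return (int)(n / 2);
}
```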

Testing with native_sim

Round display with a black bezel around a white image with the Golioth Echo mascot at the center. A USB cable is plugged into the device on the right side of the screen.

XIAO ESP32S3 with Seeed Studio XIAO Round Display

The example-download-photo application is compatible with the XIAO ESP32S3 with the Seeed Studio XIAO Round Display. However, it is also possible to test with the Native Simulator. To do that, the following commands can be used:

# Build the example
west build -p -b native_sim/native/64 $(west topdir)/example-download-photo

# Run the example
west build -t run

Native Simulator uses the SDL library to emulate a display. On the first run the display is blank because no background image is available yet. Now it is time to upload a background image as an OTA artifact and create a release. An example background image is included in the repository at background/Echo-Pose-Stand.bin. After rolling out an OTA release, this image is downloaded automatically to the /storage/background file on the device, which is indicated by the following logs:

[00:00:01.310,007] <inf> example_download_photo: Received OTA manifest
...
[00:00:01.310,007] <inf> example_download_photo: component 0: package=background version=1.0.5 uri=/.u/c/[email protected] hash=6b4d243a362c0c4f63c535b2d2f7b8dfe4bcfbca69e7b2f8009f917458794c5e size=35716
[00:00:01.310,007] <inf> example_download_photo: Updating background package
[00:00:01.560,008] <inf> example_download_photo: Writing background block idx 0
[00:00:01.700,009] <inf> example_download_photo: Writing background block idx 1
...
[00:00:06.320,042] <inf> example_download_photo: Writing background block idx 34

Starting the Native Simulator again shows the following screen:

We’re on a roll with our showcase of examples that came out of Golioth’s AI Summer. Today I’m discussing an example that records audio on an IoT device and uploads the audio to the cloud.

Why is this useful? This example is a great approach to sending sensor data from your edge devices back to the cloud to use in a machine learning (ML) training set, or just a great way to collect data samples from your network. Either way, we’ve designed Golioth to efficiently handle data transfer between constrained devices and the cloud.

The full example code is open source and ready to use.

Overview

The bones of this example are the same as the image upload example we showcased earlier. The main components include:

  1. An audio sample (or any other chunk of data you’d like to send).
  2. A callback function to fill a buffer with blocks from your data source.
  3. A function call to kick off the data upload.

That’s it for the device side of things. Sending large chunks of data is a snap for your firmware efforts.

The cloud side is very simple too, using Golioth Pipelines to route the data as desired. Today we’ll send the audio files to an Amazon S3 bucket.

1. An Audio Sample

The details of audio recording are not important for this example. WAV, MP3, FLAC…it’s all just 1’s and 0’s at the end of the day! The audio is stored in a buffer and all we need to know is the address of that buffer and its length.

If you really want to know more, this code is built to run on one of two M5Stack boards: the Core2 or the CoreS3. Both have a built-in I2S microphone and an SD card that is used to store the recording. SD card storage is a great choice for prototyping because you can easily pop out the card and access the file on your computer to confirm that what you uploaded is identical. Full details are found in the audio.c file.

2. Callback function

To use block upload with Golioth, you need to supply a callback function to fill the data buffer. The Golioth Firmware SDK will call this function when preparing to send each block.

uint8_t audio_data[MAX_BUF_SIZE];
size_t audio_data_len;

/* Run some function to record data to buffer and set the length variable */

static enum golioth_status block_upload_read_chunk(uint32_t block_idx,
                                                   uint8_t *block_buffer,
                                                   size_t *block_size,
                                                   bool *is_last,
                                                   void *arg)
{
    size_t bu_offset = block_idx * bu_max_block_size;
    size_t bu_size = audio_data_len - bu_offset;
    if (bu_size <= *block_size)
    {
        /* We run out of data to send after this block; mark as last block */
        *block_size = bu_size;
        *is_last = true;
    }
    /* Copy data to the block buffer */
    memcpy(block_buffer, audio_data + bu_offset, *block_size);
    return GOLIOTH_OK;
}

The above code is a very basic version of the callback. It assumes you have a global buffer audio_data[] where recorded audio is stored, and a variable audio_data_len that tracks how much data is stored there. Each time the callback runs it reads from a different part of the source buffer, calculating the offset from the block index and the maximum block size. The callback signals the final block by setting the *is_last variable to true and updating *block_size to indicate the actual number of bytes in the final block.

You can see the full callback function in the example app; it includes complete error checking and uses the file handling APIs from the standard library, with a pointer to the file on the SD card passed into the callback as the user argument.

3. API call to begin upload

Now we start the upload by using the Stream API call, part of the Golioth Firmware SDK. Just provide the important details for your data source and the path to use when uploading.

int err = golioth_stream_set_blockwise_sync(client,
                                            "file_upload",
                                            GOLIOTH_CONTENT_TYPE_OCTET_STREAM,
                                            block_upload_read_chunk,
                                            NULL);

This API call includes four required parameters shown above:

  • client is a pointer to the Golioth client that holds info like credentials and server address
  • "file_upload" is the path at which the file should be uploaded (change this at will)
  • GOLIOTH_CONTENT_TYPE_OCTET_STREAM is the data type (binary in this case)
  • block_upload_read_chunk is the callback we wrote in the previous step

The final parameter is a user argument. In the audio sample app we use this to pass a pointer to read data from the file on the SD card.
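As a sketch of that pattern, here is a simplified callback that reads each block from a file handle supplied via the user argument. The type and constant names are illustrative stand-ins, not the SDK's actual definitions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the SDK's status type and block size */
typedef enum { STATUS_OK, STATUS_IO_ERR } status_t;
#define MAX_BLOCK_SIZE 1024

static status_t read_chunk_from_file(uint32_t block_idx,
                                     uint8_t *block_buffer,
                                     size_t *block_size,
                                     bool *is_last,
                                     void *arg)
{
    FILE *fp = (FILE *) arg; /* user argument carries the open file handle */

    /* Seek to this block's offset in the file */
    if (fseek(fp, (long) block_idx * MAX_BLOCK_SIZE, SEEK_SET) != 0)
    {
        return STATUS_IO_ERR;
    }

    size_t max = *block_size;
    size_t n = fread(block_buffer, 1, max, fp);
    *block_size = n;
    /* A short read means the file is exhausted; a file whose size is an
     * exact multiple of the block size would need one extra check. */
    *is_last = (n < max);
    return STATUS_OK;
}
```

Because the file handle arrives through the user argument, the callback itself stays free of globals, which makes it easier to reuse for multiple uploads.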

Routing your data

The example includes a Golioth pipeline for routing your data.

filter:
  path: "/file_upload*"
  content_type: application/octet-stream
steps:
  - name: step0
    destination:
      type: aws-s3
      version: v1
      parameters:
        name: golioth-pipelines-test
        access_key: $AWS_S3_ACCESS_KEY
        access_secret: $AWS_S3_ACCESS_SECRET
        region: us-east-1

You can see the path in the pipeline matches the path we used in the API call of the previous step. This instructs Golioth to listen for binary data (octet-stream) on that path, and when found, route it to an Amazon S3 bucket. Once enabled, your audio file will automatically appear in your S3 bucket!

IoT data transfer shouldn’t be difficult

That’s worth saying twice: IoT data transfer shouldn’t be difficult. In fact, nothing in IoT should be difficult. And that’s why Golioth is here. It’s our mission to connect your fleet to the cloud, and make accessing, controlling, updating, and maintaining your fleet a great experience from day one. Take Golioth for a test drive now!

One of my favorite engineering processes at Golioth is our architecture design review. When building new systems, making consequential changes to existing systems, or selecting a third-party vendor, an individual on the engineering team authors an architecture design document using a predefined template. This process has been in place long enough (more than 18 months) that we have started to observe long-term benefits.

Some of the benefits are fairly obvious: more efficient implementation of large-scale functionality, better communication across engineering domains, smoother context sharing during new engineer on-boarding. Others are more subtle. One aspect of codifying a decision making process that I personally find extremely valuable is the ability to check the pulse of an organization over time. How thorough are design documents? How robust is the feedback provided? Are individuals providing push back regardless of any organizational hierarchy? Are discussions reaching resolution in an appropriate manner? Many of these questions center on how disagreements are resolved.

Disagreement is one of my favorite aspects of the engineering process. When done correctly, it drives a team towards an optimal solution, builds a stronger sense of trust between individuals, and results in more comprehensive exploration and documentation of a problem space. Through healthy disagreement, the Golioth engineering team typically arrives at one of three possible outcomes.

  1. Consensus is reached around one of the presented solutions.
  2. Consensus is reached around a new solution that incorporates aspects of each of the presented solutions.
  3. It is determined that more information is needed, or the decision does not have to be made to move forward.

However, reaching one of these outcomes does not necessarily mean that the process was effective. One failure mode is reaching perceived consensus around one solution, when in reality one individual doesn’t feel comfortable pushing back against the other. Another is abdicating responsibility by deferring a decision that actually does need to be made now. In the moment, it is not always clear whether the process is effective, but the beauty of codifying the interaction is that it can be evaluated in the future with the benefit of hindsight.

This week I opened up the review window for a design document I recently authored, and within 24 hours I had received high quality feedback from multiple members of the engineering organization. Furthermore, there were some key points of disagreement included in the feedback, which we resolved efficiently, with outcomes ranging from reaching consensus on a counter proposal to deferring a portion of the system to a future design document.

Compared to the early days of instituting the review process, more recent architecture design documents have involved more disagreement, but also more efficient resolution. While excess conflict can sow seeds of division, a mature engineering organization will turn differences of opinion into progress. Tackling any complex problem will involve some disagreement — for a strong team it will be the right amount.

It has been over three months since we announced Golioth Pipelines, and we have already seen many users reduce costs and unlock new use cases by migrating to Pipelines. As part of the announcement, we provided options for users who were currently leveraging Output Streams, which offered a much more constrained version of the same functionality, to seamlessly migrate their existing projects to Pipelines. Today, we are announcing December 12th, 2024 as the official end of life date for Output Streams.

For users operating projects that either started out with Pipelines, or have been transitioned to Pipelines as part of the opt-in migration process, there will be no change required. For the few projects that are still leveraging Output Streams, we encourage users to start the migration process now by submitting project information here, or to reach out to us at [email protected] with any questions or concerns. On December 12th, all projects that have not already been migrated to Pipelines will be automatically migrated with Pipelines configured to replicate the previous behavior of Output Streams. Output Stream configuration will no longer be accessible in the Golioth console.

The rapid adoption of Pipelines by the Golioth community has been exciting to witness, and we are looking forward to the ongoing growth of the platform via new Transformers and Destinations. If you are currently using Pipelines, or would like to see new functionality added, contact us on the forum!

Golioth is excited to announce Golioth Solutions. Two new capabilities will help businesses deploy IoT devices in a short period of time:

  • Golioth Solutions Services
  • Golioth Solutions Marketplace

A New Service Offering

Golioth Solutions Services solves many of the difficult problems at the start of developing an IoT product, namely pushing a fully formed idea out into the world. Golioth Solutions Engineers will help identify how companies can best deploy the Golioth platform to solve their business needs and deliver a product that captures real-world data and provides consistent insight and control of your devices.

Our Solutions Engineers will work with you to formulate what is required for your particular business use case and what kind of solution will get you there fastest. This includes hardware, firmware, cloud capabilities, fleet management, and application development. Our Solutions Engineers fill in the gaps where your team needs help. Perhaps you are a cloud software company looking to deploy a hardware device? Solutions Engineers will utilize existing hardware and firmware Solutions to send data up to Golioth and out to the platform of your choosing using Pipelines. What if you’re on the other end of the spectrum, a hardware company looking to connect custom hardware to the cloud? Our Solutions Engineers can set you up with known working hardware and firmware to use as a reference while you develop your own custom hardware, and they can consult on how data should reach the cloud and route to outside services.

A Marketplace of Solutions

We are also launching Golioth Solutions Marketplace where customers can view existing solutions. These form the basis of “starting points” for many custom projects that Golioth Solutions Services will deliver.

In order to deliver IoT solutions in a short amount of time, we want our Solutions Engineers to have an arsenal of ready-made designs that can be customized to customers’ needs. This includes our internal Reference Designs as well as designs from Partners. We will continue to add to these designs and highlight them here on the blog when a new one becomes available.

Designs from our Partners

The Golioth Solutions Marketplace includes production-grade hardware produced by our Design Partners. Each solution also includes custom firmware and cloud capabilities targeted at a particular solution space and vertical application. Each of these designs is built on the Golioth platform and is customizable to specific business needs.

Many of these designs can also be repurposed towards a different vertical, based on the capability contained within the Solution. Our Solutions Engineers know how each of these technologies might fit a new, custom application. Since these solutions are developed by our Design Partners, the same creators of the hardware can also enhance and customize the product to your needs. As customers decide to scale, our Design Partners are well prepared to guide customers through production and productization.

Are you a product development shop interested in having your hardware listed in our Solutions Marketplace? Submit your designs now to start the process.

Introducing the Glassboard Shunt

One of our first Solutions comes from our design partner Glassboard, based out of Indianapolis in the US. The IoT Power Monitoring for Micromobility Solution includes a cellular-connected current shunt. This design is intended to measure battery currents on small vehicles. It works in both directions, measuring the current sourced to motors during motion as well as the charging current flowing back into the battery. We recorded a video about this design and how it fits in with Golioth Solutions:

While this is initially targeted at micromobility applications, it’s easy to imagine how this device and starter firmware could be retargeted at a different vertical. One example could be monitoring a DC power source that powers LED lighting for a construction application.

How Golioth Solutions Engineers use designs

Solutions Engineers take input from customers and determine if any of our existing designs (like the Glassboard current shunt) are a good fit for the application at hand. Perhaps there is a new DC current measurement that could benefit from the existing hardware, but it needs to be tweaked to better fit the application space. Our Solutions Engineers first modify and test the firmware to fit the device needs, and then work with the customer to determine where the resulting data will go, and if there are additional needs around visualization or control of the fleet of devices. If the hardware requires some kind of modification, our Solutions Engineers will connect customers with the original designers to discuss the logistics of creating a custom version of the existing hardware.

Golioth Reference Designs

Another source of Golioth Solutions is our range of Reference Designs, which can be customized and delivered by Golioth Solutions Services. We have been working on and refining Reference Designs for a few years now. These are end-to-end demonstrations of Golioth built on top of custom hardware.

What about licensing? All Golioth Reference Design hardware is open source with a very permissive license. Customers can take the underlying hardware to one of our Design Partners and have them modify, extend, and refine it for production. You will be starting from a solution that is continually tested and can be easily extended using off-the-shelf sensor breakouts. Our Solutions Engineers can get you started especially quickly using the Follow Along Hardware version of these designs, which includes firmware targeting off-the-shelf development boards and sensors, so no custom hardware is required.

New Services + New Marketplace = Quicker time to market

Golioth Solutions and our associated marketplace exist to help users who need an IoT solution for their business but don’t necessarily have the time or capabilities to build it themselves. We can bootstrap your solution from a sketch on a page to a working IoT device backed by a powerful IoT platform that handles all your data. Once the idea is proven out, we have a well-defined handoff to our Design Partners, who can assist in turning that first device into a fleet of production-ready hardware you can deploy to the field. You will be prototyping and testing on an IoT platform built for scale.

If you’d like to start building an IoT Solution that will serve your business, please get in touch! You can email [email protected] to find out more or fill out this form to directly schedule a meeting.

Recently, we teamed up with Qualcomm to showcase Golioth’s AI features. This demo stands out because we used the Qualcomm RB3 Gen 2 Development Kit running Linux. Staying true to our demo culture, we wanted to share how we pulled it off, what we learned about using Golioth with Linux, and where we might take this in the future.

Let’s dive in!

Wait, Golioth supports Linux??? 🐧

If you’ve been following us for a while, you probably know about our support for microcontrollers—from Zephyr tips to our cross-platform Firmware SDK. But you won’t find much mention of connecting Linux devices in our docs or blogs because we don’t officially support managing Linux-based devices. I say officially because we’ve actually had a Linux port for quite a while. It started as part of our CI testing infrastructure, helping us speed up tests on the Firmware SDK so that we can test more frequently than is possible with physical devices.

Over the years we’ve received many requests to support Linux-based applications in a few different configurations. Sometimes a company was developing a product that had both an MCU and a Linux gateway (like a Thread Border Router) and wanted to manage the entire fleet with Golioth. Other customers were building a complex system with both an MPU and an MCU in the same device. And of course, many more are building Linux edge-type devices.

Since the scope of the Linux port was initially narrow, it was never designed to be a full “SDK”. Whenever a customer would ask if they could use the port for their embedded Linux device, we usually steered them away and pointed them to folks like Foundries.io or Balena.

Working with Foundries.io & AI Hub

We recently caught up with our friends at Foundries.io, who joined Qualcomm a few months ago, to see what they’ve been up to (we’ve collaborated in the past). They were excited about some of Golioth’s new model management capabilities and connected us with the team from AI Hub (formerly Tetra AI). We discussed doing a joint demo together, and Qualcomm wanted to highlight this new Linux-based device. Our mission is to be the universal connector for IoT, so we were up for the challenge. After some brainstorming we got our hands on their latest devkit and got to work.

Getting the Firmware SDK up and running on Linux

Building a Linux application that uses the Golioth Firmware SDK is as straightforward as building any other C program with CMake, and it requires minimal dependencies (see an example here). However, getting that application onto an embedded Linux device and managing its lifecycle requires additional infrastructure.

RB3 Gen2 Device on Foundries

Foundries.io is a perfect complement in this scenario, with existing support for the RB3 Gen2 devkit, and a simple container-based GitOps workflow for managing applications running on their Linux microPlatform (LmP) distro. Flashing the device with the LmP image, building an OCI image with a basic Golioth “Hello, World” application, and remotely deploying the application to the device only took minutes.

Leveraging the Hardware

The value of any embedded device is tied to how well the software is able to leverage the hardware, and the RB3 Gen 2 is an embarrassment of riches compared to the microcontrollers we usually interface with at Golioth. Based on the QCS6490 SoC, it includes a Kryo 670 CPU with 8 application processing cores, an Adreno 643L GPU, and a Hexagon DSP for accelerating AI workloads. Additionally, the RB3 Gen 2 Development Kit boasts a low and high resolution camera, as well as audio peripherals and an array of sensors.

RB3 Gen 2 Object Detection

AI Hub offers pre-tuned AI models optimized for Qualcomm hardware like the QCS6490, many of which leverage its robust image processing capabilities. Furthermore, Qualcomm provides the Intelligent Multimedia SDK (IM SDK), which includes a suite of GStreamer plugins that make it straightforward to access both peripherals and acceleration hardware. Combining these together with Golioth means that we can add connectivity to the equation, making it possible to stream data to the cloud, control processing remotely, and manage artifacts used in the processing pipeline.

Streaming Inference Results

We selected the YoloNAS model from AI Hub to perform object detection on the RB3 Gen2. The application constructed a GStreamer pipeline that pulled video from the high resolution camera, passed it to the model for inference, then composed the object detection results with the original video data in order to render a bounding box around objects before passing the final video to the display.

RB3 Gen 2 Inference Stream

We also injected Golioth into the GStreamer pipeline, such that messages could be streamed to the cloud to notify when certain classes of objects were detected. As with all data streamed to Golioth, these messages could subsequently be routed to any other destination via Golioth Pipelines.
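For example, a detection event can be serialized to a small JSON payload before handing it to the SDK's stream API. The message format and helper below are assumptions for illustration, not the demo's actual code:

```c
#include <stdio.h>

/* Hypothetical helper: format an object-detection event as a JSON message.
 * Returns the number of characters written (excluding the terminator). */
static int format_detection_event(char *out, size_t out_len,
                                  const char *label, double confidence)
{
    /* e.g. {"label":"person","confidence":0.87} */
    return snprintf(out, out_len,
                    "{\"label\":\"%s\",\"confidence\":%.2f}",
                    label, confidence);
}
```

The resulting buffer would then be streamed to the cloud on a path such as /detections (a name chosen here for illustration) and routed onward to any destination with Pipelines.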

Remotely Controlling Image Processing

Outside of the GStreamer pipelines, we set up Golioth Remote Procedure Call (RPC) handlers that allowed image processing and inference to be paused and resumed remotely. This functionality could be further extended to stream the current frame to an object storage destination via Golioth when processing is paused, all without requiring any physical intervention with the device.

RB3 Gen 2 RPC

Managing and Deploying AI Models

While Foundries.io handles application updates, being able to manage other types of artifacts used by applications, such as AI models and labels, enables efficient updates without needing to rebuild and redeploy the entire application. Integrating Golioth OTA into the application meant that it was notified immediately when a new model was available and could quickly download and integrate it into the processing pipeline.

RB3 Gen 2 Model Update

Lessons and future explorations

We set out to create a Golioth application that would be useful on Linux, and it was a success. We’ve proven to ourselves that Golioth’s services are useful for other IoT device types, especially embedded Linux, and that the Firmware SDK can work effectively in this context. Taking the code we developed, we’ve already identified how we might evolve it into more of an agent or daemon and how we might better integrate with update mechanisms, especially on Yocto and Buildroot based distributions.

We’ll continue to explore the Linux for IoT space and see if and when it makes sense for us to do more here. Of course, you can count on us to continue doing more and more with MCUs. But we’re curious to hear what the community thinks: should Golioth invest in supporting Linux officially? What features or use cases would you like to see? Please share your thoughts on our forums!

Using the Hugging Face Inference API for Device Audio Analysis

Golioth Pipelines works with Hugging Face, as shown in our recent AI launch. This post will highlight how to use an audio classification model on Hugging Face that accepts data recorded on a microcontroller-based device, sent over a secure network connection to Golioth, and routed through Pipelines.

While most commonly known as the place where models and data sets are uploaded and shared, Hugging Face also provides a compute service in the form of its free serverless inference API and production-ready dedicated inference endpoints. Unlike other platforms that offer only proprietary models, Hugging Face allows access to over 150,000 open source models via its inference APIs. Additionally, private models can be hosted on Hugging Face, which is a common use case for Golioth users that have trained models on data collected from their device fleets.

Audio Analysis with Pipelines

Because the Hugging Face inference APIs use HTTP, they are easy to target with the webhook transformer. The structure of the request body will depend on the model being invoked, but for models that operate on media files, such as audio or video, the payload is typically raw binary data.

In the following pipeline, we target the serverless inference API with an audio sample streamed from a device. In this scenario, we want to perform sentiment analysis of the audio, then pass the results onto Golioth’s timeseries database, LightDB Stream, so that changes in sentiment can be observed over time. An alternative destination, or multiple destinations, could easily be added.

Click here to use this pipeline in your project on Golioth.

filter:
  path: "/audio"
steps:
  - name: emotion-recognition
    transformer:
      type: webhook
      version: v1
      parameters:
        url: https://api-inference.huggingface.co/models/superb/hubert-large-superb-er
        headers:
          Authorization: $HUGGING_FACE_TOKEN
  - name: embed
    transformer:
      type: embed-in-json
      version: v1
      parameters:
        key: text
  - name: send-lightdb-stream
    destination:
      type: lightdb-stream
      version: v1

Note that though Hugging Face’s serverless inference API is free to use, it is rate-limited and subject to high latency and intermittent failures due to cold starts. For production use-cases, dedicated inference endpoints are recommended.

We can pick any supported model on Hugging Face for our audio analysis task. As shown in the URL, the Hubert-Large for Emotion Recognition model is targeted, and the audio content received on the /audio path is passed directly to Hugging Face. An example of how to upload audio to Golioth using an ESP32 can be found here.

Results from the emotion recognition inference look as follows.

[
  {
    "score": 0.6310836672782898,
    "label": "neu"
  },
  {
    "score": 0.2573806643486023,
    "label": "sad"
  },
  {
    "score": 0.09393830597400665,
    "label": "hap"
  },
  {
    "score": 0.017597444355487823,
    "label": "ang"
  }
]

Expanding Capabilities

Countless models are uploaded to Hugging Face on a daily basis, and the inference API integration with Golioth Pipelines makes it simple to incorporate the latest new functionality into any connected device product. Let us know what models you are using on the Golioth Forum!

TensorFlow Lite is a machine learning (ML) platform that runs on microcontroller-based devices. It’s AI for IoT, which raises a few interesting challenges. Chief among them is figuring out how to update the ML model the device is currently using. Wonder no longer: Golioth has already figured this part out! Let’s walk through how to update a TensorFlow Lite model remotely.

Today’s example is based on the ESP-IDF ecosystem and uses an M5Stack CoreS3 board. The application uses a TensorFlow Lite learning model to recognize when you speak the words “yes” and “no”. After performing an over-the-air (OTA) update of the learning model, the words “stop” and “go” will also be recognized.

What is a TensorFlow Lite Model?

Applications that use TensorFlow Lite need a machine learning model that has been trained on a large data set. TensorFlow (TF) can run on microcontrollers because this learning model has already been trained using vastly greater processing power. The “edge” device can use this pre-trained model, but will not be able to directly improve the learning set. But that set can be updated in the field.

Golioth has a TensorFlow Model Update example application that updates the TF learning model whenever a new version is available on the cloud. In this way, you can train new models and deploy them to a fleet of devices. If you capture data on device and send it up to Golioth, you can use your live captures to also train new models like Flox Robotics does.

Overview of the Model Update Process

The basic steps for updating a TensorFlow model are as follows:

  1. Upload a new learning model as an Artifact on Golioth.
  2. Roll out a release that includes your model as a non-firmware artifact.
    • You can update your Model all by itself without performing a full firmware update.
  3. Device recognizes and downloads the newly available version.
  4. Device re-initializes the TensorFlow application to use the new model.

The ability to upload the model separately from the device firmware delivers a few benefits. It saves bandwidth and power budget because the smaller download completes more quickly. You will also be tracking fewer firmware versions, as the model is versioned separately.

Core Concepts from the Golioth Firmware SDK

There are two core concepts from the Golioth Firmware SDK used in this example. The first is that an IoT device can register to be notified whenever a new release is available from the cloud.

/* Register to receive notification of manifest updates */
enum golioth_status golioth_ota_observe_manifest_async (struct golioth_client *client, golioth_get_cb_fn callback, void *arg)

/* Convert the received payload into a manifest structure */
enum golioth_status golioth_ota_payload_as_manifest (const uint8_t *payload, size_t payload_size, struct golioth_ota_manifest *manifest)

The second concept is the ability to retrieve Artifacts (think binary files like a firmware update or a new TF Lite model) from Golioth.

/* Use blockwise download to retrieve an Artifact */
enum golioth_status golioth_ota_download_component (struct golioth_client *client, const struct golioth_ota_component *component, ota_component_block_write_cb cb, void *arg)

These two concepts are applied in the Golioth TF Lite example to detect when a new model is available, download it to local storage, and begin using it in the application. While this example uses ESP-IDF, the Golioth Firmware SDK also works with Zephyr and ModusToolbox.

Model Update Walk Through

1. Upload your new TensorFlow Model to Golioth

This step couldn’t be simpler: head over to the Golioth web console, navigate to Firmware Updates→Artifacts, and click the Create button.

Browser window showing the "Upload an Artifact" dialog on the Golioth web console

Give the artifact a Package ID that the device will use to recognize it as a new model. Here I’ve used the clever name: model.

Each file you upload requires an Artifact Version number that follows semantic versioning syntax (e.g. v1.2.3). Once you’ve filled in these fields, select the file you want to upload and click Upload Artifact.

2. Roll out a release of the new Model

Rolling out your new model to devices is even easier than the upload step. Navigate to Firmware Updates→Releases and click the Create button.

Golioth web console showing the Create Release dialog

Under the Artifacts dropdown menu, select the artifact created in the previous step (note the package name and version number). I have also enabled the Start rollout? toggle so that this release will be immediately available to devices once the Create Release button is clicked.

This will roll out the model to all devices in the fleet. However, the Blueprint and Tags fields may optionally be used to target a specific device or group of devices.

3. Device-side download and storage

Learning models tend to be large, so it’s a good idea to store the model locally so that it doesn’t need to be re-downloaded the next time the device goes through a power cycle. Whatever storage you choose, the download process is the same: the model arrives in blocks, and a callback function your app supplies writes each block’s data to the storage location.

There is a bit of a song and dance here to avoid deadlocking callbacks. The first step is to register a callback when a new release manifest is received from Golioth:

/* Listen for OTA manifest */
int err = golioth_ota_observe_manifest_async(client, on_manifest, NULL);

Here’s the on_manifest callback with all the error checking and most of the logging removed for brevity. Since this is running in a callback, I push the desired manifest component into a queue, which will be read later from the main loop.

#define MODEL_PACKAGE_NAME "model"

static void on_manifest(struct golioth_client *client,
                        const struct golioth_response *response,
                        const char *path,
                        const uint8_t *payload,
                        size_t payload_size,
                        void *arg)
{
    struct golioth_ota_manifest man;

    golioth_ota_payload_as_manifest(payload, payload_size, &man);

    for (int i = 0; i < man.num_components; i++)
    {
        if (strcmp(MODEL_PACKAGE_NAME, man.components[i].package) == 0)
        {
            struct golioth_ota_component *stored_component =
                (struct golioth_ota_component *) malloc(sizeof(struct golioth_ota_component));
            memcpy(stored_component, &man.components[i], sizeof(struct golioth_ota_component));

            xQueueSendToBackFromISR(xQueue, &stored_component, NULL);
        }
    }
}

Next, we have a function to perform the download of the components in the queue. I’ve removed some housekeeping code to make this more readable. At its core, this function gets a pointer to write the file to an SD card, generates the path and filename, then begins a block download using write_artifact_block as a callback for each block received.

static void download_packages_in_queue(struct golioth_client *client)
{
    while (uxQueueMessagesWaiting(xQueue))
    {
        struct golioth_ota_component *component = NULL;
        FILE *f = NULL;

        if (xQueueReceive(xQueue, &component, 0) != pdTRUE)
        {
            break;
        }

        /* Store components with name_version format: "componentname_1.2.3" */
        size_t path_len = sizeof(SD_MOUNT_POINT) /* includes NUL terminator */
            + strlen("/") + strlen(component->package) + strlen("_xxx.xxx.xxx");

        char path[path_len];
        snprintf(path,
                 sizeof(path),
                 "%s/%s_%s",
                 SD_MOUNT_POINT,
                 component->package,
                 component->version);

        GLTH_LOGI(TAG, "Opening file for writing: %s", path);
        f = fopen(path, "a");

        /* Start the block download from Golioth */
        golioth_ota_download_component(client, component, write_artifact_block, (void *) f);

        fclose(f);
        free(component);
    }
}

Here’s the full block callback function. It’s quite straightforward. The Golioth SDK will repeatedly run the callback; each time it is called, your application needs to write the data from block_buffer to a storage location.

Normally the offset for each write is calculated by multiplying block_idx by the maximum block size (only the final block may be shorter than that). However, since I’ve passed a file stream pointer in as the user argument, we simply make sequential writes and the file position advances automatically.

static enum golioth_status write_artifact_block(const struct golioth_ota_component *component,
                                                uint32_t block_idx,
                                                uint8_t *block_buffer,
                                                size_t block_size,
                                                bool is_last,
                                                void *arg)
{
    if (!arg)
    {
        GLTH_LOGE(TAG, "arg is NULL but should be a file stream");
        return GOLIOTH_ERR_INVALID_FORMAT;
    }
    FILE *f = (FILE *) arg;

    fwrite(block_buffer, block_size, 1, f);

    if (is_last)
    {
        GLTH_LOGI(TAG, "Block download complete!");
    }

    return GOLIOTH_OK;
}
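If your storage target is raw flash or a flat buffer rather than a file stream, the offset arithmetic described above comes into play. Here is a minimal sketch (the names and BLOCK_SIZE value are illustrative, not from the Golioth SDK) showing why the offset uses the transfer's fixed block size rather than the current block's size:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: writing blocks into a flat storage region using an
 * explicit offset. BLOCK_SIZE is the fixed block size for the transfer;
 * only the final block may be shorter, which is why the offset multiplies
 * by BLOCK_SIZE rather than the current block_size argument. */
#define BLOCK_SIZE 1024

static void write_block_at_offset(uint8_t *storage,
                                  uint32_t block_idx,
                                  const uint8_t *block_buffer,
                                  size_t block_size)
{
    size_t offset = (size_t) block_idx * BLOCK_SIZE;
    memcpy(storage + offset, block_buffer, block_size);
}
```

With this scheme the callback can be stateless: every block lands at a position derived purely from its index, so blocks could even arrive out of order.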

The new Model is now stored as a file on the SD card, named to match the package name and version number. This is quite handy for troubleshooting as you can pop out the SD card and inspect it on a computer.

4. Switching to the new model on the device

Switching to the new model is where you will likely spend the most time making changes in your own application. I was working from the TensorFlow Lite micro_speech example from Espressif, which hardcodes several of the parameters related to loading and using a learning model.

The approach that I took was to move the pertinent learning model settings to RAM and load them from a header that was added to the model. This header formatting is explained in the README for the Golioth example. In our example code, the bulk of this work is done in model_handler.c.
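As a rough illustration of the idea only (the field names and layout below are hypothetical; the actual header format is described in the README for the Golioth example), the header carries the formerly-hardcoded parameters ahead of the model data:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout for illustration -- the real format is in the
 * README for the Golioth example. The point is that the parameters the
 * application used to hardcode now travel with the model file itself. */
struct model_header {
    uint32_t magic;       /* constant used to sanity-check the file */
    uint32_t model_len;   /* size in bytes of the model data that follows */
    uint32_t arena_size;  /* tensor arena size this model requires */
};

#define MODEL_MAGIC 0x4D4F444CU /* "MODL" */

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong */
static int parse_model_header(const uint8_t *buf, size_t len,
                              struct model_header *out)
{
    if (len < sizeof(*out)) {
        return -1;
    }
    memcpy(out, buf, sizeof(*out));
    return (out->magic == MODEL_MAGIC) ? 0 : -1;
}
```

After parsing, the application allocates the tensor arena from the header's value and hands the model data (starting at buf + sizeof(struct model_header)) to the TFLite interpreter, instead of relying on compile-time constants.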

For your own application, keep in mind any variables necessary to load a new model and how those may change with future training updates.

Take Golioth for a Spin!

Golioth is free for individuals, with usage pricing that includes 1 GB per month of OTA data, so you can get a small test fleet up and running today before seeking budget approval.

Those interested in pushing sensor data back up to the cloud for training future models will find our examples on uploading audio and uploading images helpful. We’d love to hear your questions or just see what cool things you’re working on, so take a moment to post your progress to the Golioth forum.