I built the Internet of Memes, making it possible for anyone at my company to push funny pictures to the ePaper displays on the desks of all our coworkers.

I work at Golioth, but I don’t really work at Golioth. We’re headquartered in San Francisco but we’re also a completely remote company spread across the US, Brazil, and Poland. I love working with insanely talented and interesting people, but we don’t get to see each other face to face in the office each day. So, at the risk of “all work and no play”, I took on an after-hours project to make sharing memes with each other a bit more IRL.

ePaper is Standard-Issue at this Company

MePaper powered by Golioth

As part of onboarding, each new employee is shipped a small development board that includes an ePaper display, RGB LEDs, buttons, a speaker, an accelerometer, and WiFi. It’s Adafruit’s MagTag, and we use it for developer training, so it makes sense to hand one out to everyone when they join up. But the boards sit unused the vast majority of the time. Why not use that messaging real estate for something fun?

U Got IoT! meme

The project comes in two parts: firmware for the ESP32-S2 that drives the MagTag and a Slack Bot that handles the image publishing. Let’s jump into the device-side first.

Espressif’s ESP-IDF, Golioth’s Device Management, and Google Drive

The device firmware got a bit of a head start because we have a MagTag-based demo that’s part of the Golioth Firmware SDK. The demo can already drive the hardware and use all Golioth features.

RAM considerations

The images for the display are 296×128, black and white, so you need 4.7 kB of RAM to download an image. This was a challenge for the on-chip SRAM, but the ESP32-S2 WROVER module on these boards includes 2 MB of SPI PSRAM. I enabled this in Kconfig and used malloc() to place a frame buffer for the display in external RAM.

#define FRAME_BUFFER_SIZE   5000

/* Get memory space for frame buffer */
fb = (uint8_t *) malloc(FRAME_BUFFER_SIZE);

The buffer is slightly oversized because it’s used to store an HTTP response that includes a bit of header data.
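
Depending on which Kconfig options are enabled, a plain malloc() of this size may already be served from PSRAM. If you’d rather be explicit about where the allocation lands, ESP-IDF’s heap_caps_malloc() lets you request external RAM directly. A minimal sketch (nothing MagTag-specific here):

#include "esp_heap_caps.h"

/* Explicitly request the frame buffer from external PSRAM */
fb = (uint8_t *) heap_caps_malloc(FRAME_BUFFER_SIZE, MALLOC_CAP_SPIRAM);
if (fb == NULL) {
    ESP_LOGE(TAG, "Failed to allocate frame buffer in PSRAM");
}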

Leverage the ESP-IDF http_client

The next challenge was how to get the images onto the device. It is possible to repurpose Golioth’s OTA system to supply these images, but for reasons I’ll get to later on, that’s an inefficient approach. What I really want is a way to upload images to a single location and have many devices access them. That’s how webpages work, right?

I adapted the https_with_url() example from the ESP-IDF esp_http_client_example code. Just supply a URL to the image and the http(s) client will grab the page and dump it into the frame buffer.

void fetch_page(void) {
    ESP_LOGI(TAG, "MEME FILE");
    const char* url_string_p = mepaper_provisional_url;
    esp_http_client_config_t config = {
        .url = url_string_p,
        .event_handler = _http_event_handler,
        .crt_bundle_attach = esp_crt_bundle_attach,
        .user_data = (char*)fb,
        .timeout_ms = 5000,
    };
    esp_http_client_handle_t client = esp_http_client_init(&config);
    esp_err_t err = esp_http_client_perform(client);

    if (err == ESP_OK) {
        ESP_LOGI(TAG, "HTTPS Status = %d, content_length = %d",
                esp_http_client_get_status_code(client),
                esp_http_client_get_content_length(client));
        validate_and_display();
    } else {
        ESP_LOGE(TAG, "Error perform http request %s", esp_err_to_name(err));
        xEventGroupSetBits(_event_group, EVENT_DOWNLOAD_COMPLETE);
    }
    esp_http_client_cleanup(client);
}
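
The config block above hands the frame buffer to _http_event_handler through .user_data. My handler was adapted from the same esp_http_client example; the sketch below shows the general shape of the data-copy path (offset bookkeeping simplified), not the exact code in the firmware:

#include <string.h>
#include "esp_http_client.h"

static esp_err_t _http_event_handler(esp_http_client_event_t *evt)
{
    static int output_len = 0;                     /* bytes copied so far */
    uint8_t *buffer = (uint8_t *) evt->user_data;  /* the PSRAM frame buffer */

    switch (evt->event_id) {
    case HTTP_EVENT_ON_DATA:
        /* Stream each chunk of the response into the frame buffer */
        if (buffer && (output_len + evt->data_len) <= FRAME_BUFFER_SIZE) {
            memcpy(buffer + output_len, evt->data, evt->data_len);
            output_len += evt->data_len;
        }
        break;
    case HTTP_EVENT_ON_FINISH:
    case HTTP_EVENT_DISCONNECTED:
        output_len = 0;                            /* reset for the next download */
        break;
    default:
        break;
    }
    return ESP_OK;
}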

A question of image formats

What kind of image should this be, anyway? My friend Larry Bank has written a bunch of image decoders for embedded systems, which I considered using. But I ended up going with an easier solution: a format I had never heard of before, but that’s perfectly suited to a 1-bit-per-pixel display like this one, called PBM or “Portable Bit Map”.

The PBM specification describes an image with 1-bit color depth. The byte ordering and endianness are exactly what this ePaper display needs, and the header is super simple to parse. When the file is downloaded to PSRAM, the firmware searches for the header. Once found, the data after that header is piped directly to the display. Voila!
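
For reference, a raw (“P4”) PBM file is nothing more than a short ASCII header followed by packed pixel data, one bit per pixel; for a 128×296 image that payload is exactly 4,736 bytes. A file exported from GIMP looks roughly like this (GIMP may also add a comment line starting with #), and the dimension line is exactly what the HEADER_PATTERN below searches for:

P4
128 296
<4,736 bytes of packed 1-bit pixel data>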

/* "128 296\n" -- the dimension line of the rotated PBM image */
static const uint8_t HEADER_PATTERN[] = { 0x31, 0x32, 0x38, 0x20, 0x32, 0x39, 0x36, 0x0A };
static const uint8_t WHITESPACE[] = { 0x0A, 0x0B, 0x0D, 0x20 };

#define HEADER_LEN      sizeof(HEADER_PATTERN)
#define WHITESPACE_SIZE sizeof(WHITESPACE)
#define WHITESPACE_IDX  7   /* assumed value: the byte after the height, which the spec only requires to be whitespace */

static bool is_whitespace(uint8_t test_char) {
    for (uint8_t i=0; i<WHITESPACE_SIZE; i++) {
        if (test_char == WHITESPACE[i]) {
            return true;
        }
    }
    return false;
}

static bool headers_match(uint8_t *slice) {
    // Spec is complicated by "whitespace" instead of a specific character
    for (uint8_t i=0; i<HEADER_LEN; i++) {
        if (i == WHITESPACE_IDX) {
            if (is_whitespace(slice[i]) == false) {
                return false; //this char must be whitespace
            }
        }
        else if (HEADER_PATTERN[i] != slice[i]) {
            return false; //members don't match
        }
    }
    return true;
}
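
The firmware’s validate_and_display() ties these helpers together by scanning the downloaded buffer for the header and handing everything after it to the display. I haven’t reproduced the exact implementation here, but the idea is roughly the following, where epaper_show_bitmap() is a hypothetical stand-in for the actual display-driver call:

#define BITMAP_LEN  ((128 * 296) / 8)   /* 4,736 bytes of packed 1-bit pixels */

/* Hypothetical display call; the real firmware hands the bitmap to the
 * MagTag's ePaper driver here. */
extern void epaper_show_bitmap(const uint8_t *bitmap, size_t len);

static void validate_and_display(void) {
    /* Scan the HTTP response for the "128 296" dimension line */
    for (size_t i = 0; (i + HEADER_LEN + BITMAP_LEN) <= FRAME_BUFFER_SIZE; i++) {
        if (headers_match(&fb[i])) {
            /* Everything after the header is raw pixel data */
            epaper_show_bitmap(&fb[i + HEADER_LEN], BITMAP_LEN);
            return;
        }
    }
    ESP_LOGE(TAG, "No valid PBM header found in downloaded file");
}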

Images are created in your favorite editing software (I use GIMP). Set the mode to 1-bit depth, size the image to 296×128 (then rotate it counterclockwise by 90 degrees), and export it as PBM.

A Slack Bot Made of Python

With the device side proved out, I turned my mind to the “distribution” problem. How do I let different users control the fleet? Here is a list of considerations that were on my mind:

  • How can users register their own devices for this service?
  • How can users push their own images to the entire fleet?
  • How will firmware upgrades be managed?

As a fully-remote company, we’re on Slack all day long. A Slackbot (a daemon that sits in the chat listening for commands) makes the most sense for granting some fleet control to everyone in the company.

How should the devices be provisioned so that they are all accessible by the bot? This topic could itself be a whole post. I decided the best way was to let each person add their MagTag to a Golioth project they control, assigning it the mepaper tag (“meme” + “epaper” = mepaper). They then share an API key for that project with the bot so their devices receive notifications when there is a new image available for the screen. OTA firmware updates can also be staged for registered devices.

Starting from a stock MagTag, you download the latest firmware release and flash it to the board using esptool.py (the programming tool from Espressif). When the board boots, a serial shell is available over USB where the user can assign WiFi credentials and Golioth device credentials. That’s it for setup; the bot takes care of the rest.

Mepaper Slack bot publish command

Each device observes a mepaper endpoint on Golioth LightDB State that holds the URL of the image to display. Whenever it changes, the boards are notified and use the HTTPS client to download the new image. The bot listens for a publish <url_of_image> message on the Slack channel and validates that the file exists and is in the proper format. Then the bot iterates through each API key it has stored, updating the LightDB State endpoint for every board tagged as mepaper.
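
The state document itself is tiny: just a path holding the current image URL. The exact layout below is my illustration rather than a dump from the Console (and the URL is a made-up placeholder), but the shape each device watches is roughly:

{
  "mepaper": "https://example.com/path/to/latest-meme.pbm"
}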

This was chosen instead of pushing OTA artifacts for each image, because the image would need to be uploaded multiple times (once for each registered project API key) and a copy of every past meme would remain as an artifact. By uploading the images to our Google Drive, each one is stored only once, and all devices can access it as long as they have a copy of the share link.

Mepaper Slack bot commands

Of course, as a quick and dirty hack, the original version of the firmware wasn’t the greatest. I made sure to use the OTA firmware upgrade built into our sample code. As with the image updates, users who the bot recognizes as admins can issue /register ota and /register rollout commands. These accept parameters like version, package name, and the filename of the new firmware. For security, the bot will only upload firmware that is located in the same directory as the bot itself.

Oh the Fun of a Side-Channel for Memes!

Having a dedicated, non-intrusive screen for Memes turns out to be an excellent boost to the remote workplace vibe. Your computer doesn’t need to be unlocked, and you don’t need to switch context to see the newest message. The reason for pushing a meme doesn’t have to be meaningful, it is purely a mechanism for spreading joy. And it worked!

We’ve previously written about the Demo Culture at Golioth. The “Demo early, demo often” mantra is an effort to break out of a rut where you only communicate with your coworkers for Official Business™. Sending memes is one more way to be silly, and enjoy each others’ company, even if we can’t be in the same room. It’s also a neat way to eat our own dogfood.

Future Improvements

There’s lots of room to improve this project, but the first area I’ve been actively working on is the fleet organization itself. When starting out, I wanted each user to be able to provision their own device, so I chose to have users add the device to their own project and share an API key with the bot using a private command on Slack. This is an interesting way to control devices across multiple projects, but it’s overly complex for this particular application.

Bot-created credentials, all on the same project

Currently, users need to generate credentials, assign them to the device via serial, and share an API key using the Slack bot. A better arrangement would be to use the Slack bot to acquire the credentials in the first place. A future improvement will change the /register command to generate new PSK-ID/PSK credentials and send them to the user privately in Slack. This way, all devices can be registered on one project, even though the user doesn’t have access to that project’s Golioth Console.

OTA is crucial to fleet management

With the fleet already deployed, how do you make these changes? Via OTA firmware update, of course. I have already implemented a key rotation feature using Golioth’s Remote Procedure Call. The bot can send each device a new PSK-ID/PSK, which will be tested and stored, moving each member of the fleet over to one single project.

The engineering community has undergone a remote transformation in the last two years. With such a radical change in corporate culture comes a number of challenges that an office culture never had to face. Is scrum still relevant in fully remote companies? Are standups needed? What is the process to onboard new hires into the engineering culture? How do we invent better remote-first approaches, rather than trying to (dysfunctionally) emulate an office experience in a remote environment?

A well implemented “demo culture” can help companies establish healthy habits for sharing, soliciting constructive comments without specifically asking for them, being a source of inspiration, and facilitating cross-team collaboration while preventing “silo-building” within the company. As we continue to build The Company of Golioth, we are also building The Culture of Golioth. This article will walk through some of the already implemented pieces.

How Demo Culture works

Demo culture also informally translates “lunch conversations” you might have had in an in-office setup into a more efficient, asynchronous remote equivalent. You no longer need to find a place to sit at the right table: anyone can be part of any conversation.

On a day-to-day basis, we use two tools to achieve our goals:

  • Loom – A video sharing service to quickly record demos
  • A “#demos” Slack channel to capture conversations around the topic

An engineer might have a new feature on the Golioth back end that they have been developing, but it only exists on their local branch and local build. They decide to set up a small example application that tests the feature and helps to showcase that it is part of an upcoming release.

These manifest as 2-3 minute videos showing a very small part of development, possibly a subtask of a larger story. Because they act as markers throughout a sprint, we also reference them during sprint reviews. This is especially helpful if there is work that wasn’t planned, carries across multiple sprints, or is an idea for an upcoming feature. Sometimes…the demos are just for fun or to showcase something the engineer is learning.

Benefits

You might be asking, “Why are you so informal about it? If demos and showcasing work are critical to The Company of Golioth, why not make it a formalized task that engineers need to do every day/week/sprint/month?”

The informality is one of the best parts, and also why it’s a part of The Culture of Golioth. We want team members to share ideas they’re having and to make it part of their process as they explore new things. Most of all, we want people to be excited about the things they’re working on.

On a broader basis, the entire company benefits. First and foremost is early access to new ideas. An engineer who shares what they are thinking about on a regular basis helps start conversations that later surface in more formal settings (like sprint planning). Since we are a fully remote company, video meetings are a necessity to transfer information, but also a hindrance to getting other work done (as in any company). Demos pollinate interesting ideas without the formality (and synchronous demands) of group video chats.

Another benefit is that demos always include a hands-on approach, and not just a passing comment about “how something could be better”. This means that an engineer who doesn’t like the placement of a button can take the initiative to actually move it on their local machine and then showcase it, all without interrupting the flow of other engineers. They simply show the difference, state their case, and move on. There’s a lot of power in “show, don’t tell” to help foster understanding of the points being proposed.

When ideas bubble up as a possible change of direction, either on a smaller development scale or even at a company level, we can start a discussion around that particular video showcase of the proposed change or demonstration. Because it’s informal, it’s just part of the “lunch conversation” mentioned earlier. Unfortunately, hashing out an idea that comes up in person at lunch is much easier; you can immediately talk through what might be good or bad about a new approach. With Demo Culture, a video showcases the idea and the Slack channel acts as an anchor around which we can discuss it. That feedback might also spin off future demos from the original submitter or from others. And the recordings make it easier to come back to those conversations weeks or months later (more on that in just a bit).

Finally, in an ever-growing company, we need more ways to interact with each other, as simple video chats will get harder to do. Watching someone on video is no replacement for sitting next to them at lunch, but it can help you get to know someone’s personality and recognize their areas of interest. As teams grow, the tendency is for them to silo, since it becomes harder for everyone to keep track of more people. Seeing new, friendly faces showcasing something they’re interested in not only allows people on other teams to see what’s being worked on, but might also inspire new collaborations between teams. Cross-pollination is possibly the most beneficial output of Demo Culture.

Challenges

Trying to change company culture is always tough, even if you’re building it from the ground up. And we are! We are still a small team, so this is the right time to put ideas and practices in place. But a corollary to that is the challenge as a company grows and shifts. When we bring on new employees, we will need to lead by example and work to maintain the culture over time.

As much as we can encourage employees to demonstrate what they’re working on, each employee is different. Some might want to only show completely finished work, whereas others might be comfortable showing an early demo of something. You need to plan for an uneven distribution, especially as new employees lean into the culture. At the top level, putting requirements on top of something like demos could negatively impact how they are perceived. And while we showed above that sharing demos should be very “low overhead”, it’s still important that employees feel like they have the time to think through and contribute something to the discourse.

Finally, the work itself can be another challenge. Someone working on a long-term project might feel that there’s only something to contribute at the end. We still try to encourage them to share waypoints along their journey. It helps others in the organization understand what they’re working on, allows people to get excited about upcoming features, and solicits soft feedback from future stakeholders. In all cases, it’s important that we encourage our teammates to share whatever they feel comfortable sharing, and hopefully to enjoy doing it.

Demo of a Demo

We must walk the walk! Let’s look at how this works in practice.

In our last blog post, we showed how to set device credentials using goliothctl and the Zephyr shell. This was the result of lots of great work from our backend team over the past few weeks. Throughout the process, they shared progress and showed how it would work using different tools.

First, our lead engineer Alvaro showed a demo of the provisioning from a shell on our test Arduino SDK.

Then Miguel showed how to set up credentials using goliothctl.

Then there was another demo where Alvaro showed how he was using goliothctl.

Mike from the DevRel team showed how he was using this to provision devices for a training event, using similar methods that weren’t yet available to the public.

Finally, this all culminated in a released feature that is now available to all users and that we shared publicly on the blog.

Demo Culture is an evolving concept

We are continuously codifying and experimenting with the concept of Demo Culture at Golioth. As we continue to build out this idea, we will share it with the broader community. If you know of other companies doing similar things, we would love to hear about it on our community Discord or on social media. We will share some of the companies we took inspiration from below.

Reference material

The concept of “Demo culture” is not new as an overall topic. We have borrowed and modified ideas from a range of different sources (most notably GitLab) to help bolster our initial concept of how it will work. What’s more, we continue to refine this process.

Why is it so hard to build IoT products? Beyond the many technical challenges during R&D, companies accustomed to building products that are not connected have to go through a cultural transformation. A big part of this is shifting from selling a standalone hardware product to selling a hardware product with a software service. This moves a company from the low bar of warranty-based service to working with service level agreements (SLAs) and all their challenges. Let’s explore how this impacts company culture inside these organizations.

Building great, but not connected devices

If your company is used to shipping devices that are not connected, it’s likely the structure of your operations is fairly straightforward. If the device works, everyone is happy. If the device doesn’t work, the customer sooner or later finds out.

They might get a free repair or replacement if the device is still under warranty. If it’s outside of the warranty window, it depends on your company’s policy on post-warranty support, or possibly on a 3rd party that has made a business of supporting your older equipment.

Expectations for connected IoT devices

For connected products, the bar is set higher. Customer expectations rise because they no longer pay only for “a box”; they also pay a monthly subscription.

From an engineering perspective, it’s no longer just hardware tightly coupled with software; it’s now hardware and software-as-a-service (SaaS). And while hardware has a warranty, service has a service level agreement (SLA). These three letters cause a lot of headaches if not executed right, and they will turn a successful proof of concept into a production disaster.

Challenges of meeting an SLA

A transformation needs to happen across your organization in order to provide a good service. Most of the steps required will not be discovered in the proof-of-concept phase. Only an organization that works across teams will survive this transition.

While proving out your product’s viability, the scale you are dealing with is orders of magnitude smaller than your production scale. In the early days, your engineering team handles both development and operations, because neither is refined enough to be handed over to an operations team.

If you forego thinking about the transition to production, you are in for a surprise! It is critical to prove both the hardware and the software-as-a-service can be supported and operated at scale.

The following are a handful of critical checks you should do before wrapping up your proof-of-concept.

SLA means regular software updates and quick bug fixes

With any large population of connected devices, your customer will need guarantees about the security of the system, and will therefore require periodic updates and a commitment to timely fixes of critical vulnerabilities.

To do that at scale, you need a system that can support rolling out a large volume of over-the-air (OTA) updates in a very short time, while keeping the cloud and personnel costs at commercially sensible levels. That also means you must be able to quickly identify which devices require an update, which are a priority, and which ones can wait. In summary, you need a flexible, scalable rollout system.

SLA means 3rd party issues are your issues

Where traditionally you would have full end-to-end control of your product, there is now a 3rd party component to your problems: a network. It is unlikely you or your customer fully control the end-to-end communication path between the IoT device and the cloud application. It is even less likely that you can do anything to solve those network problems, apart from picking up a phone and calling the network provider. That does not scale. Nonetheless, what customers see is your device and your application, so it is your problem when it doesn’t work–whatever the reason.

An obvious mitigation strategy is to have solid contracts in place with all the vendors, but even that is not bulletproof or commercially feasible. My experience is that even with great contracts, most real-world situations end up in a finger-pointing discussion.

In those situations, the best you can do is arm yourself with sophisticated tools and evidence.

Both your device and your cloud application need to be equipped with network diagnostics tools, not only for the initial installation purposes, but also for long term operations.

Ideally, your device will be able to self-recover from frequently encountered (or transient) problems, while caching data it was unable to communicate during the outage. After a major outage, you’d better make sure devices reconnect in an orderly fashion; otherwise your cloud will crash as soon as your devices recover. And then your devices crash, and you end up in a loop of engineering horror and customer disappointment.

Where automated recovery is not possible, an alert to responsible personnel should be made as soon as possible from the cloud side. At the same time, you don’t want to overload your support with false alarms or transient problems, as that will ultimately make them ignore the alarms when they are actually needed. Your tooling needs to be smart enough to assess the (un)certainty and urgency of the problems.

SLA means resolving problems on time

Historically, response (not resolution) time has been a parameter that defined a good SLA. Response time, in simple terms, meant how long it took your support team to react to a reported issue. However, a reaction doesn’t mean much if the customer is left stranded for weeks without a resolution. Therefore, more and more companies want a higher degree of confidence before signing a contract, and will require a commitment to resolution time.

In the past, shipping the device back would have been an option. Today, you will be expected to resolve problems remotely.

At scale, you certainly don’t want to remotely access the console of each individual device. If you have 100 devices and a patch takes you 10 minutes, that’s 1,000 minutes, or roughly 16 developer-hours. At 1,000 devices, that’s around 166 developer-hours, almost a month of regular 8-hour workdays! Will your customers wait one month to have their problem fixed? Or is it worth employing a four-person team to bring that number down to one business week (and still have unhappy customers)?

What you need in this case is a system that can:

  • Get you information about the entire population of devices efficiently
  • Let you perform actions across a large number of devices at scale
  • Report back per-device results at scale

With those tools, a single engineer will have the power of an entire team.

SLA requires efficient root cause analysis

While you might have resolved the problem for your specific customer, you don’t want to stop there. Other customers are having the same problems; they just don’t know it yet, or haven’t reported it.

That’s where you will benefit from a continuous stream of logs and operational metrics from your devices that can be browsed at scale. Let’s say a specific version of the software on your devices randomly crashes. With a good set of logs and metrics, you could quickly narrow down the problem to something like high CPU usage, low memory, or a specific hardware revision coupled with a specific update history.

The holy grail of exceeding SLAs

The reactive approach outlined above still carries a lot of cost and risk. But it’s a starting point from which you can get to the optimal solution. Ideally, you should be able to find problems ahead of time and prevent them, rather than reacting to them. That’s where having access to large volumes of historical monitoring data and logs will unlock a ton of opportunities. With detailed knowledge of what happened with each of your devices over its lifetime, it becomes much easier to start seeing patterns and to connect the dots.

Ultimately, a couple of months of work from a data engineer (together with a domain expert) can get you business insights beyond what you might be able to imagine.

Conclusion

The challenge of transforming a hardware-based business into a hardware and software-as-a-service business is tremendous. Proofs of concept are therefore omnipresent, but a successful hardware trial is only the beginning of a long journey to a scalable, profitable product line. Ultimately, your company’s reputation as a “cloud” business is at stake.

Obviously I have been thinking deeply about this topic as we build Golioth to make IoT products easier to bootstrap, and much easier to maintain. In future posts I’ll dig into the details of how we address each issue raised here. The important thing today is that we all embrace the concept that adding connectivity is a fundamental change, and not just a hardware revision.

Before you start your migration from proof-of-concept to production (and deployment), make sure you have all the infrastructure in place to support rapid growth. The great engineers who made the early trials successful will need tools and support to handle issues at production scale.

A successful proof-of-concept or prototype should prove scalability. It should demonstrate your company’s ability to exceed the service level expectations of your customers and to make a profit building a connected device.