Why is it so hard to build IoT products? Beyond the many technical challenges during R&D, companies accustomed to building products that are not connected have to go through a challenge of cultural transformation. A big part of this is shifting from selling a standalone hardware product to selling a hardware product with a software service. This move a company from the low-bar of warranty-based service to a company working with service level agreements (SLA) and all their challenges. Let’s explore how this impacts company culture inside these organizations.
Building great, but not connected devices
If your company is used to shipping devices that are not connected, it’s likely the structure of your operations is fairly straightforward. If the device works, everyone is happy. If the device doesn’t work, the customer sooner or later finds out.
They might get a free repair or replacement, if the device is still under warranty. If it’s outside of the warranty window, it depends on your company’s policy of post-warranty support, or possibly a 3rd party that made it a business supporting your older equipment.
Expectations for connected IoT devices
For connected products, the bar is set higher. The expectations of customers are higher because they no longer pay only for “a box”, but they also pay a monthly subscription.
From an engineering perspective, it’s no longer just hardware tightly coupled with software; it’s now hardware and software-as-a-service (SaaS). And while hardware has a warranty, service has an service level agreement (SLA). These three letters cause a lot of headaches if not executed right. They will turn a successful proof of concept into a production disaster.
Challenges of meeting an SLA
A transformation needs to happen across your organization in order to provide a good service. Most of these steps will not be discovered in the proof of concept phase. Only an organization that works across teams will survive this transition.
While proving out your product’s viability, the scale you are dealing with is orders of magnitudes smaller than your production scale. In the early days, your engineering team handles both development and operations, because none of those are refined enough to be handed over to the operations team.
If you forego thinking about the transition to production, you are in for a surprise! It is critical to prove both the hardware and the software-as-a-service can be supported and operated at scale.
The following are a handful of critical checks you should do before wrapping up your proof-of-concept.
SLA means regular software updates and quick bug fixes
With any large population of connected devices, your customer will need guarantees of security of the system, and therefore require periodic updates and a commitment to timely fixes of critical vulnerabilities.
To do that at scale, you need a system that can support rolling out a large volume of over-the-air (OTA) updates in a very short time, while keeping the cloud and personnel costs at commercially sensible levels. That also means you must be able to quickly identify which devices require an update, which are a priority, and which ones can wait. In summary, you need a flexible, scalable rollout system.
SLA means 3rd party issues are your issues
Where traditionally you would have full end-to-end control of your product, there is now a 3rd party component to your problems: a network. It is unlikely you or your customer fully control the end-to-end communication path between the IoT device and the cloud application. It is even less likely that you can do anything to solve those network problems, apart from picking up a phone and calling the network provider. That does not scale. Nonetheless, what customers see is your device and your application, so it is your problem when it doesn’t work–whatever the reason.
An obvious mitigation strategy is to have solid contracts in place with all the vendors, but even that is not bulletproof or commercially feasible. My experience is that even with great contracts, most real-world situations end up in a finger-pointing discussion.
In those situations, the best you can do is arm yourself with sophisticated tools and evidence.
Both your device and your cloud application need to be equipped with network diagnostics tools, not only for the initial installation purposes, but also for long term operations.
Ideally, your device will be able to self-recover from frequently encountered (or transient) problems, while caching data it was unable to communicate during the outage. After a major outage, you better make sure devices reconnect in an orderly fashion, otherwise your cloud will crash as soon as your devices recover. And then your devices crash, and you end up in a loop of engineering horror and customer disappointment.
Where automated recovery is not possible, an alert to responsible personnel should be made as soon as possible from the cloud side. At the same time, you don’t want to overload your support with false alarms or transient problems, as that will ultimately make them ignore the alarms when they are actually needed. Your tooling needs to be smart enough to assess the (un)certainty and urgency of the problems.
SLA means resolving problems on time
Historically, response (not resolution) time has been a parameter that defined a good SLA. Response time in simple terms meant how long did it take your support team to react to a reported issue. However, a reaction doesn’t mean much if the customer is left stranded for weeks without a resolution. Therefore, more and more companies want a higher degree of confidence to sign a contract, and will require commitment to resolution time.
In the past shipping back a device would have been an option. Today, you will be expected to resolve problems remotely.
At scale, you certainly don’t want to remotely access the console of each individual device. If you have 100 devices and a patch takes you 10 minutes, that’s 1000 minutes. That’s 16 developer-hours. At 1000 devices, that’s 166 developer-hours–almost a month of regular 8 hour work-days! Will your customers wait one month to have their problem fixed? Or is it worth employing a four person team to bring that number down to one business week (and still have unhappy customers)?
What you need in this case is a system that can
- Get you information about the entire population of devices efficiently
- Lets you perform actions across a large number of devices at scale
- Report back per-device results at scale
With those tools, a single engineer will have the power of an entire team.
SLA requires efficient root cause analysis
While you might have resolved the problem for your specific customer, you don’t want to stop there. There are other customers having the same problems, they just don’t know it yet, or haven’t reported it.
That’s where you will benefit from a continuous stream of logs and operational metrics from your devices that can be browsed at scale. Let’s say a specific version of software on your device randomly crashes. With a good set of logs and metrics, you could quickly narrow down the problem to something like high CPU usage, low memory, or specific revision of hardware coupled with specific history of updates.
The holy grail of exceeding SLAs
The reactive approach outlined above still carries a lot of cost and risk. But it’s a starting point from which you can get to the optimal solution. Ideally, you should be able to find problems ahead of time and prevent them, rather than reacting to them. That’s where having access to large volumes of historical monitoring data and logs will unlock a ton of opportunities.
With a detailed knowledge of what happened with each of your devices over their lifetime, it becomes much easier to start seeing patterns, and to connect the dots.
Ultimately, a couple of months of work of a data engineer (together with a domain expert) can get you business insights beyond what you might be able to imagine.
The challenge of transforming a hardware-based business into a hardware and software-as-a-service business is tremendous. Proofs of concept are therefore omnipresent, but a successful hardware trial is only the beginning of a long journey to a scalable, profitable product line. Ultimately, your company’s reputation as a “cloud” business is at stake.
Obviously I have been thinking deeply about this topic as we build Golioth to make IoT products easier to bootstrap, and much easier to maintain. In future posts I’ll dig into the details of how we address each issue raised here. The important thing today is that we all embrace the concept that adding connectivity is a fundamental change, and not just a hardware revision.
Before you start your migration from proof-of-concept to production (and deployment), make sure you have all the infrastructure in place to support rapid growth. Great engineers that made the early trials successful require tools and support to handle issues at production scale.
A successful proof-of-concept or prototype should prove scalability. It should demonstrate your company’s ability to exceed service level expectations of your customers, and make profit for the company building a connected device.