This article is an introduction to the concept of “The Five Clouds of IoT.” It’s a tool I’ve been developing for some time as a way to help people trying to understand the role of the cloud. This is a mental model I developed as a way to explain service offerings to people interested in building fleets of IoT devices. It includes services developed in-house and services they choose to purchase from external providers.

Today I will discuss the genesis of these classifications and why there is more than one cloud type in the first place. In future articles, I’ll take a deep dive into each of these clouds to compare and contrast their features and look at example services.

I believe it’s important for us to focus our understanding of these broadly-general terms (like cloud and IoT). Doing so lets us define what problems need to be solved in the IoT space, and how best to approach them.

Giving advice about IoT Cloud Services

I have been building internet connected (IoT) devices for a long time. When someone comes to me for advice on their startup or business enhancement, they typically ask for guidance on selecting from different service offerings in the marketplace. There have been, and continue to be, many options out there.

The conversation usually starts out with, “should I use foo for software updates? Or is bar better because it includes more stuff?” They’re coming from the perspective of looking for solutions without understanding the problems they need to solve in the first place.

So as a first step I ask questions about their needs. Questions like:

  • How are you connecting to the internet?
  • What kind of device are you building on?
  • What’s the scale of your deployment?
  • What’s your cost and cost structure?
  • What is your business model?

But the most important questions to answer for picking cloud services are:

  • What are you willing to outsource?
  • What should you outsource?
  • What would get you to market faster if you went with a provider?

From there we can have a more grounded conversation on which aspects of the cloud are needed and which combination of providers they might want to evaluate. Very few service offerings will cover every aspect of an IoT deployment and business needs.

The more of these conversation I’ve had, the more I started to realize that there might be different types of IoT cloud solutions. In fact, I think there are five. And while in practice different cloud companies might offer overlapping features, describing them as separate clouds will help us better understand the core strength of each cloud type and why we might want to use them.

Why are there “Five Clouds of IoT”?

Here are The Five Clouds of IoT, as I define them:

  • The device cloud
  • The connectivity cloud
  • The data cloud
  • The application cloud
  • The development cloud

“You’re just making these up!”, you say? Yes and no. These are classifications based on the service offerings of companies throughout the ecosystem, including the company I founded a couple of years ago (Golioth).  But at the end of the day, these are my classifications and how I view the IoT ecosystem…so yeah, I’m making these up.

However, each cloud represents a business case being served. Each cloud type I listed has a prime example of at least one company serving a particular area of focus. That’s how big the market is and how much need exists for services.

Bringing IoT to the Masses

Each IoT cloud solution helps make deployments possible without massive internal cloud teams at a device maker or service company in the IoT space. Someone who wants to keep their core team small can do so by hiring out parts of the business to different service companies representing one or more of the Five Clouds. They focus on a particular part of the business and lean on service providers to help scale their application.

We’ve already mentioned Golioth as one example of a device cloud. We’re a device cloud because we focus on the management & security of devices and the data they produce. An example of a different cloud would be Soracom, which is a connectivity cloud that addresses different aspects of connecting devices to the internet that spans SIMs, data plans, VPNs, and more.

The key here is that the fundamental focus of a cloud like Golioth and Soracom are different:  device offerings vs connectivity offerings. You need to understand what problem you want to solve when choosing between clouds. For example, why you’d want to leverage Golioth, Soracom, or both.

Which Cloud is right for you?

In future articles in this series, I’ll take a deep dive into each individual cloud and look at example companies that exemplify the characteristics of each cloud. As you start to put together your business needs for your company or your next startup idea, you will be able to piece together which clouds you will need, and find some good companies that can help you achieve your goals.

Why is it so hard to build IoT products? Beyond the many technical challenges during R&D, companies accustomed to building products that are not connected have to go through a challenge of cultural transformation. A big part of this is shifting from selling a standalone hardware product to selling a hardware product with a software service. This move a company from the low-bar of warranty-based service to a company working with service level agreements (SLA) and all their challenges. Let’s explore how this impacts company culture inside these organizations.

Building great, but not connected devices

If your company is used to shipping devices that are not connected, it’s likely the structure of your operations is fairly straightforward. If the device works, everyone is happy. If the device doesn’t work, the customer sooner or later finds out.

They might get a free repair or replacement, if the device is still under warranty. If it’s outside of the warranty window, it depends on your company’s policy of post-warranty support, or possibly a 3rd party that made it a business supporting your older equipment.

Expectations for connected IoT devices

For connected products, the bar is set higher. The expectations of customers are higher because they no longer pay only for “a box”, but they also pay a monthly subscription.

From an engineering perspective, it’s no longer just hardware tightly coupled with software; it’s now hardware and software-as-a-service (SaaS). And while hardware has a warranty, service has an service level agreement (SLA). These three letters cause a lot of headaches if not executed right. They will turn a successful proof of concept into a production disaster.

Challenges of meeting an SLA

A transformation needs to happen across your organization in order to provide a good service.  Most of these steps will not be discovered in the proof of concept phase. Only an organization that works across teams will survive this transition.

While proving out your product’s viability, the scale you are dealing with is orders of magnitudes smaller than your production scale. In the early days, your engineering team handles both development and operations, because none of those are refined enough to be handed over to the operations team.

If you forego thinking about the transition to production, you are in for a surprise! It is critical to prove both the hardware and the software-as-a-service can be supported and operated at scale.

The following are a handful of critical checks you should do before wrapping up your proof-of-concept.

SLA means regular software updates and quick bug fixes

With any large population of connected devices, your customer will need guarantees of security of the system, and therefore require periodic updates and a commitment to timely fixes of critical vulnerabilities.

To do that at scale, you need a system that can support rolling out a large volume of over-the-air (OTA) updates in a very short time, while keeping the cloud and personnel costs at commercially sensible levels. That also means you must be able to quickly identify which devices require an update, which are a priority, and which ones can wait. In summary, you need a flexible, scalable rollout system.

SLA means 3rd party issues are your issues

Where traditionally you would have full end-to-end control of your product, there is now a 3rd party component to your problems: a network. It is unlikely you or your customer fully control the end-to-end communication path between the IoT device and the cloud application. It is even less likely that you can do anything to solve those network problems, apart from picking up a phone and calling the network provider. That does not scale. Nonetheless, what customers see is your device and your application, so it is your problem when it doesn’t work–whatever the reason.

An obvious mitigation strategy is to have solid contracts in place with all the vendors, but even that is not bulletproof or commercially feasible. My experience is that even with great contracts, most real-world situations end up in a finger-pointing discussion.

In those situations, the best you can do is arm yourself with sophisticated tools and evidence.

Both your device and your cloud application need to be equipped with network diagnostics tools, not only for the initial installation purposes, but also for long term operations.

Ideally, your device will be able to self-recover from frequently encountered (or transient) problems, while caching data it was unable to communicate during the outage. After a major outage, you better make sure devices reconnect in an orderly fashion, otherwise your cloud will crash as soon as your devices recover. And then your devices crash, and you end up in a loop of engineering horror and customer disappointment.

Where automated recovery is not possible, an alert to responsible personnel should be made as soon as possible from the cloud side. At the same time, you don’t want to overload your support with false alarms or transient problems, as that will ultimately make them ignore the alarms when they are actually needed. Your tooling needs to be smart enough to assess the (un)certainty and urgency of the problems.

SLA means resolving problems on time

Historically, response (not resolution) time has been a parameter that defined a good SLA. Response time in simple terms meant how long did it take your support team to react to a reported issue. However, a reaction doesn’t mean much if the customer is left stranded for weeks without a resolution. Therefore, more and more companies want a higher degree of confidence to sign a contract, and will require commitment to resolution time.

In the past shipping back a device would have been an option. Today, you will be expected to resolve problems remotely.

At scale, you certainly don’t want to remotely access the console of each individual device. If you have 100 devices and a patch takes you 10 minutes, that’s 1000 minutes. That’s 16 developer-hours. At 1000 devices, that’s 166 developer-hours–almost a month of regular 8 hour work-days! Will your customers wait one month to have their problem fixed? Or is it worth employing a four person team to bring that number down to one business week (and still have unhappy customers)?

What you need in this case is a system that can

  • Get you information about the entire population of devices efficiently
  • Lets you perform actions across a large number of devices at scale
  • Report back per-device results at scale

With those tools, a single engineer will have the power of an entire team.

SLA requires efficient root cause analysis

While you might have resolved the problem for your specific customer, you don’t want to stop there. There are other customers having the same problems, they just don’t know it yet, or haven’t reported it.

That’s where you will benefit from a continuous stream of logs and operational metrics from your devices that can be browsed at scale. Let’s say a specific version of software on your device randomly crashes. With a good set of logs and metrics, you could quickly narrow down the problem to something like high CPU usage, low memory, or specific revision of hardware coupled with specific history of updates.

The holy grail of exceeding SLAs

The reactive approach outlined above still carries a lot of cost and risk. But it’s a starting point from which you can get to the optimal solution. Ideally, you should be able to find problems ahead of time and prevent them, rather than reacting to them. That’s where having access to large volumes of historical monitoring data and logs will unlock a ton of opportunities.
With a detailed knowledge of what happened with each of your devices over their lifetime, it becomes much easier to start seeing patterns, and to connect the dots.

Ultimately, a couple of months of work of a data engineer (together with a domain expert) can get you business insights beyond what you might be able to imagine.

Conclusion

The challenge of transforming a hardware-based business into a hardware and software-as-a-service business is tremendous. Proofs of concept are therefore omnipresent, but a successful hardware trial is only the beginning of a long journey to a scalable, profitable product line. Ultimately, your company’s reputation as a “cloud” business is at stake.

Obviously I have been thinking deeply about this topic as we build Golioth to make IoT products easier to bootstrap, and much easier to maintain. In future posts I’ll dig into the details of how we address each issue raised here. The important thing today is that we all embrace the concept that adding connectivity is a fundamental change, and not just a hardware revision.

Before you start your migration from proof-of-concept to production (and deployment), make sure you have all the infrastructure in place to support rapid growth. Great engineers that made the early trials successful require tools and support to handle issues at production scale.

A successful proof-of-concept or prototype should prove scalability. It should demonstrate your company’s ability to exceed service level expectations of your customers, and make profit for the company building a connected device.