If you have ever tried communicating with a device on a private network, you may have encountered Network Address Translation (NAT). Fundamentally, when one device needs to send data to another device, it needs to know how to address it. On IP-based networks, devices are addressed using an IP address. Unfortunately, the number of connected devices has long outpaced the number of unique addresses in the IPv4 address space. Because of this, public IP addresses have to be shared between devices, which causes a few problems.
How to Share an IP Address
You probably accessed this blog post from a machine that does not have a public IP address. Rather, it has been assigned a private IP address on a network, perhaps via the Dynamic Host Configuration Protocol (DHCP), and it talks to a router that is responsible for sending data to and from the device. To access this post, your device first had to use the Domain Name System (DNS) to resolve blog.golioth.io to a public IP address, then had to send a request to that IP address for the content of this page.
When that request arrives at a router or some other intermediary, the router knows where to deliver the request because the IP address of the server hosting blog.golioth.io is specified. It forwards the request along, and the server responds with the requested content. However, the server does not know that your device sent the request. The router has replaced the private IP address and port from your device with its own public IP address and port, and it has made an entry in a translation table noting that incoming data for that port should be directed to your device. The server sends the content back to the router, which replaces its own public IP address and port with your device’s IP address and port, then forwards it along. The content arrives at your device, appearing as though the server sent it directly to you. Meanwhile, the router is doing the same song and dance for many other devices, maintaining all of the mappings from its own IP address and ports to internal IP addresses and ports. This is known as Network Address Translation (NAT).
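The translation table described above can be sketched in a few lines of C. This is a simplified illustration rather than how any particular router implements NAT: the struct and function names, the fixed table size, and the port numbering scheme are all assumptions made for the example, and real implementations typically also key on the protocol and the remote endpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NAT_MAX_ENTRIES 8           /* hypothetical table size */
#define NAT_FIRST_PUBLIC_PORT 40000 /* hypothetical first public port */

/* One mapping between a private (address, port) and a public port. */
struct nat_entry {
    bool in_use;
    uint32_t private_addr;
    uint16_t private_port;
    uint16_t public_port;
};

struct nat_table {
    struct nat_entry entries[NAT_MAX_ENTRIES];
};

/* Outgoing packet: reuse the existing mapping or create a new one.
 * Returns the public port to use, or 0 if the table is full. */
uint16_t nat_map_outgoing(struct nat_table *t, uint32_t private_addr,
                          uint16_t private_port)
{
    assert(t != NULL);
    struct nat_entry *free_slot = NULL;

    for (size_t i = 0; i < NAT_MAX_ENTRIES; i++) {
        struct nat_entry *e = &t->entries[i];
        if (e->in_use && e->private_addr == private_addr &&
            e->private_port == private_port) {
            return e->public_port;
        }
        if (!e->in_use && free_slot == NULL) {
            free_slot = e;
        }
    }
    if (free_slot == NULL) {
        return 0;
    }
    free_slot->in_use = true;
    free_slot->private_addr = private_addr;
    free_slot->private_port = private_port;
    free_slot->public_port =
        (uint16_t)(NAT_FIRST_PUBLIC_PORT + (free_slot - t->entries));
    return free_slot->public_port;
}

/* Incoming packet: find the private endpoint for a public port.
 * NULL means there is no mapping and the packet would be dropped. */
const struct nat_entry *nat_lookup_incoming(const struct nat_table *t,
                                            uint16_t public_port)
{
    assert(t != NULL);
    for (size_t i = 0; i < NAT_MAX_ENTRIES; i++) {
        const struct nat_entry *e = &t->entries[i];
        if (e->in_use && e->public_port == public_port) {
            return e;
        }
    }
    return NULL;
}
```

Note that nat_lookup_incoming() can only succeed after nat_map_outgoing() has been called for that device, which is the root of the problems discussed in the next section.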
What Could Go Wrong?
This works great in simple request-response scenarios like fetching a blog post from a server with a public IP address. However, what if the server wants to say something to the device before the device talks to it? The server may know the public IP address of the router, but the router has no way of knowing which device the message is actually intended for. There is no entry in the NAT translation table until an outgoing message creates one. This also becomes a problem in peer-to-peer scenarios, where both devices are on a private network, making it such that neither device can talk to the other (this is solved using a public rendezvous point, such as a STUN server, but that’s a story for another post).
Another problem is that routers don’t want to maintain mappings forever. At some point if no outgoing messages have been observed, the entry will be removed from the translation table and any subsequent incoming traffic will be dropped. In many cases, this timeout is quite aggressive (e.g. 5 minutes or less). Typically this is resolved by sending “keep alive” messages, ensuring that entries are not removed and data can flow freely in both directions. For your laptop or a server in a data center that might work fine for the most part. For highly constrained devices, it can quickly drain battery or consume precious limited bandwidth.
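To make the expiry behavior concrete, here is a minimal sketch of a single mapping with an idle timer. The names and the 300 second timeout are hypothetical, chosen to match the 5 minute figure above. Actual NAT devices differ in their timeout values and in whether incoming traffic also refreshes the timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NAT_IDLE_TIMEOUT_S 300u /* hypothetical 5 minute idle timeout */

struct nat_mapping {
    bool active;
    uint32_t last_outgoing_s; /* time of the most recent outgoing packet */
};

/* Any outgoing packet (data or keep alive) creates or refreshes the mapping. */
void nat_note_outgoing(struct nat_mapping *m, uint32_t now_s)
{
    assert(m != NULL);
    m->active = true;
    m->last_outgoing_s = now_s;
}

/* Returns true if an incoming packet would be forwarded to the device.
 * The mapping is removed once it has been idle for the full timeout. */
bool nat_accept_incoming(struct nat_mapping *m, uint32_t now_s)
{
    assert(m != NULL);
    if (m->active && (now_s - m->last_outgoing_s) >= NAT_IDLE_TIMEOUT_S) {
        m->active = false;
    }
    return m->active;
}
```

In this model, a keep alive is simply a call to nat_note_outgoing() made before the idle timeout elapses.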
Maybe you decide that it’s okay for incoming traffic to be dropped after some period of time, as long as when you next contact the server you are able to re-establish a mapping and fetch any data that you need. Unfortunately, there is no guarantee that the router, or any other layer in the hierarchy of intermediaries performing NAT (it’s actually much more complicated, with Carrier-Grade NAT adding even more translation steps), will assign you the same public IP address and port. Therefore, when you try to continue talking to the server over a previously established session, it will not recognize you. This means you’ll have to re-establish the session, which typically involves expensive cryptographic operations and sending a handful of messages back and forth before actually delivering the data you were interested in sending originally.
The worst case scenario is that your device needs to send data somewhat frequently, but not frequently enough that NAT mappings are maintained. For example, if a device needs to send a tiny sensor reading every 30 minutes, and the NAT timeout is 5 minutes, it will either need to send a keep alive message every 5 minutes (that’s five extra messages for every reading you actually need to send!), or it will need to re-establish the session every time it delivers a reading. In both cases, you are going to be using much more power than if you were just able to send your sensor reading alone every 30 minutes.
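The keep alive overhead in this example is simple arithmetic, sketched below. The function names are made up for illustration, and the model assumes a keep alive sent exactly at each interval is enough to preserve the mapping. In practice you would send slightly more often than the timeout.

```c
#include <assert.h>
#include <stdint.h>

/* Number of keep alive messages needed between two consecutive readings.
 * The reading itself refreshes the mapping, so it is not counted.
 * A keep alive interval of 0 means keep alives are disabled. */
uint32_t keepalives_per_reading(uint32_t report_interval_s,
                                uint32_t keepalive_interval_s)
{
    if (keepalive_interval_s == 0 ||
        report_interval_s <= keepalive_interval_s) {
        return 0;
    }
    /* ceil(report / keepalive) - 1 */
    return (report_interval_s + keepalive_interval_s - 1) /
               keepalive_interval_s - 1;
}

/* Total messages (readings plus keep alives) sent in 24 hours. */
uint32_t messages_per_day(uint32_t report_interval_s,
                          uint32_t keepalive_interval_s)
{
    assert(report_interval_s > 0);
    uint32_t readings = 86400u / report_interval_s;
    return readings *
           (1u + keepalives_per_reading(report_interval_s,
                                        keepalive_interval_s));
}
```

With a 1800 second reporting interval and a 300 second keep alive interval, this gives 5 keep alives per reading and 288 messages per day, compared to 48 messages per day when only the readings are sent.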
Solving the Problem
Unfortunately, the distributed nature of the internet means that we aren’t going to be able to address the issue by nicely asking carriers and ISPs to extend their NAT timeouts. However, we can make it such that being issued a new IP address and port doesn’t force us to re-establish a session.
More than a year ago, we announced support for DTLS 1.2 Connection IDs. DTLS provides a secure transport over UDP, which many devices, especially those that are power constrained, use to communicate with Golioth’s CoAP device APIs. Typically, DTLS sessions are established based on a “five tuple”: source address, source port, transport protocol, destination address, destination port. If any of these change, a handshake must be performed to establish a new session. To mitigate this overhead, a Connection ID can be negotiated during the initial handshake, and can be used in subsequent records to continue to associate messages even after changes in source IP or port.
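The difference between the two lookup strategies can be illustrated with a small server-side sketch. It is not taken from any DTLS library: the types, the fixed 4 byte Connection ID length, and the find_session() helper are assumptions for the example. Since the destination address, destination port, and protocol are fixed from the server’s point of view, only the source address and port are compared. A real implementation would also update the stored peer address only after the record has been authenticated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CID_LEN 4 /* hypothetical fixed Connection ID length */

struct peer_addr {
    uint32_t ip;
    uint16_t port;
};

struct session {
    bool in_use;
    bool has_cid; /* true if a Connection ID was negotiated */
    uint8_t cid[CID_LEN];
    struct peer_addr peer;
};

static bool same_peer(struct peer_addr a, struct peer_addr b)
{
    return a.ip == b.ip && a.port == b.port;
}

/* Find the session for an incoming record. `cid` is NULL when the record
 * carries no Connection ID. Returns NULL if a new handshake is required. */
struct session *find_session(struct session *sessions, size_t count,
                             struct peer_addr src, const uint8_t *cid)
{
    assert(sessions != NULL);
    for (size_t i = 0; i < count; i++) {
        struct session *s = &sessions[i];
        if (!s->in_use) {
            continue;
        }
        if (cid != NULL && s->has_cid) {
            if (memcmp(s->cid, cid, CID_LEN) == 0) {
                /* Simplification: follow the peer to its new address.
                 * Record authentication is omitted in this sketch. */
                s->peer = src;
                return s;
            }
        } else if (cid == NULL && !s->has_cid && same_peer(s->peer, src)) {
            return s;
        }
    }
    return NULL;
}
```

A session without a Connection ID is lost as soon as the source port changes, while a session with one is found again regardless of the address the record arrives from.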
Going back to our previous example of a device that sends a single sensor reading message every 30 minutes, enabling Connection ID would mean that a new handshake would not have to be performed after NAT timeout. The device can send that single message, then go back to sleep. In fact, depending on how long the server is willing to store connection state, the device could sleep for much longer, sending once a day or even less frequently. This doesn’t solve the issue of cloud to device traffic being dropped after NAT timeout (check back for another post on that topic), but for many low power use cases, being able to sleep for an extended period of time is more important than being able to immediately push data to devices.
Configuring the Golioth Firmware SDK for Sleepy Devices
By default, the Golioth Firmware SDK will send keep alive messages to ensure that an entry is preserved in the NAT translation table. However, this functionality can be disabled by setting CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S to 0, or effectively disabled by setting it to a very large value.
CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=0
If using Zephyr, we’ll also need to set the receive timeout to a value greater than the interval at which we will be sending data. Otherwise, the client will attempt to reconnect after 30 seconds by default if it has not received any messages. In this example we’ll send data every 130 seconds, so setting the receive timeout to 200 ensures that we won’t attempt to reconnect between sending.
CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=200
To demonstrate the impact of NAT timeouts, we’ll initially build the hello example without enabling Connection IDs. To ensure that we wait long enough for a NAT timeout, we need to update the loop to send every 130 seconds instead of every 5 seconds.
This example is using a Hologram SIM and connecting via the AT&T network. NAT timeouts may vary from one carrier to another. AT&T currently documents UDP inactivity timeouts as 30 seconds.
while (true)
{
    LOG_INF("Sending hello! %d", counter);
    ++counter;
    k_sleep(K_SECONDS(130));
}
Building and flashing the hello sample on a Nordic Thingy91 results in the following behavior.
*** Booting nRF Connect SDK v2.7.0-5cb85570ca43 ***
*** Using Zephyr OS v3.6.99-100befc70c74 ***
[00:00:00.506,378] <dbg> hello_zephyr: main: start hello sample
[00:00:00.506,378] <inf> golioth_samples: Bringing up network interface
[00:00:00.506,408] <inf> golioth_samples: Waiting to obtain IP address
[00:00:13.236,877] <inf> lte_monitor: Network: Searching
[00:00:17.593,994] <inf> lte_monitor: Network: Registered (roaming)
[00:00:17.594,696] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:00:18.839,904] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:18.840,118] <inf> hello_zephyr: Sending hello! 0
[00:00:18.840,179] <inf> hello_zephyr: Golioth client connected
[00:00:18.840,270] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:02:28.840,209] <inf> hello_zephyr: Sending hello! 1
[00:02:32.194,396] <wrn> golioth_coap_client: 1 resends in last 10 seconds
[00:02:46.252,868] <wrn> golioth_coap_client: 4 resends in last 10 seconds
[00:03:03.419,219] <wrn> golioth_coap_client: 3 resends in last 10 seconds
[00:03:04.986,389] <wrn> golioth_coap_client: Packet 0x2001e848 (reply 0x2001e890) was not replied to
[00:03:06.045,715] <wrn> golioth_coap_client: Packet 0x2001e638 (reply 0x2001e680) was not replied to
[00:03:15.213,592] <wrn> golioth_coap_client: 6 resends in last 10 seconds
[00:03:21.874,298] <wrn> golioth_coap_client: Packet 0x2001ec90 (reply 0x2001ecd8) was not replied to
[00:03:25.419,921] <wrn> golioth_coap_client: 5 resends in last 10 seconds
[00:03:36.565,765] <wrn> golioth_coap_client: 5 resends in last 10 seconds
[00:03:40.356,933] <wrn> golioth_coap_client_zephyr: Receive timeout
[00:03:40.356,964] <inf> golioth_coap_client_zephyr: Ending session
[00:03:40.356,994] <inf> hello_zephyr: Golioth client disconnected
[00:03:47.035,675] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:03:47.035,705] <inf> hello_zephyr: Golioth client connected
[00:03:47.035,827] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
After initially connecting and successfully sending “Sending hello! 0”, we are inactive for 130 seconds (00:18 to 02:28). When we then attempt to send “Sending hello! 1”, we see that the server never responds, eventually causing us to reach the “Receive timeout” and reconnect. This is because by the time we send “Sending hello! 1”, our entry has been removed from the NAT translation table, and when we are assigned a new public IP address and port the server is unable to associate messages with the existing DTLS session.
Because using Connection IDs does involve sending extra data in every message, it is disabled in the Golioth Firmware SDK by default. In scenarios such as this one, where the cost of more frequent handshakes clearly outweighs the few extra bytes, Connection IDs can be enabled with CONFIG_GOLIOTH_USE_CONNECTION_ID.
CONFIG_GOLIOTH_USE_CONNECTION_ID=y
Now when we build and flash the hello example on a Thingy91, we can see our 130 second delay, but then the successful delivery of “Sending hello! 1”. 130 seconds later, we see another successful delivery of “Sending hello! 2”.
*** Booting nRF Connect SDK v2.7.0-5cb85570ca43 ***
*** Using Zephyr OS v3.6.99-100befc70c74 ***
[00:00:00.508,636] <dbg> hello_zephyr: main: start hello sample
[00:00:00.508,666] <inf> golioth_samples: Bringing up network interface
[00:00:00.508,666] <inf> golioth_samples: Waiting to obtain IP address
[00:00:13.220,001] <inf> lte_monitor: Network: Searching
[00:00:16.318,908] <inf> lte_monitor: Network: Registered (roaming)
[00:00:16.319,641] <inf> golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
[00:00:21.435,180] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:21.435,394] <inf> hello_zephyr: Sending hello! 0
[00:00:21.435,424] <inf> hello_zephyr: Golioth client connected
[00:00:21.435,546] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[00:02:31.435,455] <inf> hello_zephyr: Sending hello! 1
[00:04:41.435,546] <inf> hello_zephyr: Sending hello! 2
Next Steps
To see how often your devices are being forced to reconnect to Golioth after periods of inactivity, check out our documentation on device connectivity metrics. Devices that effectively maintain long lasting connections will see a significant difference between their Session Established and Last Report timestamps. If you have any questions about optimizing your devices for low power, reach out to us on the forum!