Using the Hugging Face Inference API for Device Audio Analysis
Golioth Pipelines works with Hugging Face, as shown in our recent AI launch. This post highlights how to use an audio classification model hosted on Hugging Face with data that is recorded on a microcontroller-based device, sent over a secure network connection to Golioth, and routed through Pipelines.
While most commonly known as the place where models and datasets are uploaded and shared, Hugging Face also provides a compute service in the form of its free serverless inference API and production-ready dedicated inference endpoints. Unlike other platforms that offer only proprietary models, Hugging Face allows access to over 150,000 open source models via its inference APIs. Additionally, private models can be hosted on Hugging Face, which is a common use case for Golioth users who have trained models on data collected from their device fleets.
Audio Analysis with Pipelines
Because the Hugging Face inference APIs use HTTP, they are easy to target with the webhook transformer. The structure of the request body depends on the model being invoked, but for models that operate on media files, such as audio or video, the payload is typically raw binary data.
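To make the request format concrete, here is a minimal Python sketch of the HTTP request that the webhook transformer sends on the device's behalf: a POST to the model's inference URL with the raw audio bytes as the body and a bearer token in the Authorization header. The model URL matches the pipeline below; the token value and dummy payload are placeholders for illustration.

```python
# Sketch of the request the webhook transformer makes to the Hugging Face
# serverless inference API. The body is raw binary audio, not JSON.
import urllib.request

HF_MODEL_URL = (
    "https://api-inference.huggingface.co/models/"
    "superb/hubert-large-superb-er"
)

def build_request(audio_bytes: bytes, token: str) -> urllib.request.Request:
    """Build a POST request whose body is the raw audio payload."""
    return urllib.request.Request(
        HF_MODEL_URL,
        data=audio_bytes,  # raw bytes, forwarded as-is
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )

# Dummy 16-byte payload and placeholder token, for illustration only.
req = build_request(b"\x00" * 16, "hf_xxx")
print(req.get_method())  # POST
```

Sending the request (with `urllib.request.urlopen(req)`) returns the JSON classification results shown later in this post.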
In the following pipeline, we target the serverless inference API with an audio sample streamed from a device. In this scenario, we want to perform sentiment analysis of the audio, then pass the results onto Golioth’s timeseries database, LightDB Stream, so that changes in sentiment can be observed over time. An alternative destination, or multiple destinations, could easily be added.
Click here to use this pipeline in your project on Golioth.
```yaml
filter:
  path: "/audio"
steps:
  - name: emotion-recognition
    transformer:
      type: webhook
      version: v1
      parameters:
        url: https://api-inference.huggingface.co/models/superb/hubert-large-superb-er
        headers:
          Authorization: $HUGGING_FACE_TOKEN
  - name: embed
    transformer:
      type: embed-in-json
      version: v1
      parameters:
        key: text
  - name: send-lightdb-stream
    destination:
      type: lightdb-stream
      version: v1
```
Note that while Hugging Face's serverless inference API is free to use, it is rate-limited and subject to high latency and intermittent failures due to cold starts. For production use cases, dedicated inference endpoints are recommended.
We can pick any supported model on Hugging Face for our audio analysis task. As shown in the URL, the Hubert-Large for Emotion Recognition model is targeted, and audio content streamed to the /audio path is delivered directly to Hugging Face. An example of how to upload audio to Golioth using an ESP32 can be found here.
Results from the emotion recognition inference look as follows:
```json
[
  {
    "score": 0.6310836672782898,
    "label": "neu"
  },
  {
    "score": 0.2573806643486023,
    "label": "sad"
  },
  {
    "score": 0.09393830597400665,
    "label": "hap"
  },
  {
    "score": 0.017597444355487823,
    "label": "ang"
  }
]
```
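Once these results land in LightDB Stream, a downstream consumer might only care about the top-scoring label. A small sketch of extracting it from the response above (the `response_body` string simply reproduces the example output):

```python
# Pick the highest-scoring emotion label from the inference response.
import json

response_body = """
[
  {"score": 0.6310836672782898, "label": "neu"},
  {"score": 0.2573806643486023, "label": "sad"},
  {"score": 0.09393830597400665, "label": "hap"},
  {"score": 0.017597444355487823, "label": "ang"}
]
"""

scores = json.loads(response_body)
top = max(scores, key=lambda entry: entry["score"])
print(top["label"])  # neu
```

Tracking this top label over time in LightDB Stream is one way to observe changes in sentiment, as described above.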
Expanding Capabilities
Countless models are uploaded to Hugging Face on a daily basis, and the inference API integration with Golioth Pipelines makes it simple to incorporate the latest new functionality into any connected device product. Let us know what models you are using on the Golioth Forum!