Realtime audio content tagging to improve discoverability.

BETA - This API is in Beta. This means that while we will ensure availability to a more-than-reasonable degree, there may be unforeseen issues or outages.

You MUST have mechanisms in place to account for results not being returned due to an error, or for requests timing out or running indefinitely.

Audio content tagging is available for audio and video streams that are producing live media. If authorization is required, all stream targets must be pre-signed so that we can access them. We currently have limited support for authorization headers. For now, you may pass a Bearer authorization header with your request (included in your payload body as described below).
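As a sketch of what the body-level authorization looks like, the snippet below builds a session request payload. The field names (streamTarget, authorization) are illustrative assumptions; consult the session creation API below for the exact schema.

```python
import json


def build_session_payload(stream_url, bearer_token=None):
    """Build a JSON request body for a tagging session.

    Assumed field names: "streamTarget" for the (pre-signed) stream URL
    and "authorization" for the Bearer header carried in the body.
    """
    payload = {"streamTarget": stream_url}
    if bearer_token:
        # Limited header support: the Bearer header travels in the body.
        payload["authorization"] = f"Bearer {bearer_token}"
    return json.dumps(payload)


body = build_session_payload(
    "https://cdn.example.com/live/room.m3u8?sig=abc", "tok123"
)
```

You would then send this body with your HTTP client of choice when creating a session.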

The platform will consume your stream target, and listen for 60 seconds. This way, you can easily request a new tagging session of a stream target at a later stage - making it cost effective and easy to get updated results.

Available Tags

We keep the tags we use available over a JSON endpoint.
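The response shape below is an assumption for illustration only; the real endpoint URL and schema are documented by the service. This sketch just shows parsing such a response into a tag list.

```python
import json

# Hypothetical shape of the tags endpoint response -- the actual URL and
# schema come from the service documentation; this is only an example.
sample_response = '{"tags": ["music", "sports", "news", "gaming"]}'


def parse_tags(raw_json):
    """Return the list of available tag names from the endpoint payload."""
    return json.loads(raw_json)["tags"]


available = parse_tags(sample_response)
```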


New tags can be added on demand, and privately scoped tag collections are available upon request.


Rapid Samples - We currently operate in a way that lets you "dip in and out" of stream targets: every API call creates a one-minute tagging session on a stream target. By spacing these calls out over the lifetime of a stream target, you can adjust for changes in the topic or mood of the user content.
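One way to space calls out is to precompute session start offsets across the stream's expected lifetime. The sketch below is interval arithmetic only, under the assumption that you trigger the actual session call yourself at each offset.

```python
def session_schedule(stream_lifetime_s, sessions):
    """Return offsets (in seconds) at which to start each one-minute
    tagging session, spread evenly across the stream's lifetime."""
    interval = stream_lifetime_s / sessions
    return [round(i * interval) for i in range(sessions)]


# e.g. a two-hour stream sampled six times: one session every 20 minutes
offsets = session_schedule(2 * 60 * 60, 6)
```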

Reliability - Reliability and accuracy of tagging is an important goal for this API. However, this is entirely dependent on the conversation being presented. Our provided STT accuracy is exceptional, but if the conversation lacks any definition, or is purely a steady stream of short utterances, accuracy will be poor. If, after 60 seconds, there wasn't much audio content, or the returned tags are of poor quality, you should wait a while to skip past a "lull" phase in the conversation.
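Lull handling can be sketched as a simple resample decision: if a session returns few or weak tags, back off and try again later. The "confidence" field here is an assumption about the result shape, not a documented part of the API.

```python
def should_resample(tags, min_tags=3, min_confidence=0.5):
    """Return True when a session's result looks like a conversational
    lull (too few tags at or above the confidence floor)."""
    good = [t for t in tags if t.get("confidence", 0.0) >= min_confidence]
    return len(good) < min_tags


# A single tag, however confident, suggests waiting and resampling later.
lull = should_resample([{"name": "music", "confidence": 0.9}])
```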

Seed Sentence - To improve reliability, you can provide a seed sentence to be evaluated with the resulting STT corpus. For example, you can include the text title of the room, in addition to any other textual metadata you consider relevant in your use case. This is documented below as seedSentence in the create-a-tagging-session API.
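A seed sentence can be assembled from whatever textual metadata you hold. The sketch below joins a room title with extra metadata fields; the field names are examples from a hypothetical data model, not part of this API.

```python
def build_seed_sentence(title, extras=()):
    """Compose a seedSentence value from a room title plus any other
    textual metadata, skipping empty fragments."""
    parts = [title, *extras]
    return ". ".join(p.strip() for p in parts if p and p.strip())


seed = build_seed_sentence(
    "Friday Jazz Listening Party", ["live improvisation", "saxophone"]
)
```

The resulting string would be passed as seedSentence when creating the tagging session.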

OOM - Out of memory. If your session reaches an OOM event, the session will terminate without providing a result. We provision memory for each session based on the average requirements of users over the last 7 days. Providing a video stream in, for example, 2160p puts incredible strain on our scheduler and is completely wasteful for obtaining audio data. If you use video streams as your stream target, please be mindful of this fact, as you will still be charged for the audio processing time up until the OOM event occurred. A video stream at 360p is entirely sufficient for audio quality while allowing us fast turnaround times for your sessions. When sending audio streams, you rarely, if ever, need to worry about OOM events.
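When your stream source offers multiple video renditions (for example in an HLS master playlist), you can avoid wasteful high resolutions client-side. This is a hedged sketch of one selection strategy, not anything the API requires.

```python
def pick_rendition(heights, preferred=360):
    """From available video heights, pick the smallest at or above the
    preferred floor -- 360p is plenty for audio tagging. Falls back to
    the largest available if nothing meets the floor."""
    candidates = [h for h in heights if h >= preferred]
    return min(candidates) if candidates else max(heights)


choice = pick_rendition([1080, 720, 360, 240])
```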

Audio Sample Rate - Audio sampling rates are a complicated subject, and do not need to be understood in full to use this API. Sampling rates are measured in hertz (Hz); a typical MP3 has a sampling rate of 44,100 Hz. The optimal sample rate for audio speech-to-text sits at around 16,000 Hz. Do not upsample or downsample audio data. Internally, we transcode audio formats and audio sample rates, so generally you do not need to worry about this.
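If you want to inspect a source's sample rate rather than resample it yourself, Python's standard wave module can read it directly from a WAV header. The in-memory clip below exists only to make the sketch self-contained.

```python
import io
import wave


def wav_sample_rate(wav_bytes):
    """Return the sample rate (Hz) declared in a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate()


# Build a one-second 16 kHz mono silent clip in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz, in the sweet spot for STT
    w.writeframes(b"\x00\x00" * 16000)

rate = wav_sample_rate(buf.getvalue())
```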

Audio Formats - The following stream target formats are accepted: m3u8 (only #EXT-X-STREAM-INF directive links), mp4, h264, h263, h262, avc1, mp4a, ts (by providing an m3u8 URL), ogg, AAC, HLS. We support more formats than listed; reach out if you would like to inquire about support for a different format.
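A quick client-side sanity check of a stream target's container can be done by extension before creating a session. This extension-based heuristic is an assumption for illustration; it covers only a subset of the formats listed above and is no substitute for the service's own validation.

```python
from urllib.parse import urlparse

# Subset of accepted containers that are recognisable by URL extension.
ACCEPTED_EXTENSIONS = {"m3u8", "mp4", "ogg", "aac"}


def looks_supported(url):
    """Heuristic: check the URL path's extension against known formats."""
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return ext in ACCEPTED_EXTENSIONS


ok = looks_supported("https://cdn.example.com/live/stream.m3u8?sig=abc")
```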