NIP-A0: Voice Messages
NIP-A0 gives short voice notes their own Nostr event kinds, with a direct audio URL as content, reply structure through NIP-22, and optional waveform and duration metadata through imeta tags.
Voice notes need a shape that is not just a file link
Voice messages sit between chat, social posting and media. A user expects them to feel immediate, short and playable inline. A bare MP4 or OGG URL inside a normal note does not tell a client whether to render a voice-note player, show a waveform, limit duration or treat the event as a reply.
NIP-A0 defines root voice messages as kind 1222 and voice replies as kind 1244. The content is a direct URL to an audio file. The spec recommends keeping recordings to 60 seconds or less, and recommends audio/mp4 with AAC or Opus encoding for broad compatibility.
The event stays simple on purpose. It does not define live audio, rooms or long podcast episodes. It defines the small product behavior people recognize from messaging apps: press, speak, send, listen.
Two event kinds plus optional waveform metadata
Kind 1222 is the root voice message. Kind 1244 is the reply form and must follow the NIP-22 comment structure. Both put the audio URL in content. Tags can include normal Nostr metadata such as hashtags, geohashes or reply references where relevant.
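As a rough sketch of the two shapes, the Python below builds unsigned event templates for a root voice message and a voice reply. The helper names (voice_root, voice_reply) and the example inputs are illustrative and not part of the spec. The uppercase root tags and lowercase parent tags follow one reading of NIP-22 and should be checked against the current text. Signing fields (id, pubkey, sig) are left out.

```python
import time

VOICE_ROOT_KIND = 1222   # root voice message
VOICE_REPLY_KIND = 1244  # voice reply


def voice_root(audio_url, extra_tags=None, created_at=None):
    """Unsigned template for a root voice message (hypothetical helper)."""
    return {
        "kind": VOICE_ROOT_KIND,
        "created_at": int(time.time()) if created_at is None else created_at,
        "content": audio_url,            # direct URL to the audio file
        "tags": list(extra_tags or []),  # e.g. [["t", "nostr"]]
    }


def voice_reply(audio_url, root, parent, created_at=None):
    """Unsigned template for a voice reply (hypothetical helper).

    `root` and `parent` are dicts with "id", "pubkey" and "kind".
    Assumed NIP-22 layout: uppercase tags point at the thread root,
    lowercase tags point at the direct parent.
    """
    tags = [
        ["E", root["id"], "", root["pubkey"]],
        ["K", str(root["kind"])],
        ["P", root["pubkey"]],
        ["e", parent["id"], "", parent["pubkey"]],
        ["k", str(parent["kind"])],
        ["p", parent["pubkey"]],
    ]
    return {
        "kind": VOICE_REPLY_KIND,
        "created_at": int(time.time()) if created_at is None else created_at,
        "content": audio_url,
        "tags": tags,
    }
```

When a reply answers the root voice message directly, the same event is passed as both root and parent, so the uppercase and lowercase tags carry the same id.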
The optional visual layer comes through the NIP-92 imeta tag. A voice message can include waveform values and a duration in seconds so a client can draw a compact audio preview without first downloading the whole file.
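One way to produce that metadata is sketched below: raw amplitude samples are reduced to a small list of integers, then packed into an imeta tag as space-separated "key value" strings. The function names, the default of 50 buckets and the 0 to 100 scale are assumptions made for illustration; the spec should be consulted for the exact field format.

```python
def waveform_from_samples(samples, buckets=50):
    """Reduce raw amplitudes to at most `buckets` integers scaled to 0..100.

    Hypothetical helper: takes the peak absolute amplitude per bucket.
    """
    if not samples:
        return []
    n = len(samples)
    buckets = min(buckets, n)
    peaks = []
    for i in range(buckets):
        # Integer arithmetic keeps bucket edges exact and non-empty.
        start = i * n // buckets
        end = (i + 1) * n // buckets
        peaks.append(max(abs(s) for s in samples[start:end]))
    top = max(peaks) or 1  # avoid dividing by zero on pure silence
    return [round(100 * p / top) for p in peaks]


def imeta_tag(url, waveform, duration_seconds):
    """Build an imeta tag carrying url, waveform and duration entries."""
    return [
        "imeta",
        f"url {url}",
        "waveform " + " ".join(str(v) for v in waveform),
        f"duration {round(duration_seconds)}",
    ]


wave = waveform_from_samples([0, 5, -10, 10, 2, -4], buckets=3)
tag = imeta_tag("https://example.com/a.m4a", wave, 8.4)
```

A few dozen values are usually enough for a compact preview, which keeps the tag small compared with the audio file itself.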
That waveform field matters for UX. A voice note without duration or visual shape feels like a random download. A voice note with metadata feels like a native conversation object.
Fabian added the voice-message NIP in July 2025
The visible file history is short. Fabian added NIP-A0 in July 2025 through PR #1984, then updated the audio format and waveform recommendation days later through PR #1990. That makes it one of the younger NIPs in the current set.
The official example uses a Blossom URL, which is a useful clue about the media stack around it. NIP-A0 defines the voice event; Blossom or another media server still hosts the audio file. NIP-92 supplies the metadata language.
For readers, that means the standard is not a full voice platform. It is a small bridge between short-form audio UX and the existing Nostr media system.
Clients should make recording limits visible
A good implementation enforces or clearly warns about the 60-second expectation, uploads to a media server, writes an imeta tag with duration and waveform, and renders a small player that does not surprise the user with large downloads.
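A pre-publish check along these lines could drive the warning. This is a minimal sketch: the function name, the message wording and the choice to warn rather than block are assumptions, while the 60-second and audio/mp4 values come from the recommendations described earlier.

```python
MAX_SECONDS = 60              # recommended upper bound for a voice note
PREFERRED_MIME = "audio/mp4"  # recommended container for compatibility


def recording_warnings(duration_seconds, mime_type):
    """Return human-readable warnings for a recording before upload.

    Hypothetical helper: an empty list means nothing to flag.
    """
    warnings = []
    if duration_seconds > MAX_SECONDS:
        over = duration_seconds - MAX_SECONDS
        warnings.append(
            f"Recording is {over:.0f}s over the {MAX_SECONDS}s guideline."
        )
    if mime_type != PREFERRED_MIME:
        warnings.append(
            f"{mime_type} may not play everywhere; {PREFERRED_MIME} is preferred."
        )
    return warnings


for message in recording_warnings(75, "audio/webm"):
    print(message)
```

Returning warnings instead of raising lets the UI decide whether to stop the recording at the limit or simply show the overage to the user.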
Reply voice messages should behave like comments through NIP-22 so they can attach to the right parent. If a client treats every voice file as a standalone social post, conversations become fragmented.
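On the reading side, attaching replies to their parents can be as simple as grouping kind 1244 events by the id in their lowercase e tag. The sketch below assumes events are plain dicts and that the lowercase e tag names the direct parent, as in NIP-22; the helper names are illustrative.

```python
from collections import defaultdict

VOICE_REPLY_KIND = 1244


def parent_id(event):
    """Return the id from the first lowercase "e" tag, or None."""
    for tag in event.get("tags", []):
        if len(tag) >= 2 and tag[0] == "e":
            return tag[1]
    return None


def group_replies(events):
    """Map parent event id -> list of voice replies, oldest first."""
    threads = defaultdict(list)
    for ev in events:
        if ev.get("kind") != VOICE_REPLY_KIND:
            continue  # root voice messages and other kinds are not replies
        pid = parent_id(ev)
        if pid is not None:
            threads[pid].append(ev)
    for replies in threads.values():
        replies.sort(key=lambda ev: ev.get("created_at", 0))
    return dict(threads)
```

A client can then render each root voice message with its replies nested underneath, rather than listing every audio event as a separate post.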
Accessibility also matters. A future client may add transcript metadata, but even now the UI can show duration, playback speed and an obvious pause state.
Voice adds privacy and moderation weight
Voice can reveal identity, background sound, location clues and emotion more directly than text. Clients should not treat uploads casually, especially when files are stored on public media servers.
Moderation is harder too. Audio abuse is less searchable than text, so communities need reporting tools and conservative playback defaults, such as not autoplaying audio.
Direct sources
Use the official file first, then the commit history, implementation references and adjacent standards. NIPs move, and product guidance gets weaker when those source trails are hidden.





