# Audio understanding

Source: <https://ai.google.dev/gemini-api/docs/audio>

---
Gemini can analyze and understand audio input, enabling use cases like the following:

* Describe, summarize, or answer questions about audio content.
* Provide a transcription of the audio.
* Analyze specific segments of the audio.

This guide shows you how to use the Gemini API to generate a text response to audio input.

### Before you begin

Before calling the Gemini API, ensure you have [your SDK of choice](/gemini-api/docs/downloads) installed, and a [Gemini API key](/gemini-api/docs/api-key) configured and ready to use.
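If you're using the Python SDK shown in the examples below, a minimal setup looks like the following sketch. It assumes the `google-genai` package is installed and that your key is available in the `GEMINI_API_KEY` environment variable (passing `api_key` explicitly is an alternative).

```python
from google import genai

# By default, the client picks up the GEMINI_API_KEY environment variable.
client = genai.Client()

# Alternatively, pass the key explicitly (placeholder value shown):
# client = genai.Client(api_key="YOUR_API_KEY")
```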
## Input audio

You can provide audio data to Gemini in the following ways:

* Upload an audio file before making a request to `generateContent`.
* Pass inline audio data with the request to `generateContent`.

### Upload an audio file

You can use the [Files API](/gemini-api/docs/files) to upload an audio file. Always use the Files API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.

The following code uploads an audio file and then uses the file in a call to `generateContent`.
```python
from google import genai

client = genai.Client()

myfile = client.files.upload(file="path/to/sample.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=["Describe this audio clip", myfile]
)

print(response.text)
```
To learn more about working with media files, see [Files API](/gemini-api/docs/files).

### Pass audio data inline

Instead of uploading an audio file, you can pass inline audio data in the request to `generateContent`:
```python
from google import genai
from google.genai import types

client = genai.Client()

# Read the audio file into memory.
with open('path/to/small-sample.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio clip',
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type='audio/mp3',
        )
    ]
)

print(response.text)
```
A few things to keep in mind about inline audio data (a sketch of choosing between the two approaches follows this list):

* The maximum request size is 20 MB, which includes text prompts, system instructions, and files provided inline. If your file would push the _total request size_ over 20 MB, use the Files API to upload the audio file instead.
* If you're using an audio sample multiple times, it's more efficient to upload an audio file.
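The following sketch shows one way to apply these rules: pass small files inline and fall back to the Files API for larger ones. The file path is a placeholder, and the size check is approximate because the 20 MB budget also covers the text prompt and any system instructions.

```python
import os

from google import genai
from google.genai import types

client = genai.Client()

AUDIO_PATH = "path/to/sample.mp3"        # placeholder path
INLINE_LIMIT_BYTES = 20 * 1024 * 1024    # 20 MB total-request budget (approximate check)

if os.path.getsize(AUDIO_PATH) < INLINE_LIMIT_BYTES:
    # Small file: send the raw bytes inline with the request.
    with open(AUDIO_PATH, "rb") as f:
        audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mp3")
else:
    # Large file (or a file reused across requests): upload it once via the Files API.
    audio_part = client.files.upload(file=AUDIO_PATH)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Describe this audio clip", audio_part],
)
print(response.text)
```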
## Get a transcript

To get a transcript of audio data, just ask for it in the prompt:
```python
myfile = client.files.upload(file='path/to/sample.mp3')
prompt = 'Generate a transcript of the speech.'

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[prompt, myfile]
)

print(response.text)
```
## Refer to timestamps

You can refer to specific sections of an audio file using timestamps of the form `MM:SS`. For example, the following prompt requests a transcript that:

* Starts at 2 minutes 30 seconds from the beginning of the file.
* Ends at 3 minutes 29 seconds from the beginning of the file.
```python
# Create a prompt containing timestamps.
prompt = "Provide a transcript of the speech from 02:30 to 03:29."
```
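You can then pass this prompt together with an audio file in the same way as the earlier transcript example; a minimal sketch (reusing the placeholder file path from above) looks like this:

```python
myfile = client.files.upload(file="path/to/sample.mp3")
prompt = "Provide a transcript of the speech from 02:30 to 03:29."

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[prompt, myfile],
)
print(response.text)
```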
## Count tokens

Call the `countTokens` method to get the number of tokens in an audio file. For example:
```python
response = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=[myfile]
)

print(response)
```
## Supported audio formats

Gemini supports the following audio format MIME types:

* WAV - `audio/wav`
* MP3 - `audio/mp3`
* AIFF - `audio/aiff`
* AAC - `audio/aac`
* OGG Vorbis - `audio/ogg`
* FLAC - `audio/flac`
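When passing audio inline with `types.Part.from_bytes`, you supply one of these MIME types yourself. The helper below is a hypothetical convenience (not part of the SDK) that maps common file extensions to the supported types:

```python
from pathlib import Path

# Map common file extensions to the supported MIME types listed above.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

def audio_mime_type(path: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    ext = Path(path).suffix.lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {ext}")
    return AUDIO_MIME_TYPES[ext]

print(audio_mime_type("path/to/sample.flac"))  # audio/flac
```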
## Technical details about audio

* Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens (see the sketch after this list).
* Gemini can "understand" non-speech components, such as birdsong or sirens.
* The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the _number_ of audio files in a single prompt; however, the total combined length of all audio files in a single prompt can't exceed 9.5 hours.
* Gemini downsamples audio files to a 16 Kbps data resolution.
* If the audio source contains multiple channels, Gemini combines those channels into a single channel.
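As a back-of-the-envelope check on prompt sizing, the 32-tokens-per-second figure translates into a simple estimate. The helper below is hypothetical; use `countTokens` for the authoritative number.

```python
def estimate_audio_tokens(duration_seconds: float) -> int:
    """Rough token estimate: Gemini represents each second of audio as 32 tokens."""
    return int(duration_seconds * 32)

print(estimate_audio_tokens(60))          # 1,920 tokens for one minute
print(estimate_audio_tokens(9.5 * 3600))  # 1,094,400 tokens for the 9.5-hour maximum
```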
## What's next

This guide shows how to generate text in response to audio data. To learn more, see the following resources:

* [File prompting strategies](/gemini-api/docs/files#prompt-guide): The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
* [System instructions](/gemini-api/docs/text-generation#system-instructions): System instructions let you steer the behavior of the model based on your specific needs and use cases.
* [Safety guidance](/gemini-api/docs/safety-guidance): Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.