OpenAI Whisper API Pricing 2026: Cost Per Minute, GPT-4o Transcribe and Cheaper Alternatives

Updated on: June 8, 2026 by Steven Jones

Originally Published on: February 15, 2026

OpenAI Whisper API pricing in 2026 is no longer a single “$0.006 per minute” answer. That rate still matters for whisper-1 and gpt-4o-transcribe, but OpenAI now also lists gpt-4o-mini-transcribe at about $0.003 per minute and gpt-realtime-whisper at about $0.017 per minute for live transcription.

That changes the buying decision. A podcast archive, meeting recorder, subtitle workflow, research library or SaaS transcription feature should not be priced from the headline rate alone. The real cost depends on batch versus realtime transcription, model choice, file duration, silence trimming, speaker diarisation, retries, transcript cleanup and any downstream summarisation.

This updated guide gives the price answer first, then shows how to model monthly costs, when to use each OpenAI transcription route, how OpenAI compares with Deepgram, AssemblyAI, Google and AWS, and where cheaper options make sense. The provider scores use DIY AI’s 2026 speech-to-text dataset, the same framework used in our best AI speech-to-text tools guide.

OpenAI Whisper API pricing: the quick answer

As of June 2026, the practical OpenAI transcription pricing range is about $0.003 to $0.006 per minute for standard file transcription, and about $0.017 per minute for gpt-realtime-whisper when you need live transcript deltas. Always verify the live rate on the official OpenAI API pricing page before putting numbers into a client quote, procurement document or production margin model.

OpenAI transcription option	Current price to model	Estimated hourly cost	Best for	Main trade-off
whisper-1	$0.006/minute	$0.36/hour	Existing Whisper integrations and simple file transcription	Older model path, not the lowest-cost OpenAI option
gpt-4o-mini-transcribe	$0.003/minute	$0.18/hour	Lowest-cost managed OpenAI transcription for clean batch audio	Needs testing on noisy calls, accents and technical vocabulary
gpt-4o-transcribe	$0.006/minute	$0.36/hour	Higher-quality managed transcription inside the OpenAI API	Twice the base cost of the mini transcription route
gpt-realtime-whisper	$0.017/minute	$1.02/hour	Live speech-to-text, captions and transcript deltas	Much more expensive than batch transcription
Self-hosted Whisper	Infrastructure-dependent	Infrastructure-dependent	High-volume teams with GPU capacity and engineering support	You own scaling, queues, monitoring, failure recovery and model updates

The important point is not just that mini is cheaper. It is that OpenAI now splits the decision into cheap batch transcription, higher-quality batch transcription, legacy Whisper compatibility and realtime transcription. Those are different workflows, not four names for the same job.

OpenAI Whisper cost calculator

Example calculator output at 1,000 audio minutes per month:

Option	Rate per minute	Example monthly cost
OpenAI API	$0.006	$6.00
Deepgram Nova-2	$0.0043	$4.30
Self-hosted (DIY)	Estimated GPU costs*	$0.50

Optional speaker diarisation adds an estimated $0.002 per minute via a third-party service.

*Self-hosted estimates based on $0.03/hr GPU processing speed.

Monthly transcription cost = uploaded audio minutes x model price per minute.

Audio volume	Cost at $0.003/minute	Cost at $0.006/minute	Cost at $0.017/minute	Practical meaning
100 minutes	$0.30	$0.60	$1.70	Small tests, demos and occasional uploads
500 minutes	$1.50	$3.00	$8.50	Small podcast archive or research notes
3,000 minutes	$9.00	$18.00	$51.00	Regular calls, interviews or internal meeting transcription
30,000 minutes	$90.00	$180.00	$510.00	SaaS feature or serious content workflow
300,000 minutes	$900.00	$1,800.00	$5,100.00	Enterprise-scale usage where routing and vendor choice matter

For a one-hour file, the quick mental maths is easy: $0.18 on gpt-4o-mini-transcribe, $0.36 on whisper-1 or gpt-4o-transcribe, and $1.02 if you model the same hour as gpt-realtime-whisper. That does not include storage, audio conversion, diarisation, retries, human review or later AI analysis.
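The formula and table above can be reproduced in a few lines. This is a minimal sketch: the rates are copied from this guide rather than fetched from OpenAI, and the names `RATES_PER_MINUTE` and `monthly_cost` are illustrative, so check the live pricing page before relying on the output.

```python
# Rates per audio minute, copied from the pricing table in this guide.
# They are assumptions for budgeting, not values fetched from OpenAI.
RATES_PER_MINUTE = {
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-realtime-whisper": 0.017,
}


def monthly_cost(minutes: float, model: str) -> float:
    """Base transcription cost: uploaded audio minutes x price per minute."""
    return round(minutes * RATES_PER_MINUTE[model], 2)


# Print one row per monthly volume, mirroring the table above.
for volume in (100, 500, 3_000, 30_000, 300_000):
    row = {model: monthly_cost(volume, model) for model in RATES_PER_MINUTE}
    print(volume, row)
```

The function covers base transcription only. Diarisation, retries and downstream processing need their own line items.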

Whisper-1 vs GPT-4o Transcribe vs GPT-Realtime-Whisper

A lot of people still search for “OpenAI Whisper pricing” because Whisper made OpenAI transcription popular. The API decision in 2026 is broader.

Model or route	Role in 2026	Use it when	Avoid it when
whisper-1	Legacy OpenAI Whisper model for speech recognition	You already have an integration built around whisper-1 and it performs well enough	You are building a new workflow and want the cheapest managed OpenAI option first
gpt-4o-mini-transcribe	Lower-cost OpenAI batch transcription model	Audio is clean, volume is high and small accuracy differences are acceptable	Audio is noisy, names and numbers matter, or you have costly correction workflows
gpt-4o-transcribe	Higher-quality managed OpenAI transcription route	Accuracy matters more than halving the base transcription bill	The audio is simple enough that mini passes your own tests
gpt-realtime-whisper	Realtime speech-to-text model for live transcript deltas	The user needs to see words appear during the session	You are uploading completed files and can wait for a batch transcript
Self-hosted Whisper	Open-source deployment route outside the managed OpenAI API	You have steady volume, GPU capacity and people who can maintain the system	You want low admin overhead or your transcription load is unpredictable

For a new build, start by testing gpt-4o-mini-transcribe and gpt-4o-transcribe against a small but representative audio set. Include good microphones, bad microphones, background noise, overlapping speakers, names, numbers, accents, long files and any specialist terms your users care about. Synthetic demo audio will not reveal where a production transcript fails.

Is Whisper-1 still available in 2026?

Yes. whisper-1 still appears in OpenAI’s model documentation as a general-purpose speech recognition model, and it is still relevant for existing Whisper integrations. The mistake is assuming whisper-1 is the only OpenAI transcription choice.

For existing systems, staying on whisper-1 can be sensible if the cost is predictable and output quality is known. For new systems, compare it against gpt-4o-mini-transcribe and gpt-4o-transcribe before committing. The lower-cost model may be good enough for clean bulk audio, while the higher-quality model may save money indirectly if it reduces manual correction.

OpenAI transcription pricing by monthly usage

Pricing becomes easier to reason about when you map it to real workloads instead of abstract minutes. These examples use base transcription costs only. They do not include downstream summarisation, storage, retries, diarisation, redaction or human review.

Scenario	Minutes per month	Best starting model	Estimated base cost	Why this route makes sense
Small podcast archive	500	gpt-4o-mini-transcribe	$1.50	Clean spoken audio usually makes the cheaper model worth testing first
Research calls	3,000	gpt-4o-mini-transcribe or gpt-4o-transcribe	$9 to $18	Use mini for clean calls and route difficult recordings upward
SaaS transcription feature	30,000	Hybrid mini plus higher-quality fallback	$90 to $180	At this volume, routing rules matter more than picking one model blindly
Live captioning product	30,000 live minutes	gpt-realtime-whisper or a specialist realtime provider	Around $510	Realtime output changes the cost model and the architecture
Enterprise call workflow	300,000	Compare OpenAI, Deepgram, AssemblyAI, Google, AWS and self-hosting	Depends on latency, governance and feature needs	Vendor approval, data controls, diarisation and review costs can outweigh the headline rate

The SaaS example is where many teams make the wrong call. They pick the cheapest transcription API, then run every transcript through expensive summarisation, entity extraction and customer-facing formatting. At scale, the second model call can cost more than transcription itself.

What the base price does not include

The per-minute transcription price is only the visible part of the bill. In a real workflow, several surrounding costs decide whether Whisper is actually cheap.

Speaker diarisation

Plain transcription tells you what was said. Diarisation tries to identify who spoke when. That difference matters for interviews, meeting notes, sales calls, support conversations and podcasts with multiple hosts.

Classic Whisper-style transcription is not the same as a polished meeting transcript with reliable “Speaker 1” and “Speaker 2” labels. You may need a newer speaker-aware OpenAI route, a separate diarisation model, a provider with built-in speaker labelling or post-processing logic. Price that before you compare vendors.

Audio preprocessing

Long video files often need audio extraction. Multi-channel calls may need downmixing. Quiet recordings may need normalisation. Some files need chunking before upload. None of that is difficult for a capable developer using FFmpeg and queues, but it is still engineering work.

Preprocessing is also where cost control can start. A 60-minute recording with 15 minutes of dead air should not always be sent as a 60-minute file. Trim silence where the workflow allows, but test carefully so you do not cut off quiet speech or speaker transitions.
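To see what trimming is worth before building it, you can estimate billable minutes from a list of detected silent spans. The sketch below assumes some other tool has already produced the `(start, end)` spans in seconds. The function name, the 10-second threshold and the half-second padding are illustrative choices, not provider rules.

```python
def billable_minutes(duration_s, silences, min_silence_s=10.0, keep_pad_s=0.5):
    """Estimate minutes left after removing long silent spans.

    silences: list of (start_s, end_s) spans from a silence detector.
    Only spans of at least min_silence_s are trimmed, and keep_pad_s is kept
    at both ends of each span so quiet speech edges are not clipped.
    """
    removed = 0.0
    for start, end in silences:
        length = end - start
        if length >= min_silence_s:
            removed += max(0.0, length - 2 * keep_pad_s)
    return (duration_s - removed) / 60.0


# A 60-minute recording with one 15-minute gap and one short 3-second pause.
minutes = billable_minutes(3600, [(600, 1500), (2000, 2003)])
print(round(minutes, 2))  # 45.02
```

The short pause is left alone on purpose. Trimming only long gaps, with padding, is the cautious default when speaker transitions matter.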

Retries and failed jobs

Retries can quietly double costs if your pipeline resubmits the same file after a timeout. Use job IDs, idempotency logic, transcript status fields and clear failure states. The cheap API call is not the problem. Duplicate jobs, partial records and poor retry handling are the problem.
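One way to apply that advice is to derive a job key from the audio content and the model, then refuse to resubmit a job that has already succeeded. The sketch below is an in-memory illustration only. `TranscriptionJobs` and `submit_fn` are hypothetical names, and a production system would keep job state in a database rather than a dictionary.

```python
import hashlib


class TranscriptionJobs:
    """In-memory sketch of idempotent job handling with a retry limit."""

    def __init__(self, submit_fn):
        self._submit = submit_fn  # hypothetical callable: (audio_bytes, model) -> transcript
        self._jobs = {}           # job key -> status, attempts, transcript

    @staticmethod
    def job_key(audio_bytes: bytes, model: str) -> str:
        # The same audio and model always map to the same key.
        return hashlib.sha256(model.encode() + b"\0" + audio_bytes).hexdigest()

    def transcribe(self, audio_bytes: bytes, model: str, max_attempts: int = 3) -> str:
        key = self.job_key(audio_bytes, model)
        job = self._jobs.setdefault(
            key, {"status": "pending", "attempts": 0, "transcript": None}
        )
        if job["status"] == "done":
            return job["transcript"]  # reuse the stored result, no second paid call
        while job["attempts"] < max_attempts:
            job["attempts"] += 1
            try:
                job["transcript"] = self._submit(audio_bytes, model)
                job["status"] = "done"
                return job["transcript"]
            except TimeoutError:
                job["status"] = "retrying"
        job["status"] = "failed"
        raise RuntimeError(f"job {key[:8]} failed after {max_attempts} attempts")
```

A timeout does not prove the provider skipped the work, so the retry limit caps the worst case rather than removing it. The explicit `failed` state stops later calls from quietly resubmitting the same file.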

Downstream AI calls

Most products do not stop at raw text. They generate summaries, chapters, action points, CRM notes, search indexes, redactions, topics or structured JSON. Once you pass the transcript into another AI model, transcription is just the first line item.
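A rough way to size that second line item is to convert audio minutes into tokens and multiply by the number of processing passes. Every input below is a placeholder assumption: the speaking rate, the tokens-per-word ratio, the number of passes and the $10 per million tokens price are illustrative, not quoted rates for any model.

```python
def downstream_cost_per_audio_minute(
    usd_per_million_tokens: float,
    passes: int = 1,
    words_per_minute: float = 150.0,
    tokens_per_word: float = 1.3,
) -> float:
    """Rough input-token cost of post-processing one minute of transcribed audio."""
    tokens = words_per_minute * tokens_per_word * passes
    return tokens * usd_per_million_tokens / 1_000_000


TRANSCRIPTION_RATE = 0.003  # gpt-4o-mini-transcribe rate used in this guide

# Placeholder scenario: three passes (summary, entities, formatting)
# at a hypothetical $10 per million input tokens.
extra = downstream_cost_per_audio_minute(usd_per_million_tokens=10.0, passes=3)
print(f"downstream ~${extra:.5f}/min vs transcription ${TRANSCRIPTION_RATE:.3f}/min")
print("downstream exceeds transcription:", extra > TRANSCRIPTION_RATE)
```

The sketch counts input tokens only, so it understates any workflow that also generates long outputs. Swap in your own model prices and pass counts before drawing conclusions.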

When $0.003 per minute is enough

gpt-4o-mini-transcribe should be the first model to test when volume matters and the audio is relatively clean. It is a strong fit for internal voice notes, clear podcast files, lecture recordings, simple dictation, content archives and non-critical searchable transcripts.

The key word is “test”. Do not move production traffic to the cheapest route because a pricing table looks attractive. Test the model against files that include real microphones, real room noise, real accents and real vocabulary. If errors are minor and the transcript is not being used for compliance-sensitive decisions, the $0.003 route can be the best value.

When to pay for the higher transcription model

Use gpt-4o-transcribe when mistakes are expensive. That includes customer-facing subtitles, legal or medical-adjacent notes, technical calls, investor interviews, research transcripts, product names, addresses, numbers and workflows where humans will trust the output without listening to the audio again.

A higher transcription bill can be cheaper than a correction workflow. If a transcript error causes a support agent to misread a customer complaint or a researcher to miss a key statement, the saving from mini disappears quickly. This is why the right test is not only word error rate. Track proper nouns, numbers, speaker turns, timestamps, acronyms and domain terms.
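The difference between word error rate and key-term accuracy is easy to show on a single sentence. The sketch below implements a plain word-level edit distance and a simple substring check for must-have terms. Both functions and the sample sentence are illustrative, and a real evaluation would also normalise punctuation and number formats.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def key_term_recall(hypothesis: str, key_terms: list[str]) -> float:
    """Share of must-have terms that appear verbatim in the transcript."""
    text = hypothesis.lower()
    found = sum(1 for term in key_terms if term.lower() in text)
    return found / len(key_terms) if key_terms else 1.0


reference = "please refund order 4417 for acme corp"
hypothesis = "please refund order 4470 for acme corp"
print(round(word_error_rate(reference, hypothesis), 3))   # 0.143
print(key_term_recall(hypothesis, ["4417", "Acme Corp"]))  # 0.5
```

One wrong word out of seven looks acceptable as a word error rate, yet half of the terms that matter to the business are wrong. That gap is why the key-term check belongs next to the headline metric.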

When live transcription changes the price comparison

Realtime transcription is not batch transcription with a faster response. It is a different interaction pattern. Live captions, voice agents, call assist tools and browser microphone workflows need transcript deltas while audio is still arriving. That is why gpt-realtime-whisper belongs in the pricing table rather than being treated as a footnote.

Use case	Better pricing comparison	What to test
Uploading a podcast file	gpt-4o-mini-transcribe vs gpt-4o-transcribe	Accuracy, punctuation, names, editing time and turnaround speed
Transcribing meetings after they finish	Batch file transcription plus diarisation options	Speaker labels, timestamps, summary cost and export format
Live captions	gpt-realtime-whisper vs specialist streaming providers	Latency, partial-text corrections, stability and cost per live hour
Voice agents	Realtime transcription plus speech generation and LLM costs	End-to-end latency, barge-in behaviour, turn detection and failure recovery
Live translation	Realtime translation pricing, not batch Whisper pricing	Language pairs, delay tolerance, accuracy and moderation needs

The practical rule is straightforward: use batch pricing when the user can wait; use realtime pricing when the product experience depends on words appearing during the session.

OpenAI Whisper pricing: pros and cons

Pros	Cons
Clear low-cost managed route with gpt-4o-mini-transcribe at about $0.003 per minute.	The cheapest model still needs testing on noisy, accented or domain-heavy recordings.
Strong accuracy profile in DIY AI’s 2026 speech-to-text dataset, with OpenAI Whisper scoring 9.2/10 overall.	Classic whisper-1 is not the newest path for every managed OpenAI transcription workflow.
Good fit when transcripts feed into OpenAI-based summarisation, extraction or classification.	Realtime transcription costs materially more than standard file transcription.
Self-hosted Whisper remains an option for teams that need control and have GPU capacity.	Self-hosting shifts cost into engineering, infrastructure, monitoring and maintenance.
Simple base pricing makes early forecasting easier than many feature-heavy speech platforms.	Diarisation, retries, preprocessing, redaction and downstream AI calls can change the total cost.

Cheaper alternatives to OpenAI Whisper API

The first cheaper alternative is still inside OpenAI: gpt-4o-mini-transcribe. It cuts the familiar $0.006-per-minute benchmark in half for standard transcription, which makes it the obvious first test for clean batch audio.

Outside OpenAI, cheaper does not always mean better. A lower headline rate can lose its advantage if you need extra diarisation, redaction, streaming, analytics or human correction. Use the following comparison as a shortlist, then verify live vendor pricing before migration.

Provider	DIY AI overall score	Star rating	Best fit	Pricing pattern to watch
OpenAI Whisper / OpenAI transcription	9.2/10	4.6/5	High-accuracy batch transcription and OpenAI ecosystem workflows	Very competitive batch pricing, higher cost for realtime
Deepgram	9.1/10	4.6/5	Realtime streaming and voice AI agents	Model, streaming and feature choices can change the real bill
AssemblyAI	9.0/10	4.5/5	Speech intelligence, summaries, chapters and structured audio insights	Audio intelligence add-ons can matter more than base transcription
Speechmatics	8.8/10	4.4/5	Global accents, non-native English and multilingual speech	Worth comparing where accent coverage is more important than the lowest rate
Google Gemini Flash STT / Google Speech-to-Text	8.6/10	4.3/5	Google Cloud teams and multimodal workflows	Cloud fit, governance and batch options can outweigh simple per-minute comparisons
AWS Transcribe	8.0/10	4.0/5	AWS-native data pipelines and contact-centre workflows	Regional pricing, add-ons and minimum charges need checking
Rev AI	7.9/10	4.0/5	Hybrid AI and human-reviewed transcription workflows	Useful where human verification matters more than raw API cost

Deepgram deserves early testing when the product is realtime or voice-agent-led. AssemblyAI is stronger when the transcript needs built-in intelligence rather than a plain text file. Speechmatics is a good shortlist option for accents. Google and AWS make sense when the organisation is already committed to those clouds and procurement is easier inside an existing vendor stack.

For a deeper look at live transcription trade-offs, read our Deepgram review. For the Google side of the comparison, see our Google Speech-to-Text review.

OpenAI Whisper vs Deepgram, AssemblyAI, Google and AWS

OpenAI is the best default when you want accurate batch transcripts and the text will feed into other OpenAI workflows. Deepgram is the stronger first test for low-latency streaming. AssemblyAI is often cleaner when you want transcript intelligence without building every post-processing step yourself. Google and AWS are rarely chosen on raw price alone; they are chosen because the wider cloud environment is already approved.

Decision factor	OpenAI transcription	Deepgram	AssemblyAI	Google / AWS
Batch file transcription	Strong default	Strong	Strong	Strong in cloud-native environments
Realtime streaming	Use gpt-realtime-whisper and test latency	Excellent fit	Good fit depending on workflow	Useful for cloud-integrated systems
Speech intelligence	Usually built with additional OpenAI processing	Available through platform features	Core strength	Depends on connected cloud services
Enterprise governance	Good, but check data and retention requirements	Good for speech-first teams	Good for product teams needing audio intelligence	Strongest when procurement already favours the cloud vendor
Lowest simple batch cost	gpt-4o-mini-transcribe is highly competitive	Competitive	Competitive	Can be competitive for dynamic batch or scale tiers, but not always simple

Do not let a pricing table make the whole decision. A transcript that arrives too slowly, misses speaker turns, mangles proper nouns or creates more editing work is not cheap in practice.

Self-hosted Whisper: when it is actually cheaper

Self-hosted Whisper is attractive because there is no per-minute OpenAI API bill. That does not make it free. You still pay for GPU time, queue workers, monitoring, storage, security, model updates, deployment work and developer time.

It can be cheaper when volume is high, demand is steady and the team already understands GPU infrastructure. It is less attractive for a small product with uneven usage, where managed API pricing keeps complexity low and lets developers focus on the product.

A realistic self-hosting estimate needs named assumptions: GPU type, processing speed, utilisation, file length, queue latency, storage costs, engineering time and failure handling. Avoid publishing a single “$0.50” style estimate unless the calculation is fully explained and reproducible.
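A reproducible estimate can be as small as one dataclass and one function, as long as every assumption has a name. The figures in the example are placeholders chosen to show the calculation, not measured benchmarks or quoted GPU prices, and `SelfHostAssumptions` is a hypothetical structure.

```python
from dataclasses import dataclass


@dataclass
class SelfHostAssumptions:
    gpu_usd_per_hour: float     # rented or amortised GPU cost
    realtime_factor: float      # audio hours processed per busy GPU hour
    utilisation: float          # share of paid GPU time doing useful work (0 to 1)
    fixed_usd_per_month: float  # storage, monitoring, engineering time and similar


def cost_per_audio_minute(a: SelfHostAssumptions, audio_minutes_per_month: float) -> float:
    """Total monthly self-hosting cost divided by audio minutes processed."""
    gpu_hours = audio_minutes_per_month / 60 / a.realtime_factor / a.utilisation
    total = gpu_hours * a.gpu_usd_per_hour + a.fixed_usd_per_month
    return total / audio_minutes_per_month


# Placeholder inputs only; replace each one with a measured value.
example = SelfHostAssumptions(
    gpu_usd_per_hour=1.00,
    realtime_factor=20.0,
    utilisation=0.5,
    fixed_usd_per_month=2000.0,
)
for minutes in (300_000, 3_000_000):
    print(minutes, round(cost_per_audio_minute(example, minutes), 4))
```

With these placeholder inputs the per-minute figure falls as volume rises, because the fixed monthly costs are spread over more audio. The break-even point against a managed API depends on your own numbers.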

How to reduce transcription costs without hurting accuracy

The best savings usually come from routing and preprocessing rather than switching vendors every time a price changes. A sensible workflow looks like this:

Trim long silence before upload where it is safe to do so.
Extract audio from video before transcription instead of pushing large video files through your pipeline.
Split long recordings on silence or sentence boundaries, not arbitrary timestamps.
Send clean, low-risk recordings to gpt-4o-mini-transcribe first.
Route noisy, high-value or compliance-sensitive audio to the higher-quality model.
Use job IDs and idempotency checks so retries do not create duplicate transcription costs.
Only run expensive summarisation, redaction or extraction on transcripts that need it.
Keep a small regression set of real audio files and retest when you change models.

That last point matters. Models change, aliases move and provider behaviour can shift. A simple 20-file evaluation set with known problem cases will catch more issues than a beautiful spreadsheet with no audio behind it.
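The routing steps in the workflow above can be sketched as a small rule plus a blended-cost estimate. The rates come from this guide. The three boolean flags and the function names are illustrative, and how you detect "noisy" or "high-value" audio is left to your own pipeline.

```python
MINI = "gpt-4o-mini-transcribe"
FULL = "gpt-4o-transcribe"
RATE = {MINI: 0.003, FULL: 0.006}  # per-minute rates used in this guide


def pick_model(noisy: bool, high_value: bool, compliance_sensitive: bool) -> str:
    """Send clean, low-risk audio to the cheaper model and everything else upward."""
    return FULL if (noisy or high_value or compliance_sensitive) else MINI


def blended_cost(minutes: float, share_to_full: float) -> float:
    """Monthly base cost when a share of minutes is routed to the higher-quality model."""
    per_minute = (1 - share_to_full) * RATE[MINI] + share_to_full * RATE[FULL]
    return round(minutes * per_minute, 2)


print(pick_model(noisy=False, high_value=False, compliance_sensitive=False))
print(pick_model(noisy=True, high_value=False, compliance_sensitive=False))
print(blended_cost(30_000, 0.2))  # 108.0
```

At 30,000 minutes a month, routing 20% of audio to the higher-quality model lands between the $90 and $180 bounds from the usage table. This is the arithmetic behind treating routing as a cost lever.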

Whisper accuracy and multilingual performance: what you are paying for

OpenAI Whisper remains the top-ranked provider in DIY AI’s 2026 speech-to-text dataset, with a 9.2/10 overall score. The category score is not based on price alone. It reflects accuracy, speed, speaker detection, punctuation, diarisation, noise robustness, export formats, cost efficiency and realtime streaming.

Metric	OpenAI Whisper score	Practical meaning
Accuracy	9.6/10	Strong results on clear speech, interviews, podcasts and many real-world recordings
Noise robustness	9.4/10	Useful when microphone quality, room noise or recording conditions are imperfect
Speaker detection	9.4/10	Strong within the wider OpenAI transcription stack, but still test multi-speaker files
Punctuation	9.2/10	More readable long-form transcripts with less cleanup
Diarisation	9.2/10	Good score, but model and endpoint support must be checked before launch
Cost efficiency	8.8/10	Strong value for batch transcription, especially with the mini model route
Realtime streaming	8.8/10	Better now with realtime options, though dedicated speech platforms remain strong

Accuracy is not evenly distributed across every language, accent and recording condition. High-resource languages usually perform better. Low-resource languages, code-switching, noisy multi-speaker rooms and domain-specific vocabulary need direct testing. For a deeper quality discussion, see our OpenAI Whisper review.

Does ChatGPT Plus include OpenAI Whisper API usage?

No. ChatGPT subscriptions and OpenAI API usage are billed separately. ChatGPT Plus may include voice or file features inside the ChatGPT product, but that is not the same as using the API to transcribe files inside your own app, automation or SaaS workflow.

This is a common source of pricing confusion. A user can pay for ChatGPT Plus and still need separate API billing for programmatic transcription. Treat the API as its own usage-based bill.

OpenAI Whisper pricing FAQs

How much does OpenAI Whisper cost per minute in 2026?

The familiar OpenAI Whisper pricing figure is $0.006 per minute, equal to $0.36 per audio hour. In 2026, you should also compare gpt-4o-mini-transcribe at about $0.003 per minute, gpt-4o-transcribe at about $0.006 per minute and gpt-realtime-whisper at about $0.017 per minute for live transcription.

Is OpenAI Whisper API free?

No, the managed OpenAI API is paid. The open-source Whisper model can be run without OpenAI API fees, but self-hosting still has compute, storage, monitoring and engineering costs. Any trial credits should be treated as temporary evaluation budget, not a long-term free tier.

Is gpt-4o-mini-transcribe cheaper than whisper-1?

Yes, based on the current pricing structure, gpt-4o-mini-transcribe is about $0.003 per minute, while whisper-1 is commonly modelled at $0.006 per minute. Test quality before switching production traffic.

What is the latest OpenAI Whisper model in 2026?

For OpenAI’s managed API, the better 2026 framing is not just “latest Whisper”. Compare whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe and gpt-realtime-whisper. For open-source deployments, teams often compare Whisper variants and optimised implementations separately from the managed OpenAI API.

Does OpenAI bill per second or per minute for Whisper?

For budget planning, model transcription by the duration of audio you upload. If a file is 30 minutes long, forecast against 30 minutes unless your pipeline trims silence or splits the file before upload. Always check the current provider billing rules before launch.

Does OpenAI Whisper charge for silence?

Plan as though uploaded audio duration matters, not only spoken-word duration. If your files contain long gaps, silence trimming can reduce cost, but test carefully so you do not remove quiet speech, pauses that matter, or speaker transitions.

Do retries cost extra?

They can. If your application resubmits the same audio after a timeout or failed status check, you may create duplicate work. Use clear job states, idempotency logic and retry limits.

Does the OpenAI Whisper price include summarisation?

No. Transcription pricing covers the speech-to-text step. Summaries, chapters, action points, CRM notes, redaction, classification and structured extraction usually require separate processing.

Is Deepgram cheaper than OpenAI Whisper?

It depends on the model, streaming mode and features you compare. OpenAI’s gpt-4o-mini-transcribe is very competitive for low-cost batch transcription. Deepgram is often more compelling when realtime streaming and voice-agent latency are the main requirements.

Should I use OpenAI Whisper or self-host Whisper?

Use the managed API when you want low operational overhead, predictable integration and faster shipping. Consider self-hosting when volume is high, privacy or control requirements justify it, and your team can maintain GPU infrastructure.

Verdict: which transcription route should you choose?

OpenAI Whisper pricing is still attractive, but the best answer in 2026 is no longer “$0.006 per minute”. The better decision is to split your workflow into cheap batch, higher-quality batch, live realtime and self-hosted cases.

Use gpt-4o-mini-transcribe first for clean, high-volume batch audio. Use gpt-4o-transcribe when transcript quality matters more than shaving fractions of a cent per minute. Keep whisper-1 where existing integrations are stable and there is no strong reason to migrate. Use gpt-realtime-whisper only when live transcript deltas matter to the product experience. Compare Deepgram early for voice agents and realtime speech products. Consider self-hosted Whisper only when the operational burden is justified by scale or control.

The takeaway is simple: price the full workflow. Audio minutes are only the starting point. Diarisation, preprocessing, failed jobs, transcript cleanup, downstream AI calls and human correction decide whether the cheapest line in the table is actually the cheapest route.

Writer: Steven Jones

AI Tools Reviewer and Technical Analyst

Steven Jones is a technology analyst specialising in artificial intelligence, machine learning workflows, and emerging automation tools. At DIY AI, he focuses on clear, practical guidance for people comparing AI tools in the real world. His work covers text generation, image generation, video tools, data platforms, developer-focused AI products, and the automation workflows that connect them. Steven's reviews are built around hands-on testing, practical benchmarks, and transparent scoring rather than vendor claims. He looks closely at where each tool performs well, where it falls short, and what those trade-offs mean for creators, teams, and businesses trying to make sensible AI adoption decisions. He has a particular interest in safety, reliability, output quality, performance metrics, and dataset quality. When he is not reviewing the latest AI model updates, he experiments with prompt engineering techniques and contributes to DIY AI's ongoing work on fair, explainable scoring frameworks for AI tools.
