OpenAI Whisper API Pricing 2026: Cost Per Minute, GPT-4o Transcribe and Cheaper Alternatives
OpenAI Whisper API pricing in 2026 is no longer a single “$0.006 per minute” answer. That rate still matters for whisper-1 and gpt-4o-transcribe, but OpenAI now also lists gpt-4o-mini-transcribe at about $0.003 per minute and gpt-realtime-whisper at about $0.017 per minute for live transcription.
That changes the buying decision. A podcast archive, meeting recorder, subtitle workflow, research library or SaaS transcription feature should not be priced from the headline rate alone. The real cost depends on batch versus realtime transcription, model choice, file duration, silence trimming, speaker diarisation, retries, transcript cleanup and any downstream summarisation.
This updated guide gives the price answer first, then shows how to model monthly costs, when to use each OpenAI transcription route, how OpenAI compares with Deepgram, AssemblyAI, Google and AWS, and where cheaper options make sense. The provider scores use DIY AI’s 2026 speech-to-text dataset, the same framework used in our best AI speech-to-text tools guide.
OpenAI Whisper API pricing: the quick answer
As of June 2026, the practical OpenAI transcription pricing range is about $0.003 to $0.006 per minute for standard file transcription, and about $0.017 per minute for gpt-realtime-whisper when you need live transcript deltas. Always verify the live rate on the official OpenAI API pricing page before putting numbers into a client quote, procurement document or production margin model.
| OpenAI transcription option | Current price to model | Estimated hourly cost | Best for | Main trade-off |
|---|---|---|---|---|
| whisper-1 | $0.006/minute | $0.36/hour | Existing Whisper integrations and simple file transcription | Older model path, not the lowest-cost OpenAI option |
| gpt-4o-mini-transcribe | $0.003/minute | $0.18/hour | Lowest-cost managed OpenAI transcription for clean batch audio | Needs testing on noisy calls, accents and technical vocabulary |
| gpt-4o-transcribe | $0.006/minute | $0.36/hour | Higher-quality managed transcription inside the OpenAI API | Twice the base cost of the mini transcription route |
| gpt-realtime-whisper | $0.017/minute | $1.02/hour | Live speech-to-text, captions and transcript deltas | Much more expensive than batch transcription |
| Self-hosted Whisper | Infrastructure-dependent | Infrastructure-dependent | High-volume teams with GPU capacity and engineering support | You own scaling, queues, monitoring, failure recovery and model updates |
The important point is not just that mini is cheaper. It is that OpenAI now splits the decision into cheap batch transcription, higher-quality batch transcription, legacy Whisper compatibility and realtime transcription. Those are different workflows, not four names for the same job.
OpenAI Whisper cost calculator
Monthly transcription cost = uploaded audio minutes × model price per minute.
| Audio volume | Cost at $0.003/minute | Cost at $0.006/minute | Cost at $0.017/minute | Practical meaning |
|---|---|---|---|---|
| 100 minutes | $0.30 | $0.60 | $1.70 | Small tests, demos and occasional uploads |
| 500 minutes | $1.50 | $3.00 | $8.50 | Small podcast archive or research notes |
| 3,000 minutes | $9.00 | $18.00 | $51.00 | Regular calls, interviews or internal meeting transcription |
| 30,000 minutes | $90.00 | $180.00 | $510.00 | SaaS feature or serious content workflow |
| 300,000 minutes | $900.00 | $1,800.00 | $5,100.00 | Enterprise-scale usage where routing and vendor choice matter |
For a one-hour file, the quick mental maths is easy: $0.18 on gpt-4o-mini-transcribe, $0.36 on whisper-1 or gpt-4o-transcribe, and $1.02 if you model the same hour as gpt-realtime-whisper. That does not include storage, audio conversion, diarisation, retries, human review or later AI analysis.
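The same arithmetic can be scripted so the table is easy to regenerate when rates change. This is a minimal sketch: the rates are copied from the pricing table above and should be checked against the live pricing page, and the names `PRICE_PER_MINUTE` and `monthly_cost` are illustrative, not part of any OpenAI SDK.

```python
# Per-minute rates in USD, copied from the pricing table in this guide.
# They are assumptions for modelling; verify them before quoting.
PRICE_PER_MINUTE = {
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-realtime-whisper": 0.017,
}


def monthly_cost(minutes: float, model: str) -> float:
    """Base transcription cost: uploaded audio minutes x price per minute."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * PRICE_PER_MINUTE[model], 2)


# Print the 30,000-minute row of the table for every model.
for model in PRICE_PER_MINUTE:
    print(f"{model}: ${monthly_cost(30_000, model):,.2f}")
```

The function covers base transcription only, so it inherits the same exclusions as the table: storage, conversion, diarisation, retries, review and later AI analysis are not in the result.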
Whisper-1 vs GPT-4o Transcribe vs GPT-Realtime-Whisper
A lot of people still search for “OpenAI Whisper pricing” because Whisper made OpenAI transcription popular. The API decision in 2026 is broader.
| Model or route | Role in 2026 | Use it when | Avoid it when |
|---|---|---|---|
| whisper-1 | Legacy OpenAI Whisper model for speech recognition | You already have an integration built around whisper-1 and it performs well enough | You are building a new workflow and want the cheapest managed OpenAI option first |
| gpt-4o-mini-transcribe | Lower-cost OpenAI batch transcription model | Audio is clean, volume is high and small accuracy differences are acceptable | Audio is noisy, names and numbers matter, or you have costly correction workflows |
| gpt-4o-transcribe | Higher-quality managed OpenAI transcription route | Accuracy matters more than halving the base transcription bill | The audio is simple enough that mini passes your own tests |
| gpt-realtime-whisper | Realtime speech-to-text model for live transcript deltas | The user needs to see words appear during the session | You are uploading completed files and can wait for a batch transcript |
| Self-hosted Whisper | Open-source deployment route outside the managed OpenAI API | You have steady volume, GPU capacity and people who can maintain the system | You want low admin overhead or your transcription load is unpredictable |
For a new build, start by testing gpt-4o-mini-transcribe and gpt-4o-transcribe against a small but representative audio set. Include good microphones, bad microphones, background noise, overlapping speakers, names, numbers, accents, long files and any specialist terms your users care about. Synthetic demo audio will not reveal where a production transcript fails.
Is Whisper-1 still available in 2026?
Yes. whisper-1 still appears in OpenAI’s model documentation as a general-purpose speech recognition model, and it is still relevant for existing Whisper integrations. The mistake is assuming whisper-1 is the only OpenAI transcription choice.
For existing systems, staying on whisper-1 can be sensible if the cost is predictable and output quality is known. For new systems, compare it against gpt-4o-mini-transcribe and gpt-4o-transcribe before committing. The lower-cost model may be good enough for clean bulk audio, while the higher-quality model may save money indirectly if it reduces manual correction.
OpenAI transcription pricing by monthly usage
Pricing becomes easier to reason about when you map it to real workloads instead of abstract minutes. These examples use base transcription costs only. They do not include downstream summarisation, storage, retries, diarisation, redaction or human review.
| Scenario | Minutes per month | Best starting model | Estimated base cost | Why this route makes sense |
|---|---|---|---|---|
| Small podcast archive | 500 | gpt-4o-mini-transcribe | $1.50 | Clean spoken audio usually makes the cheaper model worth testing first |
| Research calls | 3,000 | gpt-4o-mini-transcribe or gpt-4o-transcribe | $9 to $18 | Use mini for clean calls and route difficult recordings upward |
| SaaS transcription feature | 30,000 | Hybrid mini plus higher-quality fallback | $90 to $180 | At this volume, routing rules matter more than picking one model blindly |
| Live captioning product | 30,000 live minutes | gpt-realtime-whisper or a specialist realtime provider | Around $510 | Realtime output changes the cost model and the architecture |
| Enterprise call workflow | 300,000 | Compare OpenAI, Deepgram, AssemblyAI, Google, AWS and self-hosting | Depends on latency, governance and feature needs | Vendor approval, data controls, diarisation and review costs can outweigh the headline rate |
The SaaS example is where many teams make the wrong call. They pick the cheapest transcription API, then run every transcript through expensive summarisation, entity extraction and customer-facing formatting. At scale, the second model call can cost more than transcription itself.
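The $90 to $180 range in the SaaS row depends on how much audio gets routed to the higher-quality model. A rough sketch of that blended cost follows, assuming the two batch rates from this guide; the function name and the 20% share in the usage note are hypothetical.

```python
def blended_cost(
    minutes: float,
    share_upgraded: float,
    cheap_rate: float = 0.003,    # gpt-4o-mini-transcribe rate used in this guide
    premium_rate: float = 0.006,  # gpt-4o-transcribe rate used in this guide
) -> float:
    """Monthly base cost when a fraction of audio is routed to the premium model."""
    if not 0.0 <= share_upgraded <= 1.0:
        raise ValueError("share_upgraded must be between 0 and 1")
    upgraded = minutes * share_upgraded
    standard = minutes - upgraded
    return round(standard * cheap_rate + upgraded * premium_rate, 2)
```

For example, `blended_cost(30_000, 0.2)` models 30,000 minutes with one fifth of the audio sent to the higher-quality route. The two ends of the range correspond to a share of 0 and a share of 1.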
What the base price does not include
The per-minute transcription price is only the visible part of the bill. In a real workflow, several surrounding costs decide whether Whisper is actually cheap.
Speaker diarisation
Plain transcription tells you what was said. Diarisation tries to identify who spoke when. That difference matters for interviews, meeting notes, sales calls, support conversations and podcasts with multiple hosts.
Classic Whisper-style transcription is not the same as a polished meeting transcript with reliable “Speaker 1” and “Speaker 2” labels. You may need a newer speaker-aware OpenAI route, a separate diarisation model, a provider with built-in speaker labelling or post-processing logic. Price that before you compare vendors.
Audio preprocessing
Long video files often need audio extraction. Multi-channel calls may need downmixing. Quiet recordings may need normalisation. Some files need chunking before upload. None of that is difficult for a capable developer using FFmpeg and queues, but it is still engineering work.
Preprocessing is also where cost control can start. A 60-minute recording with 15 minutes of dead air should not always be sent as a 60-minute file. Trim silence where the workflow allows, but test carefully so you do not cut off quiet speech or speaker transitions.
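To put a number on the 60-minute example, the sketch below assumes billing follows uploaded audio duration, which is the planning assumption used in this guide rather than a confirmed billing rule. The function name is illustrative.

```python
def trimming_saving(total_minutes: float, silent_minutes: float, rate_per_minute: float) -> dict:
    """Compare the cost of uploading a file as-is against a silence-trimmed copy."""
    if not 0 <= silent_minutes <= total_minutes:
        raise ValueError("silent_minutes must be between 0 and total_minutes")
    billable = total_minutes - silent_minutes
    return {
        "billable_minutes": billable,
        "cost_untrimmed": round(total_minutes * rate_per_minute, 4),
        "cost_trimmed": round(billable * rate_per_minute, 4),
        "saving": round(silent_minutes * rate_per_minute, 4),
    }


# The example from the text: 60 minutes with 15 minutes of dead air at $0.006/minute.
print(trimming_saving(60, 15, 0.006))
```

The saving per file is small, so trimming mainly pays off at volume, and only if the trimmed audio still transcribes as well as the original.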
Retries and failed jobs
Retries can quietly double costs if your pipeline resubmits the same file after a timeout. Use job IDs, idempotency logic, transcript status fields and clear failure states. The cheap API call is not the problem. Duplicate jobs, partial records and poor retry handling are the problem.
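One way to apply job IDs and idempotency is to key each job on a hash of the audio bytes and return the stored transcript when the same file is submitted again. The sketch below is an in-memory illustration under that assumption: `TranscriptionJobs` is a hypothetical class, and `transcribe` stands in for whatever function actually calls your provider.

```python
import hashlib
from typing import Callable, Dict, Optional


class TranscriptionJobs:
    """Caches transcripts by content hash so a resubmitted file is not sent twice."""

    def __init__(self, transcribe: Callable[[bytes], str], max_attempts: int = 3):
        self._transcribe = transcribe      # injected provider call (assumption)
        self._max_attempts = max_attempts  # hard retry limit per job
        self._results: Dict[str, str] = {}
        self.attempts = 0                  # provider calls actually made

    @staticmethod
    def job_id(audio: bytes) -> str:
        # Same bytes always map to the same job ID.
        return hashlib.sha256(audio).hexdigest()

    def run(self, audio: bytes) -> str:
        key = self.job_id(audio)
        if key in self._results:
            return self._results[key]      # duplicate submission: no new call
        last_error: Optional[Exception] = None
        for _ in range(self._max_attempts):
            self.attempts += 1
            try:
                text = self._transcribe(audio)
            except TimeoutError as exc:
                last_error = exc
                continue
            self._results[key] = text
            return text
        raise RuntimeError(f"job {key[:8]} failed after {self._max_attempts} attempts") from last_error
```

A production version would keep job state in a database rather than a dictionary, and whether a timed-out attempt is billed depends on the provider, so `attempts` is an upper bound on billable calls rather than an exact count.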
Downstream AI calls
Most products do not stop at raw text. They generate summaries, chapters, action points, CRM notes, search indexes, redactions, topics or structured JSON. Once you pass the transcript into another AI model, transcription is just the first line item.
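The surrounding costs in this section can be folded into one per-minute model. In the sketch below only the transcription rate comes from this guide; the downstream, review and retry figures are hypothetical inputs that you would need to measure in your own pipeline, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRates:
    """Rates in USD per audio minute; everything except `transcription` is a
    placeholder to be replaced with your own measurements."""
    transcription: float
    downstream_ai: float = 0.0   # summaries, extraction, redaction (hypothetical)
    human_review: float = 0.0    # correction time converted to a rate (hypothetical)
    retry_fraction: float = 0.0  # share of minutes resubmitted after failures


def monthly_breakdown(minutes: float, rates: WorkflowRates) -> dict:
    """Split a monthly bill into transcription, downstream AI and review."""
    transcription = minutes * (1 + rates.retry_fraction) * rates.transcription
    downstream = minutes * rates.downstream_ai
    review = minutes * rates.human_review
    return {
        "transcription": round(transcription, 2),
        "downstream_ai": round(downstream, 2),
        "human_review": round(review, 2),
        "total": round(transcription + downstream + review, 2),
    }
```

Running it with a downstream rate above the transcription rate shows how the second model call, not the speech-to-text step, can become the largest line on the bill.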
When $0.003 per minute is enough
gpt-4o-mini-transcribe should be the first model to test when volume matters and the audio is relatively clean. It is a strong fit for internal voice notes, clear podcast files, lecture recordings, simple dictation, content archives and non-critical searchable transcripts.
The key word is “test”. Do not move production traffic to the cheapest route because a pricing table looks attractive. Test the model against files that include real microphones, real room noise, real accents and real vocabulary. If errors are minor and the transcript is not being used for compliance-sensitive decisions, the $0.003 route can be the best value.
When to pay for the higher transcription model
Use gpt-4o-transcribe when mistakes are expensive. That includes customer-facing subtitles, legal or medical-adjacent notes, technical calls, investor interviews, research transcripts, any audio where product names, addresses and numbers must be exact, and workflows where humans will trust the output without listening to the audio again.
A higher transcription bill can be cheaper than a correction workflow. If a transcript error causes a support agent to misread a customer complaint or a researcher to miss a key statement, the saving from mini disappears quickly. This is why the right test is not only word error rate. Track proper nouns, numbers, speaker turns, timestamps, acronyms and domain terms.
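A minimal evaluation sketch along those lines: word error rate as word-level edit distance divided by reference length, plus a simple check for must-have terms such as names and numbers. The normalisation here (lowercasing and whitespace splitting) is a simplifying assumption, and both function names are illustrative.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def missed_terms(required_terms: List[str], hypothesis: str) -> List[str]:
    """Return required names, numbers or acronyms absent from the transcript."""
    text = hypothesis.lower()
    return [term for term in required_terms if term.lower() not in text]
```

A transcript can score a low word error rate and still get the one number that matters wrong, which is why the term check runs alongside the aggregate metric.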
When live transcription changes the price comparison
Realtime transcription is not batch transcription with a faster response. It is a different interaction pattern. Live captions, voice agents, call assist tools and browser microphone workflows need transcript deltas while audio is still arriving. That is why gpt-realtime-whisper belongs in the pricing table rather than being treated as a footnote.
| Use case | Better pricing comparison | What to test |
|---|---|---|
| Uploading a podcast file | gpt-4o-mini-transcribe vs gpt-4o-transcribe | Accuracy, punctuation, names, editing time and turnaround speed |
| Transcribing meetings after they finish | Batch file transcription plus diarisation options | Speaker labels, timestamps, summary cost and export format |
| Live captions | gpt-realtime-whisper vs specialist streaming providers | Latency, partial-text corrections, stability and cost per live hour |
| Voice agents | Realtime transcription plus speech generation and LLM costs | End-to-end latency, barge-in behaviour, turn detection and failure recovery |
| Live translation | Realtime translation pricing, not batch Whisper pricing | Language pairs, delay tolerance, accuracy and moderation needs |
The practical rule is straightforward: use batch pricing when the user can wait; use realtime pricing when the product experience depends on words appearing during the session.
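Combined with the batch guidance in the previous two sections, that rule can be sketched as a small routing function. The flag names are hypothetical labels your own pipeline would have to supply; the model names are the ones used throughout this guide.

```python
def choose_model(
    live: bool,
    noisy: bool = False,
    high_value: bool = False,
    compliance_sensitive: bool = False,
) -> str:
    """Pick a starting transcription route from coarse workload flags."""
    if live:
        # Words must appear during the session, so batch pricing does not apply.
        return "gpt-realtime-whisper"
    if noisy or high_value or compliance_sensitive:
        # Mistakes are expensive: pay for the higher-quality batch route.
        return "gpt-4o-transcribe"
    # Clean, low-risk batch audio starts on the cheapest managed route.
    return "gpt-4o-mini-transcribe"
```

This picks a starting point only. The final choice still depends on how each model performs on your own test files.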
OpenAI Whisper pricing: pros and cons
| Pros | Cons |
|---|---|
| Clear low-cost managed route with gpt-4o-mini-transcribe at about $0.003 per minute. | The cheapest model still needs testing on noisy, accented or domain-heavy recordings. |
| Strong accuracy profile in DIY AI’s 2026 speech-to-text dataset, with OpenAI Whisper scoring 9.2/10 overall. | Classic whisper-1 is not the newest path for every managed OpenAI transcription workflow. |
| Good fit when transcripts feed into OpenAI-based summarisation, extraction or classification. | Realtime transcription costs materially more than standard file transcription. |
| Self-hosted Whisper remains an option for teams that need control and have GPU capacity. | Self-hosting shifts cost into engineering, infrastructure, monitoring and maintenance. |
| Simple base pricing makes early forecasting easier than many feature-heavy speech platforms. | Diarisation, retries, preprocessing, redaction and downstream AI calls can change the total cost. |
Cheaper alternatives to OpenAI Whisper API
The first cheaper alternative is still inside OpenAI: gpt-4o-mini-transcribe. It cuts the familiar $0.006-per-minute benchmark in half for standard transcription, which makes it the obvious first test for clean batch audio.
Outside OpenAI, cheaper does not always mean better. A lower per-hour rate can lose its advantage if you need extra diarisation, redaction, streaming, analytics or human correction. Use the following comparison as a shortlist, then verify live vendor pricing before migration.
| Provider | DIY AI overall score | Star rating | Best fit | Pricing pattern to watch |
|---|---|---|---|---|
| OpenAI Whisper / OpenAI transcription | 9.2/10 | 4.6/5 | High-accuracy batch transcription and OpenAI ecosystem workflows | Very competitive batch pricing, higher cost for realtime |
| Deepgram | 9.1/10 | 4.6/5 | Realtime streaming and voice AI agents | Model, streaming and feature choices can change the real bill |
| AssemblyAI | 9.0/10 | 4.5/5 | Speech intelligence, summaries, chapters and structured audio insights | Audio intelligence add-ons can matter more than base transcription |
| Speechmatics | 8.8/10 | 4.4/5 | Global accents, non-native English and multilingual speech | Worth comparing where accent coverage is more important than the lowest rate |
| Google Gemini Flash STT / Google Speech-to-Text | 8.6/10 | 4.3/5 | Google Cloud teams and multimodal workflows | Cloud fit, governance and batch options can outweigh simple per-minute comparisons |
| AWS Transcribe | 8.0/10 | 4.0/5 | AWS-native data pipelines and contact-centre workflows | Regional pricing, add-ons and minimum charges need checking |
| Rev AI | 7.9/10 | 4.0/5 | Hybrid AI and human-reviewed transcription workflows | Useful where human verification matters more than raw API cost |
Deepgram deserves early testing when the product is realtime or voice-agent-led. AssemblyAI is stronger when the transcript needs built-in intelligence rather than a plain text file. Speechmatics is a good shortlist option for accents. Google and AWS make sense when the organisation is already committed to those clouds and procurement is easier inside an existing vendor stack.
For a deeper look at live transcription trade-offs, read our Deepgram review. For the Google side of the comparison, see our Google Speech-to-Text review.
OpenAI Whisper vs Deepgram, AssemblyAI, Google and AWS
OpenAI is the best default when you want accurate batch transcripts and the text will feed into other OpenAI workflows. Deepgram is the stronger first test for low-latency streaming. AssemblyAI is often cleaner when you want transcript intelligence without building every post-processing step yourself. Google and AWS are rarely chosen on raw price alone; they are chosen because the wider cloud environment is already approved.
| Decision factor | OpenAI transcription | Deepgram | AssemblyAI | Google / AWS |
|---|---|---|---|---|
| Batch file transcription | Strong default | Strong | Strong | Strong in cloud-native environments |
| Realtime streaming | Use gpt-realtime-whisper and test latency | Excellent fit | Good fit depending on workflow | Useful for cloud-integrated systems |
| Speech intelligence | Usually built with additional OpenAI processing | Available through platform features | Core strength | Depends on connected cloud services |
| Enterprise governance | Good, but check data and retention requirements | Good for speech-first teams | Good for product teams needing audio intelligence | Strongest when procurement already favours the cloud vendor |
| Lowest simple batch cost | gpt-4o-mini-transcribe is highly competitive | Competitive | Competitive | Can be competitive for dynamic batch or scale tiers, but not always simple |
Do not let a pricing table make the whole decision. A transcript that arrives too slowly, misses speaker turns, mangles proper nouns or creates more editing work is not cheap in practice.
Self-hosted Whisper: when it is actually cheaper
Self-hosted Whisper is attractive because there is no per-minute OpenAI API bill. That does not make it free. You still pay for GPU time, queue workers, monitoring, storage, security, model updates, deployment work and developer time.
It can be cheaper when volume is high, demand is steady and the team already understands GPU infrastructure. It is less attractive for a small product with uneven usage, where managed API pricing keeps complexity low and lets developers focus on the product.
A realistic self-hosting estimate needs named assumptions: GPU type, processing speed, utilisation, file length, queue latency, storage costs, engineering time and failure handling. Avoid publishing a single “$0.50” style estimate unless the calculation is fully explained and reproducible.
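One way to keep those assumptions explicit is to make each one a named parameter. The sketch below is a simplified model, not a benchmark: every input is a placeholder to replace with measured values, and it ignores storage and queue latency.

```python
def self_hosted_cost_per_minute(
    gpu_hourly_usd: float,            # what you pay per GPU hour (assumption)
    speed_factor: float,              # audio minutes processed per wall-clock minute
    utilisation: float,               # share of paid GPU time spent transcribing
    monthly_fixed_usd: float = 0.0,   # engineering, monitoring and similar overhead
    monthly_audio_minutes: float = 0.0,
) -> float:
    """Estimated USD per audio minute for a self-hosted deployment."""
    if speed_factor <= 0 or not 0 < utilisation <= 1:
        raise ValueError("speed_factor must be > 0 and utilisation in (0, 1]")
    audio_minutes_per_gpu_hour = 60 * speed_factor * utilisation
    gpu_cost = gpu_hourly_usd / audio_minutes_per_gpu_hour
    fixed_cost = (
        monthly_fixed_usd / monthly_audio_minutes if monthly_audio_minutes > 0 else 0.0
    )
    return gpu_cost + fixed_cost
```

Comparing the result against the managed per-minute rates shows how sensitive the break-even point is to utilisation and fixed overhead. A GPU that sits idle half the time doubles the compute cost per audio minute.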
How to reduce transcription costs without hurting accuracy
The best savings usually come from routing and preprocessing rather than switching vendors every time a price changes. A sensible workflow looks like this:
- Trim long silence before upload where it is safe to do so.
- Extract audio from video before transcription instead of pushing large video files through your pipeline.
- Split long recordings on silence or sentence boundaries, not arbitrary timestamps.
- Send clean, low-risk recordings to gpt-4o-mini-transcribe first.
- Route noisy, high-value or compliance-sensitive audio to the higher-quality model.
- Use job IDs and idempotency checks so retries do not create duplicate transcription costs.
- Only run expensive summarisation, redaction or extraction on transcripts that need it.
- Keep a small regression set of real audio files and retest when you change models.
That last point matters. Models change, aliases move and provider behaviour can shift. A simple 20-file evaluation set with known problem cases will catch more issues than a beautiful spreadsheet with no audio behind it.
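A regression check over such a set can be very small. The sketch below assumes you already have a per-file error score (for example a word error rate) for the current model and for the candidate; the function name, file names and tolerance are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """List files whose error score got worse by more than `tolerance`.

    A file missing from the candidate results counts as a regression,
    because it means the new route produced nothing to compare.
    """
    return sorted(
        name
        for name, base_score in baseline.items()
        if candidate.get(name, float("inf")) > base_score + tolerance
    )
```

Run it whenever a model name, alias or provider changes, and review every flagged file by ear before moving production traffic.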
Whisper accuracy and multilingual performance: what you are paying for
OpenAI Whisper remains the top-ranked provider in DIY AI’s 2026 speech-to-text dataset, with a 9.2/10 overall score. The category score is not based on price alone. It reflects accuracy, speed, speaker detection, punctuation, diarisation, noise robustness, export formats, cost efficiency and realtime streaming.
| Metric | OpenAI Whisper score | Practical meaning |
|---|---|---|
| Accuracy | 9.6/10 | Strong results on clear speech, interviews, podcasts and many real-world recordings |
| Noise robustness | 9.4/10 | Useful when microphone quality, room noise or recording conditions are imperfect |
| Speaker detection | 9.4/10 | Strong within the wider OpenAI transcription stack, but still test multi-speaker files |
| Punctuation | 9.2/10 | More readable long-form transcripts with less cleanup |
| Diarisation | 9.2/10 | Good score, but model and endpoint support must be checked before launch |
| Cost efficiency | 8.8/10 | Strong value for batch transcription, especially with the mini model route |
| Realtime streaming | 8.8/10 | Better now with realtime options, though dedicated speech platforms remain strong |
Accuracy is not evenly distributed across every language, accent and recording condition. High-resource languages usually perform better. Low-resource languages, code-switching, noisy multi-speaker rooms and domain-specific vocabulary need direct testing. For a deeper quality discussion, see our OpenAI Whisper review.
Does ChatGPT Plus include OpenAI Whisper API usage?
No. ChatGPT subscriptions and OpenAI API usage are billed separately. ChatGPT Plus may include voice or file features inside the ChatGPT product, but that is not the same as using the API to transcribe files inside your own app, automation or SaaS workflow.
This is a common source of pricing confusion. A user can pay for ChatGPT Plus and still need separate API billing for programmatic transcription. Treat the API as its own usage-based bill.
OpenAI Whisper pricing FAQs
How much does OpenAI Whisper cost per minute in 2026?
The familiar OpenAI Whisper pricing figure is $0.006 per minute, equal to $0.36 per audio hour. In 2026, you should also compare gpt-4o-mini-transcribe at about $0.003 per minute, gpt-4o-transcribe at about $0.006 per minute and gpt-realtime-whisper at about $0.017 per minute for live transcription.
Is OpenAI Whisper API free?
No, the managed OpenAI API is paid. The open-source Whisper model can be run without OpenAI API fees, but self-hosting still has compute, storage, monitoring and engineering costs. Any trial credits should be treated as temporary evaluation budget, not a long-term free tier.
Is gpt-4o-mini-transcribe cheaper than whisper-1?
Yes, based on the current pricing structure, gpt-4o-mini-transcribe is about $0.003 per minute, while whisper-1 is commonly modelled at $0.006 per minute. Test quality before switching production traffic.
What is the latest OpenAI Whisper model in 2026?
For OpenAI’s managed API, the better 2026 framing is not just “latest Whisper”. Compare whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe and gpt-realtime-whisper. For open-source deployments, teams often compare Whisper variants and optimised implementations separately from the managed OpenAI API.
Does OpenAI bill per second or per minute for Whisper?
For budget planning, model transcription by the duration of audio you upload. If a file is 30 minutes long, forecast against 30 minutes unless your pipeline trims silence or splits the file before upload. Always check the current provider billing rules before launch.
Does OpenAI Whisper charge for silence?
Plan as though uploaded audio duration matters, not only spoken-word duration. If your files contain long gaps, silence trimming can reduce cost, but test carefully so you do not remove quiet speech, pauses that matter, or speaker transitions.
Do retries cost extra?
They can. If your application resubmits the same audio after a timeout or failed status check, you may create duplicate work. Use clear job states, idempotency logic and retry limits.
Does the OpenAI Whisper price include summarisation?
No. Transcription pricing covers the speech-to-text step. Summaries, chapters, action points, CRM notes, redaction, classification and structured extraction usually require separate processing.
Is Deepgram cheaper than OpenAI Whisper?
It depends on the model, streaming mode and features you compare. OpenAI’s gpt-4o-mini-transcribe is very competitive for low-cost batch transcription. Deepgram is often more compelling when realtime streaming and voice-agent latency are the main requirements.
Should I use OpenAI Whisper or self-host Whisper?
Use the managed API when you want low operational overhead, predictable integration and faster shipping. Consider self-hosting when volume is high, privacy or control requirements justify it, and your team can maintain GPU infrastructure.
Verdict: which transcription route should you choose?
OpenAI Whisper pricing is still attractive, but the best answer in 2026 is no longer “$0.006 per minute”. The better decision is to split your workflow into cheap batch, higher-quality batch, live realtime and self-hosted cases.
Use gpt-4o-mini-transcribe first for clean, high-volume batch audio. Use gpt-4o-transcribe when transcript quality matters more than shaving fractions of a cent per minute. Keep whisper-1 where existing integrations are stable and there is no strong reason to migrate. Use gpt-realtime-whisper only when live transcript deltas matter to the product experience. Compare Deepgram early for voice agents and realtime speech products. Consider self-hosted Whisper only when the operational burden is justified by scale or control.
The takeaway is simple: price the full workflow. Audio minutes are only the starting point. Diarisation, preprocessing, failed jobs, transcript cleanup, downstream AI calls and human correction decide whether the cheapest line in the table is actually the cheapest route.

