Transform Your Content Into Anything

Pick a target — any audio, video, or document. It gets converted to text through transcription or extraction, then combined with prompts to generate summaries, audio, images, music, and video.

Choose a Target

Your target is processed into text through transcription or document extraction. The text is then combined with prompts to generate text, audio, image, music, and video outputs.

Audio / Video

Podcast feed or episode, YouTube/Twitch/TikTok URL, MP3/MP4 file from your computer, or a direct URL to any public audio/video file.

Document

PDF, EPUB, PPTX, or DOCX. Documents are backed up to S3 and extracted using LlamaParse or Mistral OCR vision models.

Industry-Leading Transcription

Choose from 10+ transcription services. Audio longer than 10 minutes is automatically split into segments with proper timestamp tracking and combined into a single transcript.
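The splitting behavior described above can be sketched as follows. This is an illustrative sketch, not AutoShow's actual code; the helper names and the word-level transcript shape are assumptions.

```python
# Sketch: split long audio into 10-minute segments, then merge each
# segment's local timestamps back into one global transcript.
SEGMENT_SECONDS = 600  # 10-minute segments, as described above

def split_points(duration, seg=SEGMENT_SECONDS):
    """Return (start, end) boundaries in seconds covering the full duration."""
    points = []
    start = 0.0
    while start < duration:
        points.append((start, min(start + seg, duration)))
        start += seg
    return points

def merge_transcripts(segments, seg=SEGMENT_SECONDS):
    """Offset each segment's local timestamps by its position in the file."""
    merged = []
    for i, words in enumerate(segments):
        for w in words:
            merged.append({"text": w["text"], "ts": w["ts"] + i * seg})
    return merged
```

A 25-minute file, for example, yields three segments (two full 10-minute segments plus a 5-minute remainder), and a word at 0:05 in the second segment lands at 10:05 in the merged transcript.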

With Speaker Diarization

Identify who said what with speaker labels and timestamps. Services include HappyScribe, AssemblyAI, Deepgram Nova-3, Soniox, Rev, Gladia, ElevenLabs Scribe, Fal Whisper, and Lemonfox.

Fast Transcription

Get rapid transcripts without speaker identification using Groq Whisper Large V3 Turbo or DeepInfra Whisper. Optimized for speed when speaker labels aren't needed.

Document Extraction

Extract text from PDFs, images, and documents using LlamaParse for multi-page documents or Mistral OCR for vision-based text extraction.

Structured Output from 4 LLM Providers

Generate summaries, chapters, FAQs, takeaways, and more using OpenAI GPT-4o, Claude, Gemini, or Groq. All providers support structured JSON output with automatic retry and fallback logic.

Content Generation

Create short summaries (180 chars), long summaries, bullet points, key takeaways, chapters with timestamps, FAQ sections, and custom prompt outputs.
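The output types above map naturally onto a JSON schema for structured LLM output. The schema below is an illustration of that shape; the field names are assumptions, not AutoShow's actual schema.

```python
# Illustrative JSON schema covering the output types listed above.
output_schema = {
    "type": "object",
    "properties": {
        "short_summary": {"type": "string", "maxLength": 180},
        "long_summary": {"type": "string"},
        "takeaways": {"type": "array", "items": {"type": "string"}},
        "chapters": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "timestamp": {"type": "string"},  # e.g. "00:12:30"
                    "title": {"type": "string"},
                },
                "required": ["timestamp", "title"],
            },
        },
        "faq": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "question": {"type": "string"},
                    "answer": {"type": "string"},
                },
            },
        },
    },
    "required": ["short_summary", "long_summary"],
}
```

Passing a schema like this to a provider's structured-output mode constrains the model to return valid, parseable JSON rather than free-form text.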

Automatic Fallback

Three-attempt retry logic with provider fallback: if your selected model fails, AutoShow automatically tries alternate models and providers to ensure completion.
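The retry-with-fallback behavior can be sketched as a nested loop: retry the selected provider up to three times, then move to the next one. This is a minimal sketch assuming a simple callable-per-provider interface, not AutoShow's actual implementation.

```python
# Sketch: retry each provider up to `attempts` times, then fall back
# to the next provider in the list. Raise only if every provider fails.
def generate_with_fallback(prompt, providers, attempts=3):
    errors = []
    for name, call in providers:
        for attempt in range(attempts):
            try:
                return call(prompt)
            except Exception as exc:
                errors.append((name, attempt + 1, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

A transient failure (e.g. a rate limit on the first two calls) is absorbed by the retry loop, while a hard provider outage triggers the fallback to the next provider without surfacing an error to the user.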

AI-Powered Media Generation

Go beyond transcription. Generate narrated audio, cover images, original music, and video clips from your content using the latest generative AI models.

Text-to-Speech

Convert summaries to narrated audio using OpenAI TTS, ElevenLabs, or Groq. Choose from multiple voices and output formats (WAV/MP3). OpenAI supports custom voice instructions.

AI Image Generation

Create cover art, thumbnails, and promotional images using OpenAI DALL-E (gpt-image-1.5), Gemini, or MiniMax. Generate 1-3 images per job with customizable dimensions and aspect ratios.

Music Generation

Generate original theme music with AI-written lyrics in genres including Pop, Rock, Rap, Country, Folk, Jazz, and Electronic. Powered by Eleven Music or MiniMax Music.

Video Generation

Create explainer clips, highlights, intros, outros, and social media videos. Use OpenAI Sora (4-12s), Gemini Veo (up to 4K), or MiniMax Hailuo. All prompts include safety filtering.

8-Step Processing Pipeline

Content flows through a configurable pipeline. Each step can be customized with different providers and models. Optional steps are skipped if not enabled.

1

Download and Extract

Audio extracted and converted to 16 kHz WAV and 32 kbps MP3. Documents backed up to S3. Video URLs passed directly to transcription.

2

Transcribe

Audio transcribed with timestamps. Long files auto-split into 10-minute segments. Documents extracted to markdown.

3

Build Prompts

Dynamic prompts assembled with metadata, transcript, and selected output types. JSON schemas generated for structured output.

4

Generate Content

LLM generates structured summaries, chapters, FAQs, and more. Automatic retry with provider fallback.

5

Text-to-Speech

Optional: Convert text output to narrated audio. Upload to S3 for persistent access.

6

Generate Images

Optional: Create AI images from title and text output. Multiple image types supported per job.

7

Generate Music

Optional: AI writes genre-specific lyrics, then composes original theme music (up to 3 minutes).

8

Generate Video

Optional: AI writes scene descriptions, then renders video clips (4-12 seconds) with thumbnails.
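The eight steps above can be sketched as a linear pipeline in which the four generation steps run only when enabled. The step functions here are placeholder stubs standing in for the real providers; this is a structural sketch, not AutoShow's actual code.

```python
# Placeholder stubs for the eight pipeline steps described above.
def download_and_extract(target):
    return {"wav": f"{target}.wav", "mp3": f"{target}.mp3"}   # step 1

def transcribe(media):
    return "[00:00] transcript text"                          # step 2

def build_prompts(transcript, options):
    return {"system": "summarize", "transcript": transcript}  # step 3

def generate_content(prompts):
    return {"short_summary": "..."}                           # step 4

def text_to_speech(content):  return "narration.mp3"          # step 5
def generate_images(content): return ["cover.png"]            # step 6
def generate_music(content):  return "theme.mp3"              # step 7
def generate_video(content):  return "clip.mp4"               # step 8

def run_pipeline(target, options):
    """Run steps 1-4 always; run steps 5-8 only when enabled."""
    media = download_and_extract(target)
    transcript = transcribe(media)
    prompts = build_prompts(transcript, options)
    results = {"content": generate_content(prompts)}
    if options.get("tts"):
        results["audio"] = text_to_speech(results["content"])
    if options.get("images"):
        results["images"] = generate_images(results["content"])
    if options.get("music"):
        results["music"] = generate_music(results["content"])
    if options.get("video"):
        results["video"] = generate_video(results["content"])
    return results
```

The key design point is that disabled steps are skipped entirely rather than producing empty output, so a transcription-only job pays nothing for the generation stages.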

Technical Specifications

Built for reliability and scale with enterprise-grade infrastructure.

Transcription Services

HappyScribe, AssemblyAI, Deepgram, Soniox, Rev, Gladia, ElevenLabs Scribe, Fal, Lemonfox, Groq Whisper, DeepInfra Whisper, Supadata, deAPI.

LLM Providers

OpenAI (GPT-4o, GPT-4o-mini), Anthropic Claude (Sonnet, Haiku), Google Gemini (2.0 Flash, 1.5 Pro), Groq (Llama, Mixtral).

TTS Services

OpenAI TTS (gpt-4o-mini-tts, coral voice), ElevenLabs (eleven_flash_v2_5), Groq (canopylabs/orpheus-v1).

Image Generation

OpenAI DALL-E (gpt-image-1.5), Gemini (gemini-2.5-flash-image), MiniMax (image-01). Dimensions up to 1536x1024.

Music Generation

Eleven Music (music_v1), MiniMax Music (music-2.5). Genres include pop, rock, rap, country, folk, jazz, and electronic.

Video Generation

OpenAI Sora (sora-2, sora-2-pro), Gemini Veo (veo-3.1, up to 4K), MiniMax Hailuo (Hailuo-2.3). Durations 4-12 seconds.

Cloud Storage

S3-compatible storage with presigned URLs. All media automatically uploaded for persistent access. Supports Railway Storage Buckets.
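A presigned URL embeds an expiry and a signature in the object URL so it can be fetched without credentials until it expires. The sketch below shows only the idea using a bare HMAC; real S3 presigning uses AWS Signature Version 4, and the parameter names here are illustrative.

```python
# Simplified illustration of presigned URLs (NOT real AWS SigV4 signing):
# the server signs "key:expiry" with a secret, and the storage layer can
# later verify the signature and reject expired or tampered links.
import hashlib
import hmac
from urllib.parse import urlencode

def presign(base_url, key, secret, ttl, now):
    expires = now + ttl
    sig = hmac.new(secret, f"{key}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{base_url}/{key}?" + urlencode({"expires": expires, "sig": sig})
```

Because the signature covers both the object key and the expiry, changing either one invalidates the link; this is what lets generated media stay privately stored yet shareable for a limited window.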

Document Processing

LlamaParse (PDF/DOCX to markdown), Mistral OCR (vision-based extraction). Supports PDF, DOCX, PNG, JPG, TIFF, TXT.

Frequently Asked Questions

What input formats does AutoShow support?

AutoShow supports video files (MP4, MOV, AVI), audio files (MP3, WAV, M4A), YouTube URLs, streaming URLs, direct file URLs, and documents (PDF, DOCX, PNG, JPG, TIFF, TXT).

Which transcription service should I choose?

For speaker identification, use HappyScribe, AssemblyAI, Deepgram, or ElevenLabs Scribe. For fastest results without speaker labels, use Groq Whisper or DeepInfra Whisper. HappyScribe is required for YouTube and streaming URLs.

Is there a limit on content length?

There's no hard limit. Audio longer than 10 minutes is automatically split into segments with timestamp tracking. Each segment is transcribed separately and the results are combined. Very long content (3+ hours) may take several minutes to process.

Which LLM providers are supported?

OpenAI (GPT-4o, GPT-4o-mini), Anthropic Claude (Sonnet, Haiku), Google Gemini (2.0 Flash, 1.5 Pro), and Groq (Llama, Mixtral). All support structured JSON output. If your selected provider fails, AutoShow automatically falls back to alternatives.

How does video generation work?

First, an LLM generates a detailed scene description based on your content. Then the scene is rendered using OpenAI Sora (4-12s clips), Gemini Veo (up to 4K resolution), or MiniMax Hailuo. Video types include explainer, highlight, intro, outro, and social clips.

Where are my output files stored?

All files are saved locally in timestamped output directories. If S3 storage is configured (Railway Storage Buckets or any S3-compatible service), media files are also uploaded with presigned URLs for persistent access.

What music genres are available?

AutoShow supports 11 genres: rap, rock, pop, country, folk, jazz, ambient, electronic, cinematic, techno, and lofi. An LLM first writes original, copyright-safe lyrics tailored to your content, then the music is composed.

Start Processing Today

Transform your content with AI transcription, summarization, and generation.

Usage-based pricing. No subscriptions or hidden fees.