# ==============================================================
# Soap Media - robots.txt
# https://www.soapmedia.co.uk
# Last updated: 2026-05-12
# Strategy: Allow citation/search bots, block training crawlers,
# allow social previews, defensive admin/build blocks.
# Review quarterly - AI crawler landscape changes fast.
# ==============================================================

# ============================================================
# SEARCH ENGINES + AI ANSWER/CITATION BOTS - Allow
# These drive search rankings and AI answer visibility.
# Documented separation from training crawlers below.
# ============================================================
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
User-agent: Bingbot
User-agent: Applebot
User-agent: DuckDuckBot
User-agent: DuckAssistBot
User-agent: Slurp
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: MistralAI-User
User-agent: YouBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: cohere-ai
Allow: /
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-json/
Disallow: /api/
Disallow: /_nuxt/builds/
Disallow: /preview/
Disallow: /draft/
Disallow: /thank-you/
Disallow: /thanks/
Disallow: /search?
Disallow: /?s=
Disallow: /*?utm_
Disallow: /*?gclid=
Disallow: /*?fbclid=
Disallow: /*?msclkid=

# ============================================================
# SOCIAL MEDIA LINK PREVIEW BOTS - Allow
# Required for rich preview cards when URLs shared on social.
# Without these, shared Soap links render as bare URLs.
# ============================================================
User-agent: Twitterbot
User-agent: facebookexternalhit
User-agent: facebot
User-agent: LinkedInBot
User-agent: Pinterestbot
User-agent: TelegramBot
User-agent: WhatsApp
User-agent: Discordbot
User-agent: SkypeUriPreview
Allow: /

# ============================================================
# AI TRAINING CRAWLERS - Block
# Evidence (BuzzStream Mar 2026, 4M citations): blocking these
# reduces AI citation rates by ~7–12% only. Trade-off favours
# protecting Soap's case-study + methodology content moat from
# extraction-with-no-referral (Anthropic crawls 38,000 pages per
# referral; OpenAI similar). Search/retrieval bots above remain
# allowed and drive the citations.
# ============================================================

# OpenAI training crawler (distinct from OAI-SearchBot above)
User-agent: GPTBot
Disallow: /

# Anthropic training crawler (distinct from Claude-SearchBot/Claude-User above)
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /

# Google Gemini training opt-out (does NOT affect Search ranking
# or AI Overview citation per Google docs Apr 2025)
User-agent: Google-Extended
Disallow: /

# Apple Intelligence training opt-out (does NOT affect Applebot
# search visibility)
User-agent: Applebot-Extended
Disallow: /

# Common Crawl - historically supplied paywalled content to AI
# vendors without publisher consent (PPC Land, Nov 2025)
User-agent: CCBot
Disallow: /

# Amazon AI training
User-agent: Amazonbot
Disallow: /

# ByteDance training (powers TikTok/Doubao recommendations)
# Documented to ignore directives - Cloudflare WAF is the real
# enforcement; this is symbolic intent.
User-agent: Bytespider
User-agent: Bytedance
Disallow: /

# ============================================================
# SEO TOOL CRAWLERS - Allow (matches Cloudflare WAF policy)
# Soap uses these in-house and clients run them against Soap.
# AhrefsBot already allowlisted at Cloudflare WAF - keep robots.txt
# consistent.
# ============================================================
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
User-agent: DotBot
User-agent: MojeekBot
Allow: /

# Yandex - minor UK traffic but allowed (matches Cloudflare verified bots)
User-agent: YandexBot
Allow: /

# ============================================================
# GENERIC LIBRARIES / SCRAPERS - Disallow
# These typically ignore robots.txt anyway. Real enforcement is
# at Cloudflare WAF. This is symbolic but documents intent.
# ============================================================
User-agent: Go-http-client
User-agent: python-requests
User-agent: Scrapy
User-agent: curl
User-agent: Wget
Disallow: /

# ============================================================
# CATCH-ALL - Default behaviour for any bot not listed above
# Permissive by default. Block admin/system paths.
# No Crawl-delay (Google ignores it; modern bots self-throttle).
# ============================================================
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-json/
Disallow: /api/
Disallow: /_nuxt/builds/
Disallow: /preview/
Disallow: /draft/
Disallow: /cgi-bin/
Disallow: /thank-you/
Disallow: /thanks/
Disallow: /search?
Disallow: /?s=
Disallow: /*?utm_
Disallow: /*?gclid=
Disallow: /*?fbclid=
Disallow: /*?msclkid=

# ============================================================
# SITEMAPS
# ============================================================
Sitemap: https://www.soapmedia.co.uk/sitemap.xml
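
# ============================================================
# OPTIONAL SKETCH (inactive, illustrative only)
# The /*?utm_ style patterns in the groups above match a tracking
# parameter only when it directly follows "?". Assuming the
# crawler honours "*" wildcards, the commented-out lines below
# would also match the same parameters later in the query string,
# for instance a hypothetical /page?ref=x&utm_source=y.
# They are an assumption, not part of the live policy; to apply
# them, uncomment and copy into each group that lists the "?" forms.
# ============================================================
# Disallow: /*&utm_
# Disallow: /*&gclid=
# Disallow: /*&fbclid=
# Disallow: /*&msclkid=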