Why Reddit is Blocking the Internet Archive for AI Scraping (and its $200M Plan)

Monday, August 11, 2025

Reddit is now restricting the Internet Archive (IA) from indexing its popular threads, a move prompted by Reddit's discovery that some AI companies were circumventing its scraping policies. Instead of scraping Reddit directly, these firms were allegedly pulling data from the IA's archived copies. In response, Reddit is limiting the Wayback Machine to archiving only screenshots of its homepage. As a result, the archive will no longer serve as a comprehensive record of deleted posts, user activity, or subreddit cultures, which significantly reduces the IA's utility for documenting and preserving Reddit's content.

Reddit's spokesperson, Tim Rathschmidt, suggested that the Internet Archive could take steps to better protect against AI scraping. Rathschmidt also cited privacy concerns, stating that the Wayback Machine problematically archives content that users have deleted. He noted that until the IA can better "respect user privacy" and comply with Reddit's policies regarding deleted content, the restrictions would remain in place. Although some Redditors previously used the Wayback Machine to find deleted comments, many others have pointed out that several alternative tools exist for this purpose.

Reddit's decision is probably also financially motivated. The company has recently signed lucrative licensing deals with AI companies such as OpenAI and Google, with the Google deal reportedly worth about $60 million per year. Reddit expects to earn over $200 million from such licensing agreements in the next three years. By limiting the Internet Archive's access, Reddit may be aiming to force AI firms to strike similar deals directly with it, rather than getting its data for free from the Wayback Machine.

Conclusion

This situation highlights a growing tension between content platforms, AI companies, and digital archivists. While Reddit claims its primary concern is user privacy and protecting against policy violations, the financial incentives from licensing deals with AI firms are a significant factor. For the Internet Archive, this presents a challenge to its mission of preserving the open web, as a major source of online discourse is now largely inaccessible for archiving. The ongoing discussions between Reddit and the Internet Archive will likely determine the future of how Reddit's content is preserved and accessed outside of its platform.

You can learn more about this topic by visiting these articles from The Verge and Ars Technica.

Note: this content was summarized by Google Gemini.