As AI agent traffic surpasses human activity online, the clash between companies like Perplexity and Cloudflare highlights the urgent need to redefine norms of online agency, protect website security, and establish enforceable international standards for responsible AI behaviour.

Disagreements and legal actions over copyright infringement and intellectual property theft have proliferated since large language models (LLMs) entered mainstream use in 2022. While LLMs offer unprecedented means to streamline and automate core economic sectors, from coding to public service delivery, the pace of AI advancement has left regulators struggling to keep up with the emergent capabilities of frontier models. The scramble to regulate LLMs is set to become increasingly complicated with AI agents on the algorithmic horizon.

Agents are AI systems that add layers of capability on top of LLMs, such as executing user commands in virtual environments or on edge devices, retrieving user-requested information from websites, and building and deploying software by automating multi-stage decision-making and execution. While LLMs draw on a mix of user-generated and synthetic data to produce textual and audio-visual responses to queries, agents are programmed to take actions in the digital space. Traditionally, algorithmic tools known as ‘crawlers’ have been deployed to collect data from websites en masse. Agents may resemble crawlers in function, but they differ in purpose: crawlers are fundamentally designed to collect information for building databases, whereas AI agents retrieve specific data based on user inputs and can automate and execute entire workflows. While such novel capabilities may be paving the way for a paradigm shift in users’ experience of the internet, they also raise questions about the future of an internet increasingly populated by computational entities. A key set of questions has arisen from a public dispute between the AI company Perplexity and the web security provider Cloudflare.

In August 2025, Cloudflare accused Perplexity of allowing its AI agents to scrape data from websites that explicitly prohibit data scraping by crawlers. In response, Perplexity argued that agents differ from crawlers in that agents should be seen as extensions of the users themselves: an agent accessing a website in response to a user query should therefore be treated as a user, not a crawler. The distinction between AI agents and crawlers may seem pedantic, but such definitional ambiguity can harden into a fundamental obstacle for AI development and adoption, both of which have become strategic priorities for nations across the world.

The Problem with Digitising Agency

The internet has long operated on norms of trust and openness, thanks to the mutually beneficial relationship between search engines and websites. The sheer number of websites on the internet limits the ability of search engines to catalogue and feature all of them. Websites looking to increase their viewership, and with it their potential advertising revenue, have done so by granting search engine crawlers access to their data. Websites use machine-readable “robots.txt” files to demarcate no-crawl data zones and keep specific HTML pages from being featured on search engines. These files are, functionally, requests for privacy: websites trust crawlers not to bypass them, and crawlers have generally respected website boundaries. This symbiotic relationship reflects norms of internet trust and a consensus on the marginal utility gains of open access practices.
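In practice, compliance with a robots.txt file is voluntary: a crawler must choose to fetch the file and honour its rules. A minimal sketch of that check, using Python’s standard library and a hypothetical “ExampleBot” user agent, might look as follows.

    # A well-behaved crawler consults robots.txt before fetching a page.
    # "ExampleBot" and the URLs below are hypothetical placeholders.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse the site's crawl rules

    # A robots.txt entry demarcating a no-crawl zone looks like:
    #   User-agent: ExampleBot
    #   Disallow: /private/
    if parser.can_fetch("ExampleBot", "https://example.com/private/page.html"):
        print("Allowed: fetch the page")
    else:
        print("Disallowed: skip the page and respect the boundary")

Nothing in this mechanism stops a crawler from skipping the check altogether; the protocol works only as long as crawlers choose to run it.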

The last few years have seen an influx of agentic AI crawlers using website data for two purposes: training LLMs and retrieving user-requested data. Agentic AI crawlers have been known to act deceptively, circumventing blockers, ignoring machine-readable files, and overwhelming websites in order to gather as much data as possible. They feed on websites parasitically, benefitting from the same sites on which they impose heavy bandwidth costs and from which they divert potential traffic.


At its core, the disruptive behaviour of agents concerns permission to access websites. The dispute between Perplexity and Cloudflare has sparked a contentious debate over whether an agent acting on behalf of a human should be treated as a human or a bot. While reasonable arguments can be made for either case, the risks posed by agentic autonomy are nuanced. Websites’ cybersecurity tools generally rely on signature-based bot detection protocols, which LLM-based agents can confound because they are able to reason about and exploit weaknesses in novel ways, as the sketch below illustrates. Emergent AI capabilities allow crawlers to behave in ways that do not resemble traditional pre-programmed bot behaviour. Moreover, multi-agent swarms can emulate human behaviour, since they can be deployed through browsers or virtual environments hosted on edge devices owned by humans. Accelerating the development and adoption of AI agents thus poses two problems. First, it fundamentally erodes the monetisation strategies of smaller and open-source domains. Second, agents introduce new cyberattack vectors through emergent capabilities and by employing layers of obfuscation to mask their identities.
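To see why such defences are brittle, consider a simplified sketch of signature-based detection; the signature list and request headers below are invented for illustration and bear no relation to any vendor’s actual logic.

    # Simplified sketch of signature-based bot detection. Real systems use
    # far richer fingerprints (TLS parameters, IP reputation, timing), but
    # the core weakness is the same: the check keys on declared identity.
    KNOWN_BOT_SIGNATURES = ("ExampleBot", "python-requests", "curl")

    def looks_like_bot(headers: dict) -> bool:
        user_agent = headers.get("User-Agent", "")
        return any(signature in user_agent for signature in KNOWN_BOT_SIGNATURES)

    # A declared crawler is caught...
    print(looks_like_bot({"User-Agent": "ExampleBot/1.0"}))                 # True
    # ...but an agent presenting a browser-like string slips straight through.
    print(looks_like_bot({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))  # False

An agent that can reason about why it was blocked can generate exactly the fingerprints such filters expect to see.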

The Way Ahead: A Two-Pronged Approach

The internet has reached an inflexion point this year with the arrival of AI agents: for the first time in history, bot activity has surpassed total human activity online. This phase shift is creating a scenario where “AI agents, rather than humans, are the primary consumers of content.” While this digital infestation challenges web protocols, it is also a natural consequence of heavy investment in AI development. As AI regulation becomes a cornerstone of national strategies to drive economic competitiveness, slow uptake or adoption of AI products and services can become a bottleneck for countries prioritising digital transformation. However, accelerating adoption will worsen the strain, as the surge in AI traffic balloons bandwidth costs and erodes advertising revenue for website and domain owners.


As agentic AI crawlers cannot be trusted to respect internet security norms, the cost of protecting copyrighted and private data falls on individual website owners. Websites will be inclined to use anti-crawling software, making access harder for agentic and search engine crawlers alike. Smaller website owners unable to afford sophisticated cybersecurity programmes are more likely to hide their content behind paywalls and subscriptions, while others may take their content offline. On current trends, open-source code and open-access information will likely cease to be easily accessible. AI crawlers will suffer in turn, as new training content grows scarce and information retrieval becomes harder in a closed and distrusting internet ecosystem.

A two-pronged approach can be adopted to protect website cybersecurity and encourage norms of transparency. Governments should create databases of websites that opt out of AI crawling and require mandatory compliance reports from AI companies, which can be cross-checked against these databases. Furthermore, governments should subsidise cybersecurity programmes such as Cloudflare’s AI Labyrinth, making them available to participating websites at lower cost.
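To make the cross-checking mechanism concrete, the toy sketch below compares a company’s self-reported crawl log against an opt-out registry; every domain name and data structure in it is hypothetical.

    # Toy illustration of the proposed compliance check: domains an AI
    # company reports crawling are cross-checked against a government
    # opt-out registry. All domain names here are hypothetical.
    OPT_OUT_REGISTRY = {"news-site.example", "smallblog.example"}

    reported_crawls = ["openwiki.example", "smallblog.example", "forum.example"]

    violations = [domain for domain in reported_crawls if domain in OPT_OUT_REGISTRY]
    if violations:
        print("Potential violations to investigate:", violations)
    else:
        print("No opt-out domains found in the reported crawl log")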

Regulators should use anonymised data from cybersecurity programmes and cross-checked compliance reports to penalise AI companies that bypass cybersecurity measures and stated boundaries. As a basic standard-setting measure, AI companies deploying agents should be mandated to adopt best practices, such as disallowing agents from changing their autonomous system number (ASN) or user agent string (which identifies a browser and system to a server) to mask their identity when confronted with a network block; the sketch below illustrates what such transparent behaviour looks like in practice. Finally, multilateral forums should be promoted to institute international norms against deceptive and malicious agentic behaviour, building on precedents such as the Robot Exclusion Protocol, introduced in the 1990s and standardised under the Internet Engineering Task Force in 2022, and to enforce and track the efficacy of cybersecurity protocols.
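What such a best practice could look like in code is sketched below: an agent that declares a stable identity on every request and treats a block as a stop signal rather than a cue to rotate identities. The “ExampleAgent” identity and URL are hypothetical.

    # Sketch of a transparently identified agent. It sends the same
    # User-Agent string on every request and, when blocked (HTTP 403),
    # backs off instead of masking its identity.
    import urllib.request
    import urllib.error

    AGENT_IDENTITY = "ExampleAgent/1.0 (+https://example.com/agent-info)"

    def fetch(url: str) -> bytes | None:
        request = urllib.request.Request(url, headers={"User-Agent": AGENT_IDENTITY})
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            if error.code == 403:
                # Blocked by the site: respect the refusal.
                return None
            raise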


Siddharth Yadav is a Fellow with the Technology vertical at the ORF Middle East.

Ambika Sondhi is an independent researcher with a focus on international security.

Authors

Siddharth Yadav

Siddharth Yadav is a Fellow in Technology with an academic background in history, literature and cultural studies. He holds a BA (Hons) and an MA in History from the University of Delhi, followed by an MA in Cultural Studies of Asia, Africa, and the Middle East from SOAS, University of London. Subsequently, he completed his doctoral research...

Ambika Sondhi

Ambika Sondhi graduated from the Rutgers School of Arts and Sciences with majors in Political Science and French and a minor in Economics. Her publications feature in the Eagleton Political Journal and the Hamiltonian. She has interned in Washington, D.C. at the Alexander Hamilton Society and at the Plangere Writing Center at Rutgers University. Under...
