The 2026 Compliance Landscape: Why "Can We Scrape It?" Became "Can We Scrape It Responsibly?"
Eighteen months ago, enterprise data teams asked one question before launching a scraping project: will it work? In 2026, the first question is will it hold up in court?
The shift is not theoretical. The EU AI Act's high-risk system requirements take effect on August 2, 2026, with penalties reaching €35 million or 7% of global revenue. GDPR enforcement has now surpassed €5.88 billion in cumulative fines since 2018, with 2025 alone accounting for €2.3 billion — a 38% year-over-year increase. In the United States, Google filed a federal lawsuit against SerpApi in December 2025, alleging DMCA violations for circumventing its SearchGuard anti-bot system. Reddit sued SerpApi, Oxylabs, and Perplexity AI two months earlier for similar conduct.
The message from courts, regulators, and platforms is now unanimous: data collection at scale requires a compliance architecture, not just a technical one. This guide breaks down what changed, what the law actually says, and what your procurement and engineering teams need to implement before enforcement catches up.
Key Legal Developments: EU AI Act, Google v. SerpApi, and the hiQ Legacy
The EU AI Act: Data Governance Meets Scraping Infrastructure
The EU AI Act does not regulate web scraping directly. It regulates what happens after the data is collected. Starting August 2, 2026, any organization deploying high-risk AI systems — spanning employment decisions, credit scoring, education, and law enforcement — must demonstrate full compliance with data governance requirements, technical documentation standards, and transparency obligations.
For data teams that scrape the web to train models or feed AI pipelines, three provisions matter most:
Training data disclosure — AI providers must disclose data sources and respect copyright opt-outs under the EU Copyright Directive.
Transparency rules (Article 50) — AI-generated content must be labeled, and systems interacting with humans must disclose that fact. Both provisions become enforceable in August 2026.
GPAI model obligations — Providers of general-purpose AI models become subject to enforcement and fines starting August 2, 2026, with penalties of up to 3% of worldwide annual turnover or €15 million for copyright-related violations.
The practical impact: if your AI web scraping pipeline feeds a model deployed in the EU, the provenance of every dataset becomes auditable. "We scraped it from public sources" is no longer a sufficient answer.
Google v. SerpApi: The DMCA Meets Anti-Bot Enforcement
On December 19, 2025, Google filed suit against SerpApi in the Northern District of California, alleging violations of DMCA Section 1201 — the anti-circumvention provision. The core claim: SerpApi bypassed Google's SearchGuard system (launched January 2025) to scrape hundreds of millions of search result pages daily, then resold the data through a paid API.
Google's complaint alleges that SerpApi's automated queries increased by 25,000% over two years. SearchGuard, which Google describes as costing "tens of thousands of person hours and millions of dollars," was designed to block exactly this kind of automated access. SerpApi allegedly circumvented it by masking bots as human users, syndicating authorization tokens, and misrepresenting device and location data.
The case is significant because Google is not relying on traditional copyright infringement claims alone. The DMCA framing means the method of access — bypassing a technological protection measure — is itself the violation. If Google prevails, it establishes that anti-bot systems like SearchGuard qualify as DMCA-protected access controls, making circumvention illegal regardless of whether the underlying data is "public."
The hiQ v. LinkedIn Legacy
The hiQ Labs v. LinkedIn saga ended in December 2022 with a settlement: hiQ agreed to a permanent injunction, $500,000 in damages, and destruction of all scraped data and source code. But the precedent that survived is instructive.
The Ninth Circuit twice held that scraping publicly available data does not violate the Computer Fraud and Abuse Act (CFAA). The Supreme Court vacated and remanded the case for reconsideration after its Van Buren decision, but the Ninth Circuit reaffirmed its position in April 2022. The CFAA's "without authorization" standard, the court found, does not apply to data that anyone with a web browser can see.
However, the district court ultimately ruled that hiQ violated LinkedIn's User Agreement through automated scraping and fake profile creation. The takeaway: scraping public data may not be a federal crime, but it can absolutely be a breach of contract. Platforms increasingly enforce their Terms of Service as the primary legal weapon against scrapers.
| Case | Filed | Core Legal Theory | Status (Feb 2026) |
|---|---|---|---|
| Google v. SerpApi | Dec 2025 | DMCA §1201 (circumvention of SearchGuard) | Active — pre-trial |
| Reddit v. SerpApi et al. | Oct 2025 | DMCA + ToS violation | Active — pre-trial |
| Ziff Davis v. OpenAI | 2025 | DMCA (robots.txt as TPM) | Partially dismissed |
| hiQ v. LinkedIn | 2017 | CFAA + breach of contract | Settled — $500K + injunction |
| Reddit v. Anthropic | 2025 | Breach of ToS (robots.txt) | Active — pre-trial |
Robots.txt in 2026: From Suggestion to Legal Signal
For decades, robots.txt was treated as a polite request — a gentleman's agreement between site owners and crawlers. That framing is collapsing, but not in the way most people assume.
In late 2025, Judge Sidney H. Stein of the Southern District of New York ruled in Ziff Davis v. OpenAI that robots.txt files do not constitute "technological measures that effectively control access" under the DMCA. Ziff Davis had argued that OpenAI's bots ignoring its robots.txt amounted to unlawful circumvention under the DMCA's anti-circumvention provision. The court disagreed: robots.txt lacks the technical enforcement characteristics the DMCA requires.
But that is not the end of the story. In Reddit v. Anthropic, Reddit's lead claim is breach of its Terms of Service — a contract theory, not a copyright one. Reddit argues that its ToS explicitly prohibits scraping and that robots.txt serves as one layer of that prohibition. This approach avoids the Ziff Davis problem entirely by shifting the legal theory from DMCA circumvention to contract breach.
For enterprises, the practical guidance is clear:
Robots.txt is not legally binding on its own, but ignoring it destroys good-faith arguments in court.
Terms of Service are enforceable, especially when a scraper has actual knowledge of them (as courts found in the Meta v. Bright Data ruling in 2024).
GDPR enforcement reinforces robots.txt — European data protection authorities increasingly view robots.txt violations as evidence of non-compliance with data minimization principles.
The French CNIL published updated guidance in June 2025 specifically addressing web scraping for AI development, confirming that legitimate interest under GDPR requires documented, proportionate justification — and that ignoring site-owner preferences undermines that justification.
How Anti-Bot Enforcement Changed the Rules
The technical landscape has shifted as dramatically as the legal one. In July 2025, Cloudflare announced that it would block AI bots by default for all customers, including free-tier users. Given that approximately 20% of websites use Cloudflare, this single decision reshaped the scraping ecosystem overnight.
Cloudflare's Turnstile — a CAPTCHA replacement that uses behavioral analysis and browser fingerprinting — now dynamically adjusts challenge difficulty based on traffic patterns. Google's SearchGuard, deployed in January 2025, uses similar principles to distinguish human users from automated scrapers. Both systems represent a new class of anti-bot technology: adaptive, ML-driven, and designed to make circumvention itself a legal liability (as the Google v. SerpApi case demonstrates).
For data collection teams, this creates a fork in the road:
| Approach | Legal Risk | Technical Sustainability | Compliance Posture |
|---|---|---|---|
| Circumvent anti-bot systems | High — DMCA §1201 exposure | Low — constant cat-and-mouse | Indefensible |
| Use shared residential proxy pools | Medium — IP sourcing questions | Medium — pool quality degrades | Questionable |
| Use dedicated mobile proxy hardware | Low — real carrier IPs, no circumvention | High — stable, auditable | Strong |
| Official APIs / data licensing | Lowest — contractual access | Highest — sanctioned access | Optimal |
The trend line is unmistakable: the anti-bot arms race is becoming a legal arms race. Circumventing protections is no longer just an engineering challenge — it is a litigation risk.
Proxy Sourcing Ethics: Shared Pools vs. Dedicated Hardware
Not all proxies carry the same compliance burden. How your proxy vendor sources its IP addresses matters as much as what you do with them.
The proxy industry broadly divides into three sourcing models:
SDK/proxyware pools — Residential IPs sourced through apps installed on consumer devices. The user theoretically "consents" by accepting terms embedded in another application. Consent chains are often opaque, and privacy risks are well-documented.
Shared residential pools — Large pools of IPs aggregated from multiple sources. Compliance varies wildly. Without visibility into how each IP was acquired, enterprises inherit their vendor's sourcing risk.
Dedicated bare-metal hardware — Physical devices connected to real mobile carriers, owned and operated by the proxy provider. No consumer devices are involved, no consent chains are needed, and the IP provenance is fully auditable.
From a GDPR perspective, the distinction is critical. European Data Protection Authorities have fined companies for scraping-related violations totaling over €50 million in Clearview AI cases alone. Poland fined a data broker €220,000 for scraping public business registries. France's CNIL fined KASPR €200,000 for scraping LinkedIn profiles. In every case, the regulators scrutinized not just what data was collected but how the collection infrastructure was sourced and operated.
For enterprises running data collection at scale, dedicated infrastructure eliminates the largest category of sourcing risk. When the proxy hardware is owned by the provider — real 5G SIM cards in physical devices connected to carrier networks — there is no ambiguity about consent, no third-party SDK in the chain, and no residential user whose privacy is at stake. This is the model Illusory operates: bare-metal mobile proxy infrastructure with dedicated hardware per customer, real carrier IPs, and full auditability.
| Sourcing Model | Consent Clarity | GDPR Risk | Audit Trail | Enterprise Suitability |
|---|---|---|---|---|
| SDK / Proxyware | Opaque — buried in ToS | High | Minimal | Low |
| Shared Residential Pool | Varies by sub-vendor | Medium-High | Limited | Medium |
| Dedicated Bare-Metal Hardware | N/A — no consumer IPs | Low | Full | High |
Enterprise Compliance Checklist: What Procurement Teams Should Ask Proxy Vendors
Legal and procurement teams evaluating proxy vendors should treat the selection process like any other third-party risk assessment. The following checklist covers the questions that matter in 2026:
IP Sourcing and Ethics
How are residential or mobile IPs acquired? Is there explicit, informed consent from device owners?
Does the provider use SDK-based acquisition through bundled consumer apps?
Can the vendor produce a documented consent chain for any IP address in their pool?
Does the vendor hold ISO 27001, SOC 2, or equivalent security certifications?
Data Protection and GDPR
What personal data does the vendor collect from proxy network participants?
Is collected data minimized, encrypted, and time-bound per GDPR requirements?
Does the vendor have a published Data Processing Agreement (DPA)?
How does the vendor handle Data Subject Access Requests (DSARs)?
Usage Policies and Monitoring
Does the vendor enforce an Acceptable Use Policy (AUP) with a list of prohibited use cases?
Does the vendor actively monitor traffic patterns for abuse?
Are KYC (Know Your Customer) procedures in place for all customers?
What escalation paths exist for disputes or takedown requests?
Infrastructure Transparency
Is the proxy infrastructure shared or dedicated per customer?
Can the vendor demonstrate the physical provenance of its IP addresses (carrier, device type, location)?
Does the vendor provide API access with full logging and audit capabilities?
Failure to ask these questions turns your proxy vendor into a latent liability. Courts and regulators have made it clear: outsourcing data collection does not outsource compliance responsibility. If your vendor's IP sourcing is unethical, the enforcement action lands on your desk, not theirs.
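What "API access with full logging and audit capabilities" means in practice is simple to test: can your compliance tooling pull a complete request history from the vendor on demand? The sketch below assumes a hypothetical vendor endpoint, token, and response shape; the URL, parameters, and field names are placeholders, not any specific provider's documented API.

```python
import csv
import requests

# Hypothetical vendor endpoint and credentials -- substitute your provider's documented API.
AUDIT_ENDPOINT = "https://api.proxy-vendor.example/v1/request-logs"
API_TOKEN = "REPLACE_ME"

def export_request_logs(start: str, end: str, out_path: str = "proxy_audit.csv") -> int:
    """Pull the vendor's request logs for a date range and keep a local copy for auditors."""
    resp = requests.get(
        AUDIT_ENDPOINT,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"from": start, "to": end},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()  # assumed shape: list of {"timestamp", "target_host", "exit_ip", "bytes"}
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["timestamp", "target_host", "exit_ip", "bytes"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Example: export January's traffic ahead of a procurement or regulator review.
# export_request_logs("2026-01-01", "2026-01-31")
```

If a vendor cannot support an export along these lines, or can only hand over aggregate statistics, you will struggle to reconstruct what your scraping infrastructure actually did when a regulator or opposing counsel asks.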
Building Audit-Friendly Data Collection Pipelines
Compliance is not a checkbox exercise — it is an architectural decision. The strongest legal posture comes from pipelines that are designed to be audited, not retrofitted after a subpoena arrives.
A compliant data collection pipeline in 2026 includes these components:
Data provenance logging — Record the source URL, timestamp, proxy IP used, and robots.txt status for every request. This creates the audit trail regulators and courts expect (a minimal sketch of this and the next component follows this list).
Robots.txt compliance layer — Automatically check and respect robots.txt before every crawl. Log all directives encountered and decisions made. This is not legally required, but it is your strongest good-faith evidence.
PII detection and filtering — Implement automated scanning to identify and quarantine personal data before it enters downstream systems. Under GDPR, scraping personal data requires a lawful basis — "we did not know it was there" is not a defense.
Rate limiting and politeness controls — Aggressive scraping that degrades site performance can constitute a denial-of-service attack. Courts have ruled against scrapers who damaged server infrastructure.
Vendor audit integration — Your proxy provider should expose API-level logging that integrates with your compliance tooling. If your mobile proxy vendor cannot produce request logs on demand, that is a red flag.
Retention and deletion policies — Define how long scraped data is stored and implement automated deletion. GDPR's data minimization principle applies to scraped data identically to any other personal data.
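To make the first two components concrete, here is a minimal sketch of a provenance-logged fetch that checks robots.txt before each request and applies a politeness delay. The user agent, proxy URL, log format, and field names are illustrative assumptions, not a prescribed schema or any particular vendor's API.

```python
import json
import time
import urllib.robotparser
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from urllib.parse import urlsplit

import requests

USER_AGENT = "example-compliance-crawler/1.0"     # assumption: your registered crawler UA
PROXY_URL = "http://proxy.example.internal:8000"  # assumption: your proxy endpoint
CRAWL_DELAY_SECONDS = 2.0                         # politeness floor between requests

@dataclass
class ProvenanceRecord:
    """One audit-trail entry per request: source, time, proxy, robots.txt outcome."""
    url: str
    fetched_at: str
    proxy: str
    robots_allowed: bool
    status_code: int | None

def robots_allows(url: str) -> bool:
    """Fetch and evaluate the site's robots.txt for our user agent."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch_with_provenance(url: str, log_path: str = "provenance.jsonl") -> str | None:
    allowed = robots_allows(url)
    status, body = None, None
    if allowed:
        resp = requests.get(
            url,
            headers={"User-Agent": USER_AGENT},
            proxies={"http": PROXY_URL, "https": PROXY_URL},
            timeout=30,
        )
        status, body = resp.status_code, resp.text
    # Append one JSON line per request -- the audit trail regulators and courts expect.
    record = ProvenanceRecord(
        url=url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        proxy=PROXY_URL,
        robots_allowed=allowed,
        status_code=status,
    )
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")
    time.sleep(CRAWL_DELAY_SECONDS)  # politeness control between requests
    return body
```

In production you would also cache robots.txt per host and honor any Crawl-delay directive it declares, but the shape of the record is the point: every row answers where the data came from, when, through which exit IP, and whether the site permitted the access.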
The French CNIL's June 2025 guidance on web scraping for AI development explicitly states that legitimate interest under Article 6 of the GDPR requires a documented and concrete assessment of necessity and proportionality — with tailored safeguards. Organizations that cannot produce this documentation face enforcement risk regardless of whether the scraped data was technically "public."
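One concrete example of such a safeguard is the automated PII screening component from the list above. The sketch below is a deliberately simple, regex-based filter; the patterns, field handling, and quarantine approach are illustrative assumptions, and a production pipeline would pair a dedicated PII-detection library with human review before any quarantined data is processed further.

```python
import re

# Illustrative patterns only -- a real deployment needs broader coverage and review.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def screen_record(record: dict) -> tuple[dict, dict]:
    """Split a scraped record into clean fields and quarantined PII hits."""
    clean, quarantined = {}, {}
    for field, value in record.items():
        if isinstance(value, str) and any(p.search(value) for p in PII_PATTERNS.values()):
            quarantined[field] = value  # held for lawful-basis review, not fed downstream
        else:
            clean[field] = value
    return clean, quarantined

clean, held = screen_record({"title": "Quarterly pricing page", "contact": "sales@example.com"})
# `held` now contains the contact field; only `clean` flows into downstream systems.
```

The design choice that matters is the split itself: personal data never flows silently into downstream systems, and every quarantined field becomes a documented decision point in your lawful-basis assessment.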
Frequently Asked Questions
Is web scraping legal in 2026?
It depends on jurisdiction, method, and data type. In the United States, scraping publicly accessible data does not violate the CFAA (per the Ninth Circuit's hiQ v. LinkedIn rulings), but circumventing anti-bot systems may violate the DMCA (per Google v. SerpApi). Breaching a site's Terms of Service can result in contract liability. In the EU, scraping personal data requires a lawful basis under GDPR, and the AI Act adds disclosure and governance requirements for AI training data. There is no blanket "scraping is legal" or "scraping is illegal" answer — compliance is determined by how you scrape, what you scrape, and what you do with it.
Does the EU AI Act directly regulate web scraping?
Not directly. The AI Act regulates what happens after data collection — requiring training data disclosure, copyright opt-out compliance, and data governance for high-risk systems. However, scraping that feeds AI systems deployed in the EU must meet these downstream requirements, which effectively makes the collection method and data provenance auditable. Enforcement begins August 2, 2026, with penalties up to €35 million or 7% of global revenue.
Is ignoring robots.txt illegal?
Robots.txt is not legally binding on its own, and a federal court ruled in Ziff Davis v. OpenAI (2025) that it does not qualify as a DMCA-protected technological measure. However, ignoring robots.txt can undermine good-faith defenses in court, and combined with Terms of Service violations, it can support breach-of-contract claims. Under GDPR, European regulators view robots.txt compliance as evidence of data minimization and purpose limitation.
What should enterprises look for in a compliant proxy vendor?
Transparent IP sourcing (no opaque SDK consent chains), security certifications (ISO 27001, SOC 2), a published Acceptable Use Policy, KYC procedures, dedicated infrastructure per customer, and API-level logging. If a vendor cannot explain where their IPs come from or produce an audit trail for your requests, they are a compliance liability.
How does GDPR apply to scraping "public" data?
GDPR applies to all personal data regardless of whether it is publicly accessible. The "public" nature of data may support a legitimate interest argument under Article 6, but it does not eliminate obligations around transparency (Articles 12-14), data minimization, and purpose limitation. The CNIL fined KASPR €200,000 for scraping publicly available LinkedIn profiles. Poland fined a data broker €220,000 for scraping public business registries. "It was public" is not a compliance strategy.
Build Your Compliance-First Data Pipeline
The enterprises that avoid enforcement actions in 2026 will not be the ones that scraped less — they will be the ones that scraped responsibly, with auditable infrastructure, transparent proxy sourcing, and documented compliance at every layer.
Illusory provides bare-metal 5G mobile proxy infrastructure built for exactly this standard: dedicated hardware per customer, real carrier IPs from physical devices, instant IP rotation, unlimited API access, and full request logging. No shared pools. No SDK consent chains. No third-party residential devices in the mix.
View pricing or review our Terms of Service — and choose the proxy vendor your legal team will actually approve.