Openai/69408af5-1bf8-800f-b5dc-86cd01e07ec0
=== Assistant: Short answer: no, frontier labs are not simply ignoring infringement risk and scraping everything indiscriminately, but… ===

Short answer: no, frontier labs are not simply ignoring infringement risk and scraping everything indiscriminately, but neither are they bound by the ruleset people intuitively imagine. What changed after the infringement claims is how data is sourced, processed, and justified, not whether models continue to scale. Here is the breakdown.

==== Where the large post-2023 increases in token counts come from ====

===== Synthetic data =====
Models generate their own training material: paraphrases, explanations, step-by-step reasoning, code variants, and question–answer pairs.
* Synthetic data can be orders of magnitude larger than human-written corpora
* This data is legally clean because it is model-generated, not copied
This is now a major component of frontier training.

===== Reuse of existing datasets =====
The same underlying datasets are:
* Re-tokenized
* Reweighted
* Re-mixed
* Re-paraphrased
* Re-sequenced
A single document can yield many distinct training samples without adding new source material. Token count ≠ number of unique documents.

===== Licensed and first-party data =====
* Explicit licensing deals (publishers, news organizations, code repositories, image libraries)
* Proprietary internal corpora (Google, Microsoft, and Amazon hold massive first-party text, logs, docs, and manuals)
* Paid human-created datasets at scale
These sources expanded substantially after the lawsuits began.

==== The legal theory: transformative fair use ====
Frontier labs are relying heavily on transformative fair use, specifically:
* Training ≠ storing or redistributing text
* Models do not retain retrievable copies of works
* Outputs are probabilistic transformations, not reproductions
This position has not been conclusively tested at the Supreme Court level, but multiple lower-court rulings and decades of precedent from search engines, indexing, and machine learning currently favor this interpretation.
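The data-reuse point above (one document, many training samples) can be illustrated with a minimal sketch. Everything here is hypothetical: real pipelines use actual tokenizers and far more elaborate mixing, but the mechanism is just sliding windows with different sizes and strides over the same token sequence.

```python
# Minimal sketch (hypothetical parameters): how one document can yield
# several distinct training samples through re-sequencing alone.

def chunk_tokens(tokens, window, stride):
    """Slide a fixed-size window over a token sequence."""
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]

# Stand-in for a real tokenizer: one "token" per word.
document = "the quick brown fox jumps over the lazy dog".split()

samples = []
for window, stride in [(4, 2), (6, 3)]:  # two re-mixing configurations
    samples.extend(chunk_tokens(document, window, stride))

print(len(samples))  # 5 samples from a single 9-token document
```

Multiply this across re-weighting, paraphrase passes, and epoch repeats, and reported token counts grow far faster than the underlying pool of unique documents.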
Labs are betting that training is legally analogous to reading, not copying.

==== Data hygiene, compared to early GPT-3 era scraping ====
* Copyright filters are stricter
* Deduplication is aggressive
* Memorization tests are routine
* Regurgitation benchmarks are enforced
* Opt-out signals are increasingly honored
The irony: today's models are less likely to reproduce copyrighted text verbatim than earlier ones, despite being larger.

==== What modern token counts actually include ====
* Image tokens
* Audio tokens
* Video frame tokens
* Sensor embeddings
* Structured metadata
These are often licensed, first-party, synthetic, or only weakly copyrightable. So "15T tokens" does not mean "15T copyrighted paragraphs."

==== Why labs accept the legal uncertainty ====
They are not reckless, but they are accepting uncertainty because:
* The upside of frontier capability is enormous
* The legal landscape is unsettled
* Waiting for perfect clarity means losing the race
* Precedent historically favors transformative technologies
This is similar to:
* Search engines indexing the web
* DVRs recording TV
* Web browsers caching pages
* Music sampling before clear licensing norms
In each case, companies moved first and the courts caught up later.

==== Why token counts keep growing anyway ====
Because data scale is no longer bottlenecked by raw copyrighted text. The bottlenecks now are:
* Compute
* Energy
* Alignment
* Data quality, not quantity
Once synthetic and licensed pipelines matured, token counts could continue growing without proportional legal exposure.
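The memorization and regurgitation tests mentioned earlier can be sketched as an n-gram overlap check: compare model output against a known source corpus and flag long verbatim spans. This is an illustrative toy, not any lab's actual benchmark; the corpus, threshold `n`, and examples are all made up.

```python
# Minimal sketch (illustrative only): flag model output that shares long
# n-grams verbatim with a known source corpus.

def ngrams(words, n):
    """All contiguous n-word windows of a word list, as a set."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def regurgitation_rate(output, corpus, n=5):
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    out = ngrams(output.split(), n)
    if not out:
        return 0.0
    src = ngrams(corpus.split(), n)
    return len(out & src) / len(out)

corpus = "it was the best of times it was the worst of times"
copied = "it was the best of times"
novel = "models transform text rather than copy it word for word"

print(regurgitation_rate(copied, corpus))  # 1.0 - fully verbatim
print(regurgitation_rate(novel, corpus))   # 0.0 - no 5-gram overlap
```

Production systems layer fuzzier matching (hashing, normalization, near-duplicate detection) on top of this idea, but the pass/fail logic is the same: high verbatim overlap means the sample or the model output gets filtered.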
==== Bottom line ====
* No, labs are not just scraping everything and hoping for the best
* Yes, they are relying on fair use, licensing, synthetic data, and transformation
* Token growth today does not mean proportional growth in copyrighted human writing
* The legal fight is real, but the technical strategy has already adapted around it

If you want, next we can:
* Separate copyright risk by data type
* Analyze whether fair use will actually hold up
* Compare AI training to music sampling law
* Examine what a "worst-case legal loss" would realistically force labs to do
Just say the direction.