Salesforce sued by authors for training AI on thousands of pirated books: Report

Two authors have filed a proposed class action in San Francisco, alleging that Salesforce used pirated books to train its xGen models.

Friday October 17, 2025

Salesforce has been hit with a proposed class action in the United States after two novelists alleged that the company used thousands of pirated books, including their own, to train its xGen language models without permission.

The complaint was filed on 15 October 2025 in the U.S. District Court for the Northern District of California by authors E. Molly Tanzer and Jennifer Gilmore, and accuses the cloud software company of copyright infringement.

The lawsuit alleged that, to develop and train its xGen models, Salesforce copied and used large datasets, including EleutherAI’s The Pile and the RedPajama corpus, that contain the Books3 collection of approximately 196,000 titles scraped from a private ebook tracker.

The filing stated that the authors’ works sit within those corpora and that Salesforce has continued to store, process and use them. It sought damages, attorneys’ fees and destruction of infringing copies.

Case details and legal posture

The action was brought in the Northern District of California, San Francisco Division, and sought class certification on behalf of similarly situated authors.

The plaintiffs, represented by the Joseph Saveri Law Firm, requested statutory and other damages and an order requiring Salesforce to dispose of copies made or used in violation of their rights.

Salesforce’s engineering blog described xGen as trained on “2 trillion+ tokens” drawn from public sources such as Common Crawl and GitHub, and characterised its pre‑training dataset as “legally compliant” following collaboration with the company’s legal and ethics teams.

The complaint cited those claims as inconsistent with the alleged use of shadow‑library book datasets.
