Books3 is the definition of “not publicly available” because it’s all pirated material downloaded from the private torrent tracker Bibliotik.

Books3 is literally why several AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

Books3 was always illicitly obtained material, which calls into question whether an LLM trained on it could really fall under Fair Use. (It most likely does, but it’s still a legal question that hasn’t been answered yet.)

Books3 Link: https://huggingface.co/datasets/the_pile_books3

Books3 Description from Link:

This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.

