Harvard University, in collaboration with Google, has released a dataset of 1 million public domain books to train the next generation of artificial intelligence.
The books span a range of genres and languages and include works by authors such as Dickens, Dante, and Shakespeare that are no longer protected by copyright due to their age. The initiative comes at a time when AI training data is expensive to obtain, which favors big-money tech companies.
Harvard University gets financial support from tech giants
According to an article published by TechCrunch, the project is being led by Harvard University’s Institutional Data Initiative (IDI). The dataset contains books sourced from Google’s long-running book scanning project, Google Books.
Other books in the dataset include Czech mathematics books and Welsh pocket dictionaries.
The university teased IDI back in March, outlining its plans to create a “trusted channel for legal AI data.” Little was heard about it after that until Thursday’s official launch, with tech giants Microsoft and OpenAI funding the project.
The dataset isn't exclusive to Silicon Valley. IDI has opened it up to anyone, from research labs to AI startups looking to train their own large language models.
By opening the dataset to anyone, Lippert said, the initiative aims to level the playing field at a time when the cost of training AI remains prohibitive for small companies and model development remains the preserve of companies with huge budgets.
Lippert added that the dataset is "carefully reviewed," which according to Fudzilla means that someone has checked to make sure the Bard is actually dead and gone.
The Harvard dataset will need more resources
According to Lippert, who compared the dataset’s potential to Linux, the open-source operating system, the success of the Harvard dataset will depend on a number of variables. Its success will require more resources, expertise and “a sprinkle of magic” from the same deep-pocketed companies the initiative was designed to challenge, Lippert said.
The million books in the dataset were scanned as part of the Google Books program. Fudzilla describes the initiative as a digital time capsule from when Google’s ambitions to scan every book seemed outlandish rather than dystopian.
Lippert is nonetheless optimistic about the project's potential uses, suggesting that it could be a treasure trove for training AI models for everyone from garage startups to corporate conglomerates.
While some have hailed the initiative as a revolutionary leap forward in the democratization of AI, Fudzilla argues that some may see it as a subtle way to ensure that any ambitious startup with a few terabytes of server space can now compete in the race to develop the next ChatGPT.
Those startups will still need more resources to compete and make a dent in the market. ChatGPT launched in November 2022 and was an instant success, spurring a worldwide race to build generative AI models. Developing these models has in turn created a thirst for data to improve them, raising questions about how much information developers can obtain without stealing it.
So far, publishers including the Wall Street Journal's parent company and the New York Times have sued AI companies such as OpenAI and Perplexity for using their content without permission.