A queryable corpus of (almost) all the news in the world


I do many things, and sometimes the thing I do is work with researcher and journalist friends on various projects. Some of those projects involve news articles, because analyzing news is vital to many things. Whether for propaganda studies, media analysis, or just spotting cultural trends, news articles (especially in large quantities) are a good source of insight.

I will say that I have, over the course of half a decade, written or helped to write scrapers for a variety of news websites. For example, I contributed various scrapers for Italian news sites to the fundus framework, a piece of software that I really like working with. Some months ago, though, I thought I'd had enough of writing hack-ish code, and I decided that it was maybe time to solve the issue of news-article dataset procurement, possibly forever and for every (or at least most) edge case. So I did it.

First of all, I happened to know that there is a corpus of all the webpages of the internet (more or less) continuously scraped and well compiled into huge WARC files (like ZIP, but fancy and optimized for HTML), the Common Crawl. At some point I had also stumbled upon the fact that the people who make it also curate a subset of the crawl containing only news articles, CC-NEWS — something that is apparently not very well advertised. At time of writing, the main page documenting its existence is this 2016 announcement of its creation on the Common Crawl blog (maybe they want journalists covering them not to notice).

CC-NEWS, though, is huge: the scrapes from 2016 to 2026 (covering all ten years of its existence) total more than 100 TB of HTML when uncompressed. In its raw form, this dataset is pretty difficult for normal users to download, let alone query. Usually only a specific set of articles is needed for an analysis, and scavenging for them through 30k files named CC-NEWS-20240627144043-04810.warc.gz and the like would be rather inconvenient.

While thinking about how to handle that, I was reminded of another amazing thing I'd found some months before: infini-gram. Infini-gram is a research paper (with an accompanying code implementation) describing something that feels like magic: it lets you search keywords or sets of keywords (of arbitrary length!) in a corpus the size of an LLM pretraining dataset in sub-linear time, usually on the order of milliseconds. To achieve that, indexes are computed once and stored for querying. The indexes themselves are big, on the order of terabytes, but querying them is far faster than scanning the raw files, and they are easy to serve via something like a web API. The chart just below is that API, live: type any word (or three) and it sweeps all ten years of the corpus in a single query, then click a point to read real headlines from that year.

Figure 1. Keyword frequency across the corpus, year by year. Each curve is a single infini-gram-mini find() over all 1.36 billion articles, split into years by the index's shard map, so the count is exact (see note 1). Warm queries return in milliseconds; the first after an idle spell takes a few seconds (the real latency is reported under the chart). Toggle linear/log and raw counts vs. per-million-articles; click any point for real headlines from that year. Watch covid go vertical in 2020 and ukraine spike in 2022.
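For the curious, here is a minimal sketch of the bookkeeping behind the per-year split and the per-million toggle. It is not the code behind the chart: the function names and every number in it are made up for illustration, and it only assumes that each shard belongs to exactly one year.

```python
from collections import defaultdict


def yearly_counts(shard_counts, shard_years):
    """Sum per-shard match counts into per-year totals."""
    totals = defaultdict(int)
    for shard_id, count in shard_counts.items():
        totals[shard_years[shard_id]] += count
    return dict(totals)


def per_million(counts_by_year, articles_by_year):
    """Normalise raw match counts to matches per million articles."""
    return {
        year: 1_000_000 * count / articles_by_year[year]
        for year, count in counts_by_year.items()
    }


# Hypothetical numbers, for illustration only (not real corpus statistics).
shard_years = {0: 2019, 1: 2020, 2: 2020}      # shard id -> year it covers
shard_counts = {0: 40, 1: 900, 2: 1100}        # shard id -> matches found
articles_by_year = {2019: 2_000_000, 2020: 4_000_000}

raw = yearly_counts(shard_counts, shard_years)
rate = per_million(raw, articles_by_year)
print(raw)   # {2019: 40, 2020: 2000}
print(rate)  # {2019: 20.0, 2020: 500.0}
```

Normalising matters because the number of articles per year is not constant: a raw count can rise simply because more articles were crawled that year.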

After connecting the dots and making a very bad prototype in one night, I made a more concrete plan and then started implementing it with Kirill, a colleague at my research lab who thought that this could be, if not a good idea, at least a fun project. He also usually needs commercial datasets for news research, so he got excited about replacing those with an open option.

In short, what we wanted to achieve was: a version of the corpus cleaned and enriched with useful per-article information — things downstream people care about — plus a set of infini-gram indexes (infini-gram mini indexes, actually, the newer implementation that cuts down their size) so the whole corpus could be queried fast.
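To give a feel for why such an index makes counting cheap, here is a toy FM-index in plain Python. It is a teaching sketch, not how infini-gram mini is implemented: the function names are mine, the suffix array is built naively, and nothing here would scale past a few kilobytes. What it does show is the key property: once the index exists, counting a keyword takes one step per character of the keyword, regardless of how long the text is.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table) for `text`."""
    text += "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for a real corpus.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in suffix_array]
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` by backward search over the index."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Two made-up "articles", concatenated into one searchable text.
articles = [
    "covid cases rise, covid tests fall",
    "ukraine talks resume, covid again",
]
index = build_fm_index("\n".join(articles))

print(count_occurrences(index, "covid"))      # 3
print(sum("covid" in a for a in articles))    # 2
```

The two printed numbers differ on purpose: the index counts every occurrence, so a term that appears twice in one article is counted twice, while only two articles contain it.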

I had worked with big datasets before (like this one!) but nothing of this size and scope. I tried to pick a stack rooted in what I knew from the LLM-pretraining-dataset processing literature, partly because I could learn a thing or two with some practice. In the end we picked trafilatura to clean the articles and pull out metadata (author, publish date); three language taggers, each with its own perk (GlotLID for its coverage, CommonLingua for its performance, and lingua for its short-text optimization); and this RoBERTa model for tagging article topics, optimized for multilingual articles and using the standard International Press Telecommunications Council (IPTC) categories. The sampler below draws a real random article and shows exactly these enrichments attached to it. Hit “draw another” a few times to get a feel for the corpus.

Figure 2. One real article, pulled live from a random point in the corpus, shown with the enrichment sidecar the pipeline writes for every document: detected language, the publishing site, author, source URL and crawl metadata.

Technically, downloading, cleaning and enriching the corpus was not too straightforward, but luckily our lab cluster provided more than enough storage, CPUs and GPUs. After optimizing all the steps and parallelizing what was possible, I found one of the biggest bottlenecks to be read/write speed: our big storage is an NFS mount and its hardware is apparently faulty; this has caused quite a few headaches and one major crisis (the whole mount stopped working the night before a deadline, three times).
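One small trick that helps with the download step: you can decide which files to fetch from their names alone. The sketch below assumes that the 14-digit run in a name like CC-NEWS-20240627144043-04810.warc.gz is a YYYYMMDDHHMMSS timestamp; the function names and the example list are hypothetical, and this is not our actual download code.

```python
import re
from datetime import datetime

# Assumed pattern: CC-NEWS-<14-digit timestamp>-<5-digit serial>.warc.gz
NAME_RE = re.compile(r"CC-NEWS-(\d{14})-(\d{5})\.warc\.gz$")


def crawl_time(name):
    """Return the timestamp embedded in a WARC file name, or None."""
    match = NAME_RE.search(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def files_between(names, start, end):
    """Keep the names whose embedded timestamp falls in [start, end)."""
    selected = []
    for name in names:
        stamp = crawl_time(name)
        if stamp is not None and start <= stamp < end:
            selected.append(name)
    return selected


# Hypothetical listing; only the first name appears in this post.
names = [
    "CC-NEWS-20240627144043-04810.warc.gz",
    "CC-NEWS-20240701000000-04900.warc.gz",
    "not-a-warc-file.txt",
]
june_2024 = files_between(names, datetime(2024, 6, 1), datetime(2024, 7, 1))
print(june_2024)  # ['CC-NEWS-20240627144043-04810.warc.gz']
```

Keep in mind that this selects by crawl time, which is close to but not the same as publication date.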

Figure 3. Exact monthly composition of the corpus, by detected language and by IPTC topic: every one of the 1.36 billion articles counted, computed once with DuckDB straight over the parquet (not sampled; see note 1). Toggle language/topic and count/share; hover for that month's breakdown. English's share visibly shrinks as the corpus grows; switch to by topic to watch health and conflict coverage swell around 2020.

The end result is this corpus, the biggest and most complete news dataset readily available on Hugging Face (already at ~30k monthly downloads despite virtually no publicity), and its sister indexes for super-fast querying. We have also built an API that lets end users (researchers and journalists) query the dataset and create subdatasets within seconds, without technical expertise. We are currently working with our university to figure out how to deploy it in a way that doesn't violate its cybersecurity policies. In the meantime the public endpoint behind every live widget in this post is, quite literally, a Raspberry Pi with an external hard drive bolted on, sitting on a desk, so the charts here are real but unhurried: give a cold query a few seconds and it will answer. For those interested in the technical details, here is the preprint we wrote with our PI Jana Lasser to present the dataset to the scientific community.

Figure 4. The subdataset builder. Pick a keyword; the count comes straight from /api/v1/count, and the snippet is the exact code to reconstruct that slice from the API — the thing we're getting cleared for outside-the-university access. Until that clearance lands it answers only from the campus network.
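If you want a rough idea of what calling the count endpoint could look like from Python, here is a hedged sketch. Only the host, the /api/v1/count path and the index name ccnews come from this post; the https scheme, the query and index parameter names, and the JSON response are assumptions of mine, so treat it as a shape to adapt rather than the snippet the widget generates.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from this post; the https scheme is an assumption.
BASE_URL = "https://infini-news.uni-graz.at"


def count_url(keyword, index="ccnews"):
    """Build a count request URL.

    The `query` and `index` parameter names are assumed, not documented here.
    """
    params = urlencode({"query": keyword, "index": index})
    return f"{BASE_URL}/api/v1/count?{params}"


def fetch_count(keyword, index="ccnews", timeout=30):
    """Send the request and return the decoded body (assumed to be JSON)."""
    with urlopen(count_url(keyword, index), timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access.
print(count_url("climate change"))
```

Calling fetch_count will only work from a network the endpoint answers to, and a cold query may need the generous timeout.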

1 Counts are token-match counts from the FM-index, not article counts: a term that appears twice in one article counts twice. The per-year split is exact, read from the index's shard map (each of the 117 shards carries its year), not a sample. The x-axis is crawl date (warc_date), close to but not identical to publication date; 2016 is partial (CC-NEWS starts that August), so its point sits low until you switch to per-million.

  Live data: infini-news.uni-graz.at · infini-gram-mini FM-index · index ccnews · 1,357,027,742 articles.