News from Mar 22, 2012
In recent years, more and more websites have started to embed structured data describing products, people, organizations, places, events, resumes, cooking recipes and other types of entities into their HTML pages using encoding standards such as Microformats, Microdata and RDFa.
We are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public.
WebDataCommons.org provides the extracted data for download in the form of RDF quads. In addition, we produce basic statistics about the extracted data.
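For readers unfamiliar with quads, the following is a minimal Python sketch of how one line of such data could be split into subject, predicate, object and source graph. It assumes the quads are serialized one per line in the N-Quads style, which this announcement does not state explicitly. The function name `parse_quad`, the sample line and the example.com URL are hypothetical, and the regular expression covers only the common term shapes rather than the full grammar.

```python
import re

# One RDF term: an IRI in angle brackets, a blank node label, or a quoted
# literal with an optional language tag or datatype IRI.
# Simplified on purpose; not a complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'

# Four terms separated by whitespace, terminated by a dot.
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line)
    if match is None:
        return None
    return match.groups()


# Hypothetical sample line: the fourth term names the page the data came from.
sample = '_:node1 <http://schema.org/name> "Example Product"@en <http://example.com/page.html> .'
subject, predicate, obj, graph = parse_quad(sample)
print(subject, predicate, obj, graph)
```

In this reading, the fourth element is what distinguishes a quad from a plain RDF triple: it records where the statement was found.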
So far, we have extracted data from two Common Crawl web corpora: one corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a second corpus consisting of 1.4 billion HTML pages dating from February 2012.
The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe 1.5 billion entities and originate from 19.1 million websites.
The February 2012 extraction resulted in 3.2 billion RDF quads which describe 1.2 billion entities and originate from 65.4 million websites.
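To put the two extractions side by side, the figures above can be turned into simple ratios. The short Python sketch below does only that arithmetic; the dictionary layout and the function name `ratios` are illustrative choices, and the inputs are the rounded figures quoted in this announcement, so the results are approximate.

```python
# Rounded figures as quoted in the announcement.
EXTRACTIONS = {
    "2009/2010": {"quads": 5.1e9, "entities": 1.5e9, "websites": 19.1e6},
    "February 2012": {"quads": 3.2e9, "entities": 1.2e9, "websites": 65.4e6},
}


def ratios(stats):
    """Average number of quads per entity and per website for one extraction."""
    return {
        "quads_per_entity": stats["quads"] / stats["entities"],
        "quads_per_website": stats["quads"] / stats["websites"],
    }


for name, stats in EXTRACTIONS.items():
    r = ratios(stats)
    print(
        f"{name}: {r['quads_per_entity']:.1f} quads per entity, "
        f"{r['quads_per_website']:.0f} quads per website"
    )
```

On these rounded inputs, the average comes to roughly 3.4 quads per entity and 267 quads per website for 2009/2010, against roughly 2.7 quads per entity and 49 quads per website for February 2012.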
More detailed statistics about the distribution of formats, entities and websites serving structured data, as well as growth between 2009/2010 and 2012, are provided on the project website.
It is interesting to see from the statistics that RDFa and Microdata deployment has grown a lot over the last few years, but that Microformat data still makes up the majority of the structured data embedded in HTML pages (measured both by the number of quads and by the number of websites).
We hope that Web Data Commons will be useful to the community by:
Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen (Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of Technology).
Lots of thanks to: