Software Heritage

Software Heritage is a non-profit organization that provides a service for archiving and referencing historical and contemporary software, with a focus on human-readable source code. The site was unveiled in 2016 by Inria and is supported by UNESCO. The project itself is structured as a non-profit multistakeholder initiative.

Overview
The stated mission of Software Heritage is to collect, preserve and share all software that is publicly available in source code form, with the goal of building a common, shared infrastructure at the service of industry, research, culture and society as a whole.

Software source code is collected by crawling code hosting platforms, such as GitHub, GitLab.com or Bitbucket, and package archives, such as npm or PyPI. It is then ingested into a special data structure, a Merkle DAG, that is the core of the archive. Each artifact in the archive is associated with an identifier called a SWHID. In 2023, the expansion of SWHID was changed from "Software Heritage identifier" to "software hash identifier".
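The link between artifact content and identifier can be illustrated with a minimal sketch. It assumes version 1 of the SWHID scheme for file contents, where the hash is computed the same way Git hashes a blob (SHA-1 over a "blob <length>" header, a NUL byte, and the raw bytes). The function name content_swhid is hypothetical and is not part of any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a version 1 content identifier for raw file bytes.

    Assumption: the digest follows the Git blob convention, i.e.
    SHA-1 over b"blob <length>\\0" followed by the bytes themselves.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# Identical bytes always yield the same identifier, so a file that
# appears in many projects maps to a single node in the graph.
print(content_swhid(b""))
print(content_swhid(b"hello world\n"))
```

Because the identifier depends only on the bytes, two copies of the same file found on different hosting platforms resolve to one entry, which is the deduplication property a Merkle DAG relies on.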

To increase the chances of preserving the Software Heritage archive over the long term, a mirror program was established in 2018, which ENEA and FossID had joined as of October 2020.

History
Development of Software Heritage began at Inria under the direction of computer scientists Roberto Di Cosmo and Stefano Zacchiroli in early 2015, and the project was officially announced to the public on June 30, 2016.

In 2017, Inria signed an agreement with UNESCO for the long-term preservation of software source code and for making it widely available, in particular through the Software Heritage initiative.

In June 2018, the Software Heritage Archive was opened at UNESCO headquarters.

On July 4, 2018, Software Heritage was included in the French National Plan for Open Science.

In October 2018, the strategy and vision underlying the mission of Software Heritage were published in Communications of the ACM.

In November 2018, a group of forty international experts met at the invitation of Inria and UNESCO, which led to the publication in February 2019 of Paris Call: Software Source Code as Heritage for Sustainable Development.

In November 2019, Inria signed an agreement with GitHub to improve the archival process for GitHub-hosted projects in the Software Heritage archive.

As of October 2020, Software Heritage’s repository held over 143 million software projects in an archive of over 9.1 billion unique source files.

Funding
Software Heritage is a non-profit organization funded largely by donations from sponsors, which include private companies, public bodies and academic institutions.

Software Heritage also seeks funding to support third parties interested in contributing to its mission. A grant from NLnet funded the work of Octobus and Tweag that led to the rescue of 250,000 Mercurial repositories phased out by Bitbucket.

A grant from the Alfred P. Sloan Foundation funds experts to develop new connectors for expanding the coverage of the Software Heritage Archive.

Development and community
The Software Heritage infrastructure is built transparently and collaboratively. All the software developed in the process is released as free and open-source software. An ambassador program was announced in December 2020 with the stated goal of growing the community of users and contributors.

Awards
In 2016, Software Heritage received the best community project award at the Paris Open Source Summit.

In 2019, Software Heritage received the Academic Initiative award from the Pôle Systematic.