As avid communicators, we share stories, songs, ideas, and observations – changing with time and audience, but reasonably described in terms of linear versions. Where there is a wide range of derivative works, remixes, and crossovers, archiving, versioning, and tracking creation history become difficult (still an unsolved challenge in music, performance, and literature, hidden in part by oversimplified assumptions about authorship in modern norms and law).
For larger collections and compilations, from concordances and databases to codebases and long-term studies, versioning is more interesting.
Some repositories focus on datasets and their analyses, borrowing from both version-control and file-repository tools.
Open source hosted repositories:
- Dataverse – offers a way to inspect and query raw, uncompressed datasets in a folder without downloading them elsewhere. Offers both a globally hosted solution (the Harvard Dataverse) and repository software you can run locally in a datacenter of your choice (55 dataverses worldwide). A minimal query sketch follows this list.
- Zenodo – offers a way to host and semi-permanently archive very large files and datasets, up to 10 TB. Maintained by OpenAIRE and CERN. A second sketch below shows a query against its records API.
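
To make the "query without downloading" idea concrete, here is a minimal sketch against Dataverse's public Search API. The instance URL, the search term, and the response fields used (`data.items`, `name`, `global_id`) follow the documented API examples but are assumptions here and may differ between Dataverse versions.

```python
"""Sketch: searching the Harvard Dataverse over its Search API."""
import requests

BASE = "https://dataverse.harvard.edu"   # hosted instance; any Dataverse server works
QUERY = "coral reefs"                    # hypothetical search term

resp = requests.get(
    f"{BASE}/api/search",
    params={"q": QUERY, "type": "dataset", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # Each hit carries a persistent identifier (usually a DOI) that can be
    # passed to the datasets API to inspect file listings and metadata
    # without downloading the whole dataset.
    print(item.get("global_id"), "-", item.get("name"))
```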
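Zenodo exposes a similar public REST API for its records. The query string below is a placeholder, and the response layout (`hits.hits`, `metadata.title`, per-file `size`) follows the documented examples for open-access records; treat it as a sketch rather than a stable contract.

```python
"""Sketch: listing matching Zenodo records and their total file sizes."""
import requests

resp = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "ocean temperature", "size": 5},   # hypothetical query
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    meta = hit["metadata"]
    # Open-access records list their files inline; sizes are in bytes.
    total_bytes = sum(f.get("size", 0) for f in hit.get("files", []))
    print(f'{meta["title"]} ({total_bytes / 1e9:.1f} GB)')
```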
Other hosted repositories:
- GitHub itself
- dat (peer-to-peer data sharing)
Other places that gather and organize data (not self-serve):
- datahub.io
- data.world – commercial
Partial alternatives, with fewer guarantees of maintenance and fewer shared tools for collaboration, include self-hosting (made discoverable through Google Dataset Search; see the markup sketch below) and institutional data repositories (often preserving one-time uploads of data associated with research).
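
For the self-hosting route, discoverability typically comes from embedding schema.org Dataset markup in the dataset's landing page, which is what Google Dataset Search crawls. A minimal sketch follows; all names, URLs, and values are placeholders, and only the property names come from the schema.org vocabulary.

```python
"""Sketch: schema.org Dataset markup for a self-hosted dataset."""
import json

dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example long-term observation dataset",        # placeholder
    "description": "Hypothetical description of the dataset.",
    "url": "https://example.org/data/observations",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/observations.csv",
    }],
}

# Embed this block in the dataset's landing page so crawlers can find it.
print('<script type="application/ld+json">')
print(json.dumps(dataset_markup, indent=2))
print("</script>")
```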