I understand the impulse to introduce UUIDs for schemas but I think that’s trying to solve a problem we don’t actually have. This is a sketch of one way to do schema and collection management that avoids them and (I think?) has all the properties we want.
Here’s what I imagine:
Schemas are developed and versioned entirely separately from collections. In fact let’s say that the entire schema world is R1-specific and is completely outside of “the Underlay Protocol”. The only thing that’s part of the Underlay protocol is the compiled collection, which we’ll get to later.
Half of R1 is dedicated to schemas. You can browse schemas, create a new schema, edit it, and release versions of it. Versions have semver tags. Maybe it uses Git under the hood, since schemas are text-based, with semver release tags, just like Go modules. But however we do it, R1 is responsible for "implementing" versioning, so that R1 can resolve versioned schema URLs like http://r1.underlay.org/schemas/baylor/foo/v0.3.1. We don't even necessarily need an API for interacting with schemas (although using Git would make this easy!).
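To make the resolution step concrete, here's a rough TypeScript sketch of how R1 might pick apart a versioned schema URL before looking the version up (say, as a Git release tag). The path layout is just the one from the example above, not a settled spec:

```typescript
// Hypothetical sketch: parse an R1 schema URL like
// http://r1.underlay.org/schemas/baylor/foo/v0.3.1 into its parts,
// so R1 can resolve the version (e.g. as a Git release tag).
interface SchemaRef {
  org: string
  name: string
  version: string // semver, without the leading "v"
}

function parseSchemaURL(url: string): SchemaRef {
  const { pathname } = new URL(url)
  const match = pathname.match(/^\/schemas\/([^/]+)\/([^/]+)\/v(\d+\.\d+\.\d+)$/)
  if (match === null) {
    throw new Error(`Not a versioned R1 schema URL: ${url}`)
  }
  const [, org, name, version] = match
  return { org, name, version }
}

// parseSchemaURL("http://r1.underlay.org/schemas/baylor/foo/v0.3.1")
// => { org: "baylor", name: "foo", version: "0.3.1" }
```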
Schema imports are R1 schema URLs with an exact version like this:
[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
version = "4.2.0"
[[import]]
url = "http://r1.underlay.org/schemas/baylor/crackle"
version = "0.3.1"
[[import]]
url = "http://r1.underlay.org/schemas/emerson/pop"
version = "45.0.0"
Or maybe even like this:
import = [
"http://r1.underlay.org/schemas/baylor/snap@v4.2.0",
"http://r1.underlay.org/schemas/baylor/crackle@0.3.1",
"http://r1.underlay.org/schemas/emerson/pop@45.0.0"
]
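If we went with the compact form, going back to explicit { url, version } pairs is trivial. A quick sketch, just assuming the "@" convention from the example above (it tolerates an optional leading "v" on the version, since the examples show both):

```typescript
// Hypothetical sketch: split "http://.../schemas/baylor/snap@v4.2.0" into
// a { url, version } pair, tolerating an optional leading "v".
function parseImport(entry: string): { url: string; version: string } {
  const at = entry.lastIndexOf("@")
  if (at === -1) {
    throw new Error(`Import is missing a version: ${entry}`)
  }
  const url = entry.slice(0, at)
  const version = entry.slice(at + 1).replace(/^v/, "")
  return { url, version }
}

// parseImport("http://r1.underlay.org/schemas/baylor/snap@v4.2.0")
// => { url: "http://r1.underlay.org/schemas/baylor/snap", version: "4.2.0" }
```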
Schemas are just like Git repos and don't have permanent identity. Maybe someone renames their schema so that it lives at a new URL. They can do that, and the old URL just no longer resolves.
Okay, so how do schemas integrate with the collection lifecycle?
Pretend that collection.json has this minimal format:
type Collection = {
  name: string
  schema: { url: string; version: string }[]
  provenance: string
  assertions: string[]
}
where assertions is a set of hashes (not important here).
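To make that concrete, here's a made-up instance (the name, hashes, and provenance value are placeholders, not real content addresses):

```typescript
// A hypothetical collection.json, just to show the shape.
const example: Collection = {
  name: "baylor/cereal-supply-chains",
  schema: [
    { url: "http://r1.underlay.org/schemas/baylor/snap", version: "4.2.0" },
    { url: "http://r1.underlay.org/schemas/emerson/pop", version: "45.0.0" },
  ],
  provenance: "placeholder-provenance-hash",
  assertions: ["placeholder-assertion-hash-1", "placeholder-assertion-hash-2"],
}
```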
Like package.json, collection.json is a file that is not typically edited manually, and is instead edited through a GUI on R1 or through command-line tools (let's focus on only these two interfaces for now). "Editing" for our purpose here is just adding, removing, or updating schema imports.
Every time that collection.json is edited, the compiled collection file collection.nq (or whatever) gets re-compiled. collection.nq is an RDF dataset that has the collection metadata like its name and its member assertions, and also a compiled collection schema.
A “compiled collection schema” is a canonical RDF representation of the schema import tree. In other words, to compile the schema tree, you:
- recursively fetch all the imported URLs and hash the (TOML) results
- if the tool has a cached copy of a URL it can just use that
- overwrite re-defined labels in order
- get a canonical RDF representation of it all in a “schema schema”
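Here's a rough TypeScript sketch of that loop. None of these helpers exist yet (parseSchema is hypothetical, and I'm assuming imports are resolved depth-first, with an importing schema's own labels overwriting anything its imports defined); it also stops short of the final step, serializing the merged labels into the canonical "schema schema" RDF:

```typescript
import { createHash } from "crypto"

// Hypothetical helper: parse a TOML schema source into its label
// definitions plus its own import list. This doesn't exist yet.
declare function parseSchema(toml: string): {
  labels: Map<string, unknown>
  imports: { url: string; version: string }[]
}

interface CompiledImport {
  url: string
  version: string
  hash: string // content hash of the fetched TOML text
  imports: CompiledImport[]
}

// Recursively fetch every imported schema, hash the TOML text, and fold
// re-defined labels in import order (later definitions overwrite earlier ones).
async function compileSchemaTree(
  imports: { url: string; version: string }[],
  labels: Map<string, unknown> = new Map(),
  cache: Map<string, string> = new Map()
): Promise<{ labels: Map<string, unknown>; tree: CompiledImport[] }> {
  const tree: CompiledImport[] = []
  for (const { url, version } of imports) {
    const key = `${url}@${version}`
    // use a cached copy of this exact version if we already fetched it
    const toml =
      cache.get(key) ?? (await fetch(`${url}/v${version}`).then((res) => res.text()))
    cache.set(key, toml)
    const hash = createHash("sha256").update(toml).digest("hex")
    const schema = parseSchema(toml)
    // depth-first: resolve this schema's own imports before applying its labels
    const child = await compileSchemaTree(schema.imports, labels, cache)
    for (const [label, shape] of schema.labels) {
      labels.set(label, shape) // overwrite re-defined labels in order
    }
    tree.push({ url, version, hash, imports: child.tree })
  }
  return { labels, tree }
}
```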
What’s especially nice about this is that we can put the tree of TOML sources in the provenance of the compiled collection.
Specifically, the compiled collection gets put into a named graph, and this named graph is an entity in the default graph with this schema:
[shapes.CompiledCollection]
generatedAt = "dateTime"
derivedFrom = { kind = "uri" } # the hash of collection.json
[shapes.CompiledCollection.Imports]
kind = "reference"
label = "SchemaImport"
cardinality = "any"
[shapes.SchemaImport]
# unfortunately cardinality="any" values are not sorted,
# but the compiled schema is order-sensitive,
# so we have to include explicit indices
index = "integer"
version = "string"
# this is the regular schema URL
url = { kind = "uri" }
# this is a content-addressed URI for the TOML text of that version
text = { kind = "uri" }
[shapes.SchemaImport.Imports]
kind = "reference"
label = "SchemaImport"
cardinality = "any"
(look! it’s our schema language in action!!)
So here, the named graph that contains the compiled schema will appear as an instance of the CompiledCollection shape in the default graph, along with a tree of SchemaImport instances that reference the exact hashed TOML that was used to produce it.
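Very roughly, each of those provenance records carries the same information as this TypeScript shape (the field names just mirror the TOML above; the real thing is RDF):

```typescript
// Rough TypeScript analogue of the SchemaImport / CompiledCollection shapes,
// only to make the tree structure explicit.
interface SchemaImportRecord {
  index: number // explicit position, since the compiled schema is order-sensitive
  version: string
  url: string // the regular schema URL
  text: string // content-addressed URI of the exact TOML text that was used
  imports: SchemaImportRecord[]
}

interface CompiledCollectionRecord {
  generatedAt: string // xsd:dateTime
  derivedFrom: string // content-addressed URI of collection.json
  imports: SchemaImportRecord[]
}
```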
This is all down in the weeds, but the takeaway is that:
- Every time you edit a collection you regenerate a compiled collection
- The compiled collection has a copy of the schema import tree, including every URL, version, and hash
- The compiled collection is included in every published version of the collection
- The TOML sources of all the imported schemas are included in every published version of the collection
(to clarify: the compiled collection dataset has all of the metadata in collection.json, and lists all of the included assertions by hash, but does not literally contain those assertions, and does not literally contain the source TOML text)
Okay, how does this design play with different day-to-day workflows?
- Someone downloads a collection for local use. Great! All the imported schemas are bundled in this collection so they can look at them if they want, and verify that the assertions all validate by generating a compiled collection themselves.
- Someone wants to upgrade a collection they have locally. Great! You just fetch the new collection. There’s nothing to update yet because for now versions aren’t related to each other very strongly, they’re just different collections.
- Someone creates a new collection on R1. To do this you need a schema! Browse schemas on r1.underlay.org and pick some you like to get started. Or create your own.
- Someone creates a new schema on R1. Cool - maybe you want to import some existing ones (at specific versions).
- Someone updates a schema on R1. Great - make the edits and publish a new version.
- Someone upgrades an import version in a schema. Sure - this is no different than any other edit, but you’ll obviously want to review the changes and make sure that you actually want to do this. You’re changing a schema, after all.
- Someone wants to change the schema of their collection. Absolutely. But the specifics matter a lot here. If the new schema version only added new types without changing any of the old ones, you can just set the new version in collection.json and all your existing assertions will still validate. But if there are any edits to types that you're using, you'll need a migration, which can be a huge undertaking depending on the specific kinds of changes involved. We don't have any migration tools yet, so that basically means re-generating all of the assertions (from whatever workflow / pipeline you got them from in the first place). This cannot be automated for now and will require human attention.
- Someone wants to upgrade a schema import version in a collection. Okay, fetch the new version. But this is no different than any other edit - if there are any changes to old types, it will require a migration and you'll need to do manual work to re-generate your entire collection.
- Someone wants to know if newer versions of a schema are available. Sure, R1 will have some HTTP API where you can ask what versions of a schema at a URL are available (a rough sketch follows after this list). Or, to begin with, you just do this in the GUI. Maybe this 404s because the schema moved or got deleted! That means you have to track down where it is and enter the new URL yourself. Maybe R1 uses UUIDs internally to give you informative 301 / 410 responses.
- Someone renames their organization or schema. Okay. Future requests to the URL will 404, but nobody’s collections break or anything.
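To sketch that version-discovery API from the list above: the /versions path and the response shape are pure invention here, just to show how little the client would need:

```typescript
// Hypothetical version-discovery call against an API R1 doesn't have yet.
async function listVersions(schemaURL: string): Promise<string[]> {
  const res = await fetch(`${schemaURL}/versions`, {
    headers: { Accept: "application/json" },
  })
  if (res.status === 404) {
    // the schema moved or was deleted; the caller has to track down the new URL
    throw new Error(`Schema not found: ${schemaURL}`)
  }
  return res.json() // e.g. ["4.2.0", "4.1.0", "4.0.2", ...]
}

// listVersions("http://r1.underlay.org/schemas/baylor/snap")
```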
Basically this is the vendoring approach to package dependencies - collections vendor their import tree, so they aren’t affected by linkrot.
That said, I think we should be careful of treating schemas too much like software libraries. There’s no equivalent of a “bugfix” - if you fix a typo in a property, then everyone who consumes your schema doesn’t just have to upgrade to the new schema version - they have to regenerate all of their own data. This is just how it is; every schema change requires a migration, and a major goal for us should be to start working on migrations, but for now we don’t have tools for that. So even though this all has the overall architecture of a package manager, it has a different kind of usage profile. We shouldn’t zoom in too much on the UX of “upgrading schemas” because for now most schema changes are catastrophic no matter what we do.
One point that Travis brought up earlier is that having collections import an array of schemas prevents us from having a single URL for the schema of a collection. I think this is fine - everything that we would want to do around "checking if two collections have compatible schemas" can be done just as easily with arrays. I could easily be missing something obvious, but I don't know what specifically we'd need singular URLs for collection schemas for.
Lastly, we need to consider what exactly we mean by "semantic versioning" here. How do we interpret major / minor / patch changes to schemas? The gradations I can think of are:
1. changes to comments / documentation only
2. adding new types
3. anything involving editing or removing existing types
Or maybe it’d make more sense to combine 1 and 2, and save the major/minor difference for different kinds of migrations - as in “a minor bump comes with a formally verified migration” (once we have them!) and “a major bump will require you to regenerate your data from scratch”. Either way we should be honest that since schema upgrades and code upgrades behave in different ways, we’re not literally using semver.
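Just to pin down the first reading, here's one made-up way to map those gradations onto bumps (under the second reading, minor vs. major would instead track whether a verified migration exists):

```typescript
// One possible (made-up) encoding of the gradations above: what kind of
// work a consumer is signing up for when they see a version bump.
type SchemaChange =
  | "docs-only" // 1. comments / documentation only
  | "additive" // 2. new types added, nothing existing touched
  | "breaking" // 3. existing types edited or removed

function requiredBump(change: SchemaChange): "patch" | "minor" | "major" {
  switch (change) {
    case "docs-only":
      return "patch" // nothing to migrate
    case "additive":
      return "minor" // existing assertions still validate
    case "breaking":
      return "major" // regenerate your data (until we have migration tools)
  }
}
```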
(One precedent for this is the Badger key/value store - they use something they call "Serialization Versioning". It turns out that there are plenty of kinds of projects that semver just doesn't apply to.)
I feel like I wrote this in a confusing way, and what I’m proposing is actually way simpler than it seems. Let me know if anything doesn’t make sense.