Schema imports and lifecycle

I understand the impulse to introduce UUIDs for schemas but I think that’s trying to solve a problem we don’t actually have. This is a sketch of one way to do schema and collection management that avoids them and (I think?) has all the properties we want.

Here’s what I imagine:

Schemas are developed and versioned entirely separately from collections. In fact let’s say that the entire schema world is R1-specific and is completely outside of “the Underlay Protocol”. The only thing that’s part of the Underlay protocol is the compiled collection, which we’ll get to later.

Half of R1 is dedicated to schemas. You can browse schemas, create a new schema, edit it, and release versions of it. Versions have semver tags. Maybe it uses Git under the hood, since schemas are text-based, with semver release tags just like Go modules. But however we do it, R1 is responsible for “implementing” versioning so that it can resolve versioned schema URLs like http://r1.underlay.org/schemas/baylor/foo/v0.3.1. We don’t even necessarily need an API for interacting with schemas (although using Git would make this easy!).

Schema imports are R1 schema URLs with an exact version like this:

[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
version = "4.2.0"

[[import]]
url = "http://r1.underlay.org/schemas/baylor/crackle"
version = "0.3.1"

[[import]]
url = "http://r1.underlay.org/schemas/emerson/pop"
version = "45.0.0"

Or maybe even like this:

import = [
  "http://r1.underlay.org/schemas/baylor/snap@v4.2.0",
  "http://r1.underlay.org/schemas/baylor/crackle@0.3.1",
  "http://r1.underlay.org/schemas/emerson/pop@45.0.0"
]

Schemas are just like git repos and don’t have permanent identity. Maybe someone renames their schema to a new URL. They can do that and the old URL no longer resolves.

Okay, so how do schemas integrate with the collection lifecycle?

Pretend that collection.json has this minimal format:

type Collection = {
  name: string
  schema: { url: string; version: string }[]
  provenance: string
  assertions: string[]
}

where assertions is a set of hashes (not important here).
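
To make that concrete, here’s a hypothetical instance of the type above. The collection and schema names come from the examples elsewhere in this post; the u:Qm… values are invented placeholder hashes, not a real URI scheme.

// Purely illustrative instance of the Collection type above.
const example: Collection = {
  name: "baylor/snap",
  schema: [
    { url: "http://r1.underlay.org/schemas/emerson/crackle", version: "2.0.1" },
  ],
  provenance: "u:QmProvenance...",                      // placeholder hash
  assertions: ["u:QmAssertionOne...", "u:QmAssertionTwo..."], // placeholder hashes
}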

Like package.json, collection.json is a file that is not typically edited manually, and is instead edited through a GUI on R1 or through command-line tools (let’s focus on just these two interfaces for now). “Editing” for our purposes here means adding, removing, or updating schema imports.

Every time that collection.json is edited, the compiled collection file collection.nq (or whatever) gets re-compiled. collection.nq is an RDF dataset that has the collection metadata like its name and its member assertions, and also a compiled collection schema.

A “compiled collection schema” is a canonical RDF representation of the schema import tree. In other words, to compile the schema tree, you (see the sketch after this list):

  • recursively fetch all the imported URLs and hash the (TOML) results
  • if the tool has a cached copy of a URL it can just use that
  • overwrite re-defined labels in order
  • get a canonical RDF representation of it all in a “schema schema”
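
Here’s a rough TypeScript sketch of that walk. The URL layout, the sha256 hashing, and the TOML library are all my assumptions, not a spec, and the final RDF canonicalization step is left as a stub:

import { createHash } from "node:crypto"
import { parse as parseToml } from "@iarna/toml" // assumed TOML parser

type Import = { url: string; version: string }
type Schema = { import?: Import[]; shapes?: Record<string, unknown> }

// naive in-memory cache keyed by url@version, standing in for a real one
const cache = new Map<string, string>()

async function fetchSchemaText({ url, version }: Import): Promise<string> {
  const key = `${url}@v${version}`
  const hit = cache.get(key)
  if (hit !== undefined) return hit // cached copies are fine to reuse
  const res = await fetch(`${url}/v${version}`) // assumed R1 URL layout
  const text = await res.text()
  cache.set(key, text)
  return text
}

// Depth-first walk: hash each TOML source, recurse into its imports,
// then overwrite re-defined labels in import order (later wins).
async function compileSchemaTree(imports: Import[]) {
  const shapes: Record<string, unknown> = {}
  const sources: (Import & { hash: string })[] = []
  for (const imp of imports) {
    const text = await fetchSchemaText(imp)
    const hash = createHash("sha256").update(text).digest("hex")
    const schema = parseToml(text) as Schema
    const child = await compileSchemaTree(schema.import ?? [])
    sources.push(...child.sources, { ...imp, hash })
    Object.assign(shapes, child.shapes, schema.shapes)
  }
  // A real compiler would now emit the canonical RDF dataset (in the
  // "schema schema") from `shapes`; that step is out of scope here.
  return { shapes, sources }
}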

What’s especially nice about this is that we can put the tree of TOML sources in the provenance of the compiled collection. :innocent:

Specifically, the compiled collection gets put into a named graph, and this named graph is an entity in the default graph with this schema:

[shapes.CompiledCollection]
generatedAt = "dateTime"
derivedFrom = { kind = "uri" } # the hash of collection.json

[shapes.CompiledCollection.Imports]
kind = "reference"
label = "SchemaImport"
cardinality = "any"

[shapes.SchemaImport]
# unfortunately cardinality="any" values are not sorted,
# but the compiled schema is order-sensitive,
# so we have to include explicit indices
index = "integer"
version = "string"
# this is the regular schema URL
url = { kind = "uri" }
# this is a content-addressed URI for the TOML text of that version
text = { kind = "uri" }

[shapes.SchemaImport.Imports]
kind = "reference"
label = "SchemaImport"
cardinality = "any"

(look! it’s our schema language in action!!)

So here, the named graph that contains the compiled schema will appear as an instance of the CompiledCollection shape in the default graph, along with a tree of SchemaImport instances that reference the exact hashed TOML that was used to produce it.
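
For the imports in the earlier example, the default graph would describe something shaped roughly like this. I’m showing it as a plain object rather than RDF triples, and the date and text hashes are invented placeholders:

// Illustrative only: the data a CompiledCollection instance and its
// tree of SchemaImport instances would carry.
const compiled = {
  generatedAt: "2020-06-01T00:00:00Z",
  derivedFrom: "u:QmCollectionJson...", // hash of collection.json
  imports: [
    {
      index: 0,
      url: "http://r1.underlay.org/schemas/baylor/snap",
      version: "4.2.0",
      text: "u:QmSnapToml...", // content-addressed TOML source
      imports: [],             // baylor/snap's own imports would nest here
    },
    {
      index: 1,
      url: "http://r1.underlay.org/schemas/baylor/crackle",
      version: "0.3.1",
      text: "u:QmCrackleToml...",
      imports: [],
    },
    // ... and so on for emerson/pop
  ],
}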

This is all down in the weeds, but the takeaway is that:

  • Every time you edit a collection you regenerate a compiled collection
  • The compiled collection has a copy of the schema import tree, including every URL, version, and hash
  • The compiled collection is included in every published version of the collection
  • The TOML sources of all the imported schemas are included in every published version of the collection

(to clarify: the compiled collection dataset has all of the metadata in collection.json, and lists all of the included assertions by hash, but does not literally contain those assertions, and does not literally contain the source TOML text)

Okay, how does this design play with different day-to-day workflows?

  • Someone downloads a collection for local use. Great! All the imported schemas are bundled in this collection so they can look at them if they want, and verify that the assertions all validate by generating a compiled collection themselves.
  • Someone wants to upgrade a collection they have locally. Great! You just fetch the new collection. There’s nothing to update yet because for now versions aren’t related to each other very strongly - they’re just different collections.
  • Someone creates a new collection on R1. To do this you need a schema! Browse schemas on r1.underlay.org and pick some you like to get started. Or create your own.
  • Someone creates a new schema on R1. Cool - maybe you want to import some existing ones (at specific versions).
  • Someone updates a schema on R1. Great - make the edits and publish a new version.
  • Someone upgrades an import version in a schema. Sure - this is no different than any other edit, but you’ll obviously want to review the changes and make sure that you actually want to do this. You’re changing a schema, after all.
  • Someone wants to change the schema of their collection. Absolutely. But the specifics matter a lot here. If the schema only added new types without changing any of the old ones, you can just set the new version in collection.json and all your existing assertions will still validate. But if there are any edits to types that you’re using, you’ll require a migration, which can be a huge undertaking, depending on the specific kinds of changes involved. We don’t have any migration tools yet, so that basically means re-generating all of the assertions (from whatever workflow / pipeline you got them from in the first place). This cannot be automated for now and will require human attention.
  • Someone wants to upgrade a schema import version in a collection. Okay, fetch the new version. But this is no different than any other edit - if there are any changes to old types, it will require a migration and you’ll need to do manual work to re-generate your entire collection.
  • Someone wants to know if newer versions of a schema are available. Sure, R1 will have some HTTP API where you can ask what versions of a schema at a URL are available (a sketch of such an endpoint follows this list). Or, to begin with, you just do this in the GUI. Maybe this 404s because the schema moved or got deleted! That means you have to track down where it is and enter the new URL yourself. Maybe R1 uses UUIDs internally to give you informative 301 / 410 responses.
  • Someone renames their organization or schema. Okay. Future requests to the URL will 404, but nobody’s collections break or anything.
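
On the version-discovery point above, the endpoint could be as small as this. Both the /versions route and the response shape are invented; R1 hasn’t committed to any API yet:

// Hypothetical: ask R1 which versions exist for a schema URL.
async function availableVersions(schemaUrl: string): Promise<string[]> {
  const res = await fetch(`${schemaUrl}/versions`)
  if (res.status === 404) {
    // the schema moved or was deleted; time to track down the new URL
    throw new Error(`schema not found: ${schemaUrl}`)
  }
  return (await res.json()) as string[] // e.g. ["4.2.0", "4.1.0", "4.0.2"]
}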

Basically this is the vendoring approach to package dependencies - collections vendor their import tree, so they aren’t affected by linkrot.

That said, I think we should be careful of treating schemas too much like software libraries. There’s no equivalent of a “bugfix” - if you fix a typo in a property, then everyone who consumes your schema doesn’t just have to upgrade to the new schema version - they have to regenerate all of their own data. This is just how it is; every schema change requires a migration, and a major goal for us should be to start working on migrations, but for now we don’t have tools for that. So even though this all has the overall architecture of a package manager, it has a different kind of usage profile. We shouldn’t zoom in too much on the UX of “upgrading schemas” because for now most schema changes are catastrophic no matter what we do.

One point that Travis brought up earlier is that having collections import an array of schemas prevents us from having a single URL for the schema of a collection. I think this is fine - everything that we would want to do around “checking if two collections have compatible schemas” can be done just as easily with arrays. I could easily be missing something obvious, but I don’t know specifically what we’d need singular URLs for collection schemas for.

Lastly, we need to consider what exactly we mean by “semantic versioning” here. How do we interpret major/minor/patch changes to schemas? The gradations I can think of are:

  1. changes to comments / documentation only
  2. adding new types
  3. anything involving editing or removing existing types

Or maybe it’d make more sense to combine 1 and 2, and save the major/minor difference for different kinds of migrations - as in “a minor bump comes with a formally verified migration” (once we have them!) and “a major bump will require you to regenerate your data from scratch”. Either way we should be honest that since schema upgrades and code upgrades behave in different ways, we’re not literally using semver.
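
As a strawman for that second interpretation, the bump rule might look like this (the change categories are just the gradations above; none of this is settled):

type SchemaChange =
  | "docs-or-additive"   // gradations 1 + 2 above, combined
  | "verified-migration" // breaking, but ships a formally verified migration (future!)
  | "regenerate"         // breaking with no migration: re-derive your data from scratch

// Strawman version-bump policy.
function bump([major, minor, patch]: [number, number, number], change: SchemaChange): string {
  switch (change) {
    case "docs-or-additive":   return `${major}.${minor}.${patch + 1}`
    case "verified-migration": return `${major}.${minor + 1}.0`
    case "regenerate":         return `${major + 1}.0.0`
  }
}

bump([0, 3, 1], "docs-or-additive") // "0.3.2"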

(One precedent for this is the badger key/value store - they use something they call “serialization versioning”. It turns out that there are plenty of kinds of projects that semver just doesn’t apply to.)

I feel like I wrote this in a confusing way, and what I’m proposing is actually way simpler than it seems. Let me know if anything doesn’t make sense.

Quick thoughts: Naming and agreeing on schemas, and defining mappings between them, seems fundamental enough to dedicate part of R1 to it. Over time, some of the things we might want to do include

  • Creating schemas
  • Uploading schemas from elsewhere
  • Naming, renaming, + creating aliases
  • Browsing + selecting schemas
  • Annotating with notes + examples
  • Editing + releasing versions
  • Defining links between them: inheritance, derivatives, equivalence

Naming a collection-schema
In part for some of the reasons you mention about brittleness around schema changes, it seems useful to have a single {url, version} for the compiled collection-schema. Thought experiment: start w/ a world where users are encouraged to reuse an existing collection-schema, rather than construct a new one. Then the longer list of subschemas would all be stored in the schema referred to.

> There’s no equivalent of a “bugfix” - if you fix a typo in a property, then everyone who consumes your schema doesn’t just have to upgrade to the new schema version - they have to regenerate all of their own data … for now most schema changes are catastrophic no matter what we do.

Worth a separate discussion: you may want to know about such changes, and note them locally, without recompiling your own data (until you absolutely have to).

Thanks for writing this up! I’ve spent a bunch of time thinking about this the past few months - less from the technical angle, and more from the R1 product angle. I’m on board with lots of the technical suggestions (e.g. collection compilation, serialization versioning) - but want to give the context of where I’m coming from.

My thinking started at a very similar place: imagine two separate ‘sections’ of R1 for Schemas and Collections. From there, resonating with SJ’s point, I started to think through all the ways people would want to communicate around a given schema. They may want to

  • include a README,
  • include a sample dataset to make example queries possible,
  • add users with permissions to update/moderate the schema and its metadata, and
  • have a space for Discussions about the schema.

The feature set started looking a lot like the features we’ve been talking about for Collections. Additionally, I think it may be confusing to have the publication of a simple CSV result in two “things”. I think most early (i.e. novice - which we still are) use cases will have a completely bespoke schema. The ‘Schema space’ on R1 would likely be filled with a long tail of one-off schemas that really ought to be understood in the context of the collection they’re associated with.

This made me start thinking about how we could simplify to having a single “thing” (i.e. Collections) that captures the uses we’ve been talking about. The ‘single URL’ vs ‘array of schemas’ thing for me comes from this attempt at combination, so we can re-use the collection name/id when talking about the schema. Though, that said, I suppose there’s no reason resolving an array is any different than resolving a single file. So perhaps I’m not too hung up on that specific point.

I think all I’m pitching is the following:

[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
id = "fffc238b-5f58-487d-b892-f679dc82047e"
version = "4.2.0"

When compiling the collection, the URL is ignored and the id is used, but, when inspecting uncompiled files, the url is valuable for human-legibility. If the author updates the collection name (or profile name), the URL can be updated when it’s noticed that the id doesn’t match the url. In the world where collection.json files aren’t manually edited (but perhaps manually inspected), populating the id field seems trivial, and saves the hassle of trying to track down any redirects.
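
Here’s a sketch of what that id-first resolution might look like in a tool. The canonical UUID route matches the compiled-schema URIs mentioned below; the rename-detection header is purely my invention:

type IdImport = { url: string; id: string; version: string }

// Assumed behavior: fetch against the stable id, then repair the
// human-readable url field if the schema has been renamed since.
async function resolveImport(imp: IdImport): Promise<string> {
  // canonical, rename-proof address keyed by uuid + version
  const canonical = `http://r1.underlay.org/schemas/${imp.id}/${imp.version}`
  const res = await fetch(canonical)
  // hypothetical header carrying the schema's current name
  const currentUrl = res.headers.get("x-schema-url")
  if (currentUrl !== null && currentUrl !== imp.url) {
    console.warn(`schema renamed: ${imp.url} -> ${currentUrl}`)
  }
  return res.text()
}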

This still feels a bit fuzzy to me, but I feel like it may be useful to have an explicit schema version for collection schemas that simply import another collection’s schema.

Say you’re creating a collection baylor/snap and using the schema emerson/crackle@v2.0.1.

Your schema.toml simply looks like:

[[import]]
url = "http://r1.underlay.org/schemas/emerson/crackle"
id = "fffc238b-5f58-487d-b892-f679dc82047e"
version = "2.0.1"

You call this schema/baylor/snap@v0.0.1. Someone then decides they’re going to download your data set, add some new shapes, and publish that. Each time you update your collection version, they do the same: download your data, add some shapes, and publish. At some point though, you (baylor/snap) add a shape to your schema:

[[import]]
url = "http://r1.underlay.org/schemas/emerson/crackle"
id = "fffc238b-5f58-487d-b892-f679dc82047e"
version = "2.0.1"

[shapes.Person]
[shapes.Person.imdb]
kind: "uri"

Now, the person watching your data sees that you pushed schema/baylor/snap@v0.1.1 (they get a notification because they were importing schema/baylor/snap@v0.0.1). If instead the original dataset simply referenced the emerson/crackle@v2.0.1 schema, there would be no name to help that watcher know something on your end (the baylor/snap end) is new.

Compiled Schema URIs would look like: http://r1.underlay.org/schemas/fffc238b-5f58-487d-b892-f679dc82047e/2.0.1.

Back to the ‘single file vs array’ - we can have schema/baylor/snap@v0.0.1 point to an array specified in the collection.json, but I think introducing the name schema/baylor/snap is still useful, and argues for every collection having its own schema-space even if it’s going to rely purely on commonly re-used public schemas.

I’m not sure I’m making this point well, so please ask questions or we can chat through it on a call. I think I’m on board with all the technical details you’ve suggested, with the additional notions of:

  1. Adding an id field to the uncompiled schema.toml imports (to simplify implementation of redirects and rename notifications)
  2. Having R1 implement schema namespacing for each collection (rather than having it be a separate ‘Schema space’).

Epilogue

After writing all that, I realize there’s some time dimension I’m not totally clear on. I realize that you could implement the id checks by always having the import command look to the compiled schema. That is, as long as a given schema (and collection) is compiled at the time of edit, the compiled schema could hold the id and hash of all imported schemas. In that case, when importing a schema, your collection.json would actually populate namespace information based on the id/hash found in the compiled schema (not the uncompiled schema). So, the only chance for a namespace to fall out from under you would be:

  1. You copy-pasted a value from an uncompiled schema.toml
  2. The namespace of the schema you’re importing changes during your compilation.

I think the question then is how often we expect people to do (1).

And I guess another question: what is the downside of including an id field in an uncompiled schema.toml (especially if schema.toml files aren’t meant to be manually edited)? Just verbosity?

Let’s schedule a call to walk through all this in more depth!

A couple clarifications:

So I was imagining that in this scenario there would actually be no schema.toml, since collections import schemas directly. In other words, collection.toml (or .json or whatever) would have

[[schema]]
url = "http://r1.underlay.org/schemas/emerson/crackle"
version = "2.0.1"

Then in the situation you describe, you (baylor/snap) would only have to create a schema when you decide you want to add shapes of your own. You’d even have two ways of going about it:

  • the schema baylor/snap you create imports the old schema, and your collection only imports the new one you made
  • you create a schema with just the new shapes and no imports, and the collection imports both of them

(the second option is only possible if you don’t need to reference shapes from the first schema in the new shapes you add)

Either way, people “watching” your data can tell that the schemas imported by the collection have changed.

I think we might be using “namespace” in different ways. I intended the .namespace property in schema.toml to just be a prefix that lets us write out the labels of shapes and properties without needing to quote full URIs - not a value that’s actually associated with the schema. And I don’t think the namespace has any necessary relationship to the “schema URL” that it gets imported by.

In other words, this schema

namespace = "http://example.com/"

[shapes.Person]
name = "string"

could be equivalently written

[shapes."http://example.com/Person"]
"http://example.com/Person/name" = "string"

… which TOML allows.
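
Mechanically, that expansion is just string concatenation, with shape labels expanding against the namespace and property labels expanding under their shape. A sketch of the rule the example above implies (absolute URIs pass through untouched):

const isAbsolute = (label: string) => /^[a-z][a-z0-9+.-]*:/i.test(label)

// Expand a shape label against the schema's namespace prefix.
function expandShape(namespace: string, shape: string): string {
  return isAbsolute(shape) ? shape : namespace + shape
}

// Expand a property label under its shape's expanded URI.
function expandProperty(namespace: string, shape: string, property: string): string {
  return isAbsolute(property) ? property : expandShape(namespace, shape) + "/" + property
}

expandShape("http://example.com/", "Person")           // "http://example.com/Person"
expandProperty("http://example.com/", "Person", "name") // "http://example.com/Person/name"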

Thanks for writing all this - this has been a really interesting set of things to think about. Still formulating a bit, but I had some thoughts. The following part gave me some pause:

> In general, I think the part that people will screw up on the most, particularly at the start, will be schema formulation, as it’s a workflow that will be quite unfamiliar to most of the intended users of R1. However, I think SJ’s workaround…

…is a nice one.

My feeling is that users who’ll access R1 through the GUI (the CLI could be a good deal more specific, I think) will follow a bit of a Boltzmann distribution in terms of who really cares about how the schema is constructed (with most people caring very little, and a few caring a lot). This presents us with a bit of a choice: do we convince people to care a lot initially, getting them to get it broadly right at the cost of quite a high usability barrier, or do we accept some inaccuracy on the first pass and invite clarity later?

I’m inclined toward the latter, and I feel like in this instance we could even quite actively encourage people to come back and revisit schemas that they made early on (that might just be strings rather than URIs), but whoever is using their dataset downstream wouldn’t need to re-download unless they care about it (itself a function of how many people care about schemas). Over time, that will hopefully shift to the right a bit as the number of things in R1 increases, and with them the value of Linked Data as a core part of the product.

I think more generally, accessible and ‘good’ schema construction will be the hardest and most interesting design decision - how to define the search space for ‘suggested’ schemas, etc - and will have a really big impact on how people use R1.

Right - I didn’t literally mean that downstream schema users have to re-generate on every new version, since they don’t have to upgrade at all. And there’s much less reason to “upgrade schemas” than there is to upgrade software (bugfixes etc), especially if you’re happy with the schema at the version you’re on.

I think it’s our job to design the whole thing in a way that conveys that authoring schemas is serious, serious business, especially when it comes to naming labels and properties. URIs are supposed to be forever. I don’t think of this as something we should try to “smooth over”, it’s something we design around by communicating these underlying principles to the authors.

I may not be explaining myself well, because what you describe doesn’t really get at the need I’m trying to solve for. The data and schema of a given collection are likely to change at very different rates, so “watching” data (i.e. the collection version number) may not be the best way of knowing whether the schema changed (you’d still have to diff against the previous schema - or perhaps even diff every collection version back to the one you last checked). Let’s talk through it on a call, I think this is a relatively minor point, but we’re maybe talking past each other.

Yep - seems we certainly are. I’m talking about the fact that colloquially, people are likely to refer to a given schema by underlay.org/schemas/emerson/maps@2.02. emerson/maps is the thing I’m referring to as ‘namespace’ here, but in the general sense of the term, not the schema-specific definition of the term. So, perhaps more accurately, I’m looking for a way to let schema URLs be relatively stable and to have canonical fallbacks that avoid 404s if a schema author decides to change their collection/schema URL.

> I’m talking about the fact that colloquially, people are likely to refer to a given schema by underlay.org/schemas/emerson/maps@2.02. emerson/maps is the thing I’m referring to as ‘namespace’ here, but in the general sense of the term, not the schema-specific definition of the term.

That’s how I normally use the term also. This may be a reason to avoid ‘namespace’ as a keyword.