I think we should use TOML as the base format: it's (almost) the JSON data model we all know and love, but it also allows comments. It'd be good to browse the TOML homepage before reading on.
```toml
# I'm a schema!
# And these are comments!

# We definitely should version the schema format.
# Not sure what a good name for this is that won't get
# confused with meaning "the version of *this* schema"
formatVersion = 1

# Schemas have a top-level .import field, which is an
# array of { url, version } objects.
# In TOML, you can list them like this (note the double brackets):
[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
version = "4.2.0"

[[import]]
url = "http://r1.underlay.org/schemas/baylor/crackle"
version = "0.3.1"

[[import]]
url = "http://r1.underlay.org/schemas/emerson/pop"
version = "45.0.0"
```
Versions in this TOML format are just semver strings. Every time a version of a collection gets published, its schema imports get resolved and compiled into one big flat schema in an unreadable RDF format, and that’s what gets hashed, similar to package-lock.json. This deserves its own discussion somewhere else, but the point is that the TOML format is purely human-readable and human-editable.
All that importing does is let you reference the imported types as the values of properties, which we’ll see later. You can’t “extend” types.
If a type in a schema is defined with the same label as an imported type, the imported one is just ignored. Similarly, the imports themselves overwrite each other in order if there are conflicts (it's important that `.import` is an array). But since everything will be namespaced, collisions should never really happen. Speaking of which:
```toml
# Schemas also have a required top-level namespace string.
# This has to be a URI that ends in "/" or "#"
namespace = "http://foo.com/bar/"
```
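As a sketch of the import rules from above - later imports overwrite earlier ones on label conflicts, and locally defined types shadow imported ones. This is purely illustrative; the function and variable names are all hypothetical:

```python
# Hypothetical sketch of import resolution: imports overwrite each
# other in .import array order, and local type definitions win.

def merge_types(imported_schemas, local_types):
    """imported_schemas: a list of {label: definition} dicts, in
    .import order. local_types: the schema's own .types table."""
    merged = {}
    for schema in imported_schemas:
        merged.update(schema)   # later imports overwrite earlier ones
    merged.update(local_types)  # local definitions shadow imports
    return merged

snap = {"Person": "snap's Person", "City": "snap's City"}
crackle = {"Person": "crackle's Person"}
local = {"City": "our own City"}
print(merge_types([snap, crackle], local))
# {'Person': "crackle's Person", 'City': 'our own City'}
```

Because everything is namespaced, this overwrite path should rarely trigger in practice.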
Great. Now on to types, which live in a top-level `.types` object:
```toml
# Types have zero or more properties.
[types.Skyscraper]
# This one has zero.
```
There are two kinds of properties that types can have: literal properties and reference properties (still thinking about names for these, lmk wyt).
Literals are one of `string`, `integer`, `double`, `boolean`, `dateTime`, and `date` (ie the xsd datatypes that I think are the most common).
References point to another type.
Every property has an associated cardinality, which is either `required`, `optional`, or `any`. `any` means that there can be any number of values (zero or more). Values are not ordered. `required` is the default cardinality if not specified.
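A quick sketch of what these cardinality rules mean for a property's set of values (a hypothetical helper, not part of the spec):

```python
# Hypothetical validity check for one property's values under each
# cardinality. Values are modeled as an (unordered) list.
def check_cardinality(cardinality, values):
    if cardinality == "required":  # exactly one value
        return len(values) == 1
    if cardinality == "optional":  # zero or one value
        return len(values) <= 1
    if cardinality == "any":       # zero or more values
        return True
    raise ValueError(f"unknown cardinality: {cardinality!r}")
```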
As a special shortcut, you can define literal properties by just saying:
```toml
[types.Person]
name = "string"
age = "integer"
```
But you can only do this for literals, and the implied cardinality is `required`. In general, properties are defined like this:
```toml
[types.Person]

[types.Person.name]
type = "string"
cardinality = "any"

[types.Person.age]
type = "integer"
cardinality = "optional"

[types.Person.knows]
reference = "Person"
cardinality = "any"

# This is NOT VALID!
# properties have to either be literals or references
[types.Person.baz]
type = "integer"
reference = "Person"
```
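One way to think about the shorthand is as sugar that normalizes to the full table form, with the default `required` cardinality filled in. A sketch (this normalization step is my own illustration, not spec'd anywhere):

```python
# Hypothetical normalization of a property definition: a bare string
# is the literal shorthand; a table gets the default cardinality.
LITERALS = {"string", "integer", "double", "boolean", "dateTime", "date"}

def expand_property(value):
    if isinstance(value, str):  # shorthand form, literals only
        if value not in LITERALS:
            raise ValueError(f"shorthand only works for literals: {value}")
        return {"type": value, "cardinality": "required"}
    return {**value, "cardinality": value.get("cardinality", "required")}

print(expand_property("string"))
# {'type': 'string', 'cardinality': 'required'}
```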
If we wanted to reference an imported type, we’d have to use its full URI, like this:
```toml
[types.Person.hometown]
reference = "http://r1.underlay.org/schemas/common/City"
cardinality = "optional"
```
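Resolving a `reference` value could then be as simple as: anything that looks like a full URI is used as-is, and a bare label joins onto the schema's own namespace. Again, a sketch - the "looks like a URI" test here is just a stand-in:

```python
# Hypothetical reference resolution: full URIs pass through, bare
# labels resolve against the schema's namespace.
def resolve_reference(ref, namespace):
    if "://" in ref:  # crude "is this already a full URI?" check
        return ref
    return namespace + ref
```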
Some of these things could use better names - in particular, I don't feel great about `reference =`, and I don't feel great about `types`.
Also, we could easily add "JSON" as a datatype, and maybe we should, but there are some potential downsides. It'd be a good escape hatch, but it wouldn't be good if people just used it for things that could be properly typed.
But wait!? What about provenance!?
Instead of trying to define two separate data- and prov-level schemas etc etc, a simpler approach would just be this: collection.toml has a `.schemas` array and a `.provenance` key. Here's what I mean:

- Collections specify an array of schemas (ie implicitly importing them all). They do this with the exact same `{url: string; version: string}[]` format.
- Collections also have a top-level `.provenance: string` property (or some other name like `.meta` or `.graph`). The value of that property is a URI that has to be one of the labels imported in one of the schemas.
- Assertions in a collection validate when:
  - The contents of all the named graphs validate against the imported schemas
  - The named graph labels appear in the default graph as instances of the type indicated by the `.provenance` key. There could be other things in the default graph as necessary.
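Those two validation conditions could be sketched like this, with graphs modeled as plain dicts (everything here is a hypothetical stand-in for real RDF structures):

```python
# Hypothetical check of the two conditions above:
#  1. every named graph validates against the imported schemas
#  2. every graph label is typed in the default graph as the
#     collection's .provenance type
def assertion_is_valid(named_graphs, default_graph_types,
                       provenance_type, validates):
    """named_graphs: {label: graph}.
    default_graph_types: {label: rdf_type} from the default graph.
    validates(graph): checks a graph against the imported schemas."""
    for label, graph in named_graphs.items():
        if not validates(graph):
            return False
        if default_graph_types.get(label) != provenance_type:
            return False
    return True
```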
So to make all this concrete, suppose we have this schema:

```toml
namespace = "http://r1.underlay.org/common/"

[types.Person]
name = "string"

[types.Person.knows]
reference = "Person"
cardinality = "any"
```
and somewhere else we have this schema:

```toml
namespace = "http://www.w3.org/ns/prov#"

[types.Entity]

[types.Entity.name]
type = "string"
cardinality = "optional"

[types.Derivation]

[types.Derivation.subject]
reference = "Entity"

[types.Derivation.entity]
reference = "Entity"

[types.Derivation.comment]
type = "string"
cardinality = "optional"
```
which depicts a simple PROV model where entities, which have names, are derived (with comments) from other entities.
Then, a collection.toml would start with something like this:
```toml
# NB: in TOML, top-level keys have to come before any [[schema]]
# tables, otherwise the key would attach to the last table.
provenance = "http://www.w3.org/ns/prov#Entity"

[[schema]]
url = "http://r1.underlay.org/common"
version = "1.4.3"

[[schema]]
url = "http://r1.underlay.org/prov"
version = "1.0.0"
```
(Note that the "import URL" doesn't necessarily correspond at all to the URI labels defined in the schema that you end up importing. It's just a directive telling the compiler where to look.)
Okay, so what does an assertion in this collection look like? Well, it has some named graphs with data in them.
```
PREFIX common = "http://r1.underlay.org/common/"

_:b0 rdf:type common:Person _:g1 .
_:b0 common:Person/name "Joel" _:g1 .
_:b1 rdf:type common:Person _:g1 .
_:b1 common:Person/name "Travis" _:g1 .
_:b2 rdf:type common:Person/knows _:g1 .
_:b2 ul:source _:b0 _:g1 .
_:b2 ul:target _:b1 _:g1 .
```
(This is all in the named graph `_:g1`. Also note the slash in `common:Person/name` - the "dots" in `Person.name` in the schema are path elements in the implied URI.)
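So constructing a predicate URI from a schema is mechanical - the namespace, then the property path with dots turned into slashes. A sketch (hypothetical helper):

```python
# Hypothetical construction of the implied predicate URI: the "dots"
# in types.Person.name become path segments after the namespace.
def property_uri(namespace, type_label, prop):
    return f"{namespace}{type_label}/{prop}"

print(property_uri("http://r1.underlay.org/common/", "Person", "name"))
# http://r1.underlay.org/common/Person/name
```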
Notice that - woah! - cardinality-`any` properties like `knows` get reified with their own blank node, with `source` and `target` predicates. Under the hood, cardinality-`any` properties are really just a shorthand way of defining another type with required `source` and `target` properties. This is a very good thing to do and sets us up well for extending the data model (e.g. with edge properties) in the future.
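The compile-down for `knows` could be sketched like this - a cardinality-`any` property becomes its own type whose instances carry required `source` and `target` references (the dict shapes here are my own, purely illustrative):

```python
# Hypothetical expansion of a cardinality-"any" property into its own
# type with required source and target properties.
def reify_any_property(type_label, prop, definition):
    return {
        f"{type_label}/{prop}": {
            "source": {"reference": type_label,
                       "cardinality": "required"},
            "target": {"reference": definition["reference"],
                       "cardinality": "required"},
        }
    }
```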
Anyway, what do we put in the default graph to make this a valid assertion?
… Well, we have to make `_:g1` a valid instance of `prov:Entity`.
This means adding an `rdf:type` for it, and a value for at least all of its properties.
```
_:g1 rdf:type prov:Entity .
_:g1 prov:comment _:b3 .
_:b3 ul:none _:b4 .
```
… hmmm, what's going on here? Well, cardinality-`optional` properties "compile" down to cardinality-`required` properties under the hood, just like cardinality-`any` properties did. In this case, the value that every entity is required to have is "either a comment, or nothing" - every entity has to have one of those values! The value of an "or" type like that is a single blank node with one outgoing predicate. Which predicate it is tells you what type to expect at the other end (where the "nothing" at the other end is represented as a dangling blank node `_:b4`). These are known as "discriminated unions" or "tagged nulls" and all sorts of other names.
So if we actually did have a comment for this entity, we’d write something like:
```
_:g1 rdf:type prov:Entity .
_:g1 prov:comment _:b3 .
_:b3 ul:some "This is a graph that I found on the street" .
```
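In other words, an optional value always compiles to exactly one tagged blank node - either a `some` edge to the value or a `none` edge to a dangling node. A sketch of the round trip, modeling the tagged node as a plain dict (names hypothetical):

```python
# Hypothetical tagged-null round trip: an optional value compiles to a
# single node with exactly one outgoing predicate, "some" or "none".
def encode_optional(value):
    return {"none": None} if value is None else {"some": value}

def decode_optional(tagged):
    (tag, payload), = tagged.items()  # exactly one outgoing predicate
    return payload if tag == "some" else None
```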
Again, this sets us up *really* well for a more expressive data model in the future. And I think and hope that very few people (basically just us, aka Underlay developers) will ever have to actually touch the RDF representation like this.
We could tell a bigger story in the default graph too, if we wanted! I was going to write out a bigger example using a `Derivation` from the prov schema, but I feel like I've gotten the point across, and it may be out of scope for this post.
The gist w/r/t provenance is that we make collections declare what type their named graphs are going to be, which could be dead-simple (no properties) or complicated (entities/prov/etc). I expect that most prov won’t be that complicated, and that the 90% case will be to use a type like the example I gave that just has an optional string comment field.
Lastly, here’s a quick JSON-LD representation of the example assertion:
```json
{
  "@context": {
    "ul": "http://underlay.org/ns/",
    "common": "http://r1.underlay.org/schemas/common/",
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@type": "prov:Entity",
  "prov:comment": {
    "ul:some": "This is a graph that I found on the street"
  },
  "@graph": [
    {
      "@id": "_:joel",
      "@type": "common:Person",
      "common:Person/name": "Joel"
    },
    {
      "@id": "_:travis",
      "@type": "common:Person",
      "common:Person/name": "Travis"
    },
    {
      "@type": "common:Person/knows",
      "ul:source": { "@id": "_:joel" },
      "ul:target": { "@id": "_:travis" }
    }
  ]
}
```
The only predicates that we need to reserve in the `ul:` namespace for this preliminary data model are `source`, `target`, `some`, and `none`. The two pairs have the same number of characters, which is a sign from God that we're on the right track.