The technical foundation of the Underlay has four concrete pieces:
- an abstract data model (schemas/types/instances/elements/values)
- a binary serialization (how to encode and decode instances)
- a schema language (a text language for authoring schemas)
- an HTTP Collection API (for interacting with a version history of instances)
I’m confident in the pieces themselves, but not as confident in what to call each of them.
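For concreteness, here’s a loose TypeScript sketch of how those five concepts hang together. These type names and shapes are illustrative only - they are not the actual `@underlay/apg` definitions, just shorthand for the structure of the model.

```typescript
// Illustrative only: not the real @underlay/apg definitions, just a
// sketch of how the five data model concepts relate to each other.

// A type describes the shape of values. The model is algebraic, so
// types are assembled from products, coproducts, references to
// classes, URIs, and literals.
type Type =
  | { kind: "uri" }
  | { kind: "literal"; datatype: string }
  | { kind: "product"; components: Record<string, Type> }
  | { kind: "coproduct"; options: Record<string, Type> }
  | { kind: "reference"; class: string }

// A schema is a set of named classes, each of which has a type.
type Schema = Record<string, Type>

// A value inhabits a type; an element is a top-level value of a class.
type Value =
  | string // a URI or literal value
  | { [component: string]: Value } // a product value
  | { option: string; value: Value } // a coproduct value, tagged with its option

type Element = Value

// An instance of a schema holds the elements of each of its classes.
type Instance = Record<string, Element[]>
```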
Here’s a quick oral history of how we got here:
In the beginning (this time last year), the data model didn’t have a proper name, and I was using the umbrella label “APG” for all this. The data model was published as `@underlay/apg`, the codec for the binary serialization was published as `@underlay/apg-format-binary`, and all development was kept in the underlay/apg monorepo.
When I started thinking about a schema language, I thought we needed to name it (just the schema language, not the rest), which is what “tasl” originally was supposed to be: the tiny algebraic schema language. tasl was a front-end to APG schemas.
After developing things a little more I began to feel a little uncomfortable with the “APG” label - I didn’t know if it was appropriate to take it so directly from the paper, and our data model has some crucial additional characteristics that weren’t in the paper. So I began transitioning everything to a consolidated world where the data model, binary serialization, and schema language were all “tasl”. The HTTP Collection API still felt completely separate - part of the Underlay, built on tasl.
Since the four pieces themselves feel conceptually stable to me, now could be a good time to have a conversation about names once and for all.
I’ll start with two observations that have been making me question the data model + binary serialization + schema language grouping:
1. many or most uses of the data model won’t need the schema language
I originally thought that basically every schema would be written in the text language, but when developing R0 I found that we weren’t actually using it at all. On R0, every schema is produced (as data / as an object) from pipeline blocks, each of which takes its own (very constrained and domain-specific) kind of user input. A general schema language just wasn’t part of the story. In the implementation, each block just used a data model library (`@underlay/apg` at the time) to assemble, manipulate, or reason about different schemas as objects.
I think this is going to be pretty common. The archetype is a little tool that publishes a CSV as a collection - the right way to do this is to generate the schema along with the data. I’ve said a couple of times that the only real role of the “schema” resource on R0 would be for collaborating on specifications, and I still think that’s about right. It’s important to have a general schema language, but in practice we should expect that tools will generally generate their schemas as data.
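To make the archetype concrete, here’s a rough sketch of what that little CSV tool might do, written against the illustrative types from the sketch above rather than any real library API:

```typescript
// Sketch of the CSV archetype: derive the schema from the header row
// and assemble the instance from the data rows. Schema / Type /
// Element / Instance are the illustrative types above, not the
// actual @underlay/apg API.

const xsdString = "http://www.w3.org/2001/XMLSchema#string"

function schemaFromCsvHeader(classUri: string, header: string[]): Schema {
  const components: Record<string, Type> = {}
  for (const column of header) {
    // Without any further user input, every column is just a string.
    components[column] = { kind: "literal", datatype: xsdString }
  }
  return { [classUri]: { kind: "product", components } }
}

function instanceFromCsvRows(classUri: string, header: string[], rows: string[][]): Instance {
  const elements: Element[] = rows.map((row) =>
    Object.fromEntries(header.map((column, i) => [column, row[i]]))
  )
  return { [classUri]: elements }
}
```

The point is just that the schema is an ordinary object, produced by the same code that produces the data - no text language appears anywhere.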
2. many or most uses of the data model won’t use the binary serialization
For a long time, I thought that the Collection API would be cleanly separated from tasl internals. For example, I thought that if you GET a version of a collection you’d just get the serialized binary instance in the HTTP response, and that the collection API wouldn’t expose anything internal to each instance. But the closer I get to actually implementing this, the more I’ve started to worry about practicality and size - “what about when people publish really large collections?”
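That blob-oriented picture is easy to sketch. Assuming a hypothetical route shape and media type (both made up for illustration):

```typescript
// Hypothetical: the URL structure and media type here are invented;
// the point is just "the whole binary instance in one response".
const res = await fetch("https://example.org/collections/acme/census/versions/3", {
  headers: { accept: "application/x-tasl-instance" },
})
const instance = new Uint8Array(await res.arrayBuffer()) // the entire serialized instance
```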
This should be a separate discussion of its own: picking a “maximum supported collection size” - not one that we’d actually enforce, but one that we’d use as a design guideline. If collections max out at a couple of gigabytes, then downloading binary blobs is no problem. But we have to do something different if we want to interact with files that are a couple of terabytes or larger.
Worrying about this made me reconsider the clean separation between tasl and the Collection API. Wouldn’t it make sense to let people GET a specific element within an instance? Wouldn’t it make sense to let people iterate over elements? Wouldn’t it make sense to just let people query the instance with JSON? And if that turns into the primary way that people interact with instances, then isn’t that tasl too?
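For contrast, here’s roughly what that element-oriented surface might look like - again, every route shape here is hypothetical:

```typescript
// Hypothetical element-level routes, sketched for contrast with the
// blob-oriented API above.

const base = "https://example.org/collections/acme/census/versions/3"

// GET a single element of a class by its index
const person = await fetch(`${base}/elements/Person/42`, {
  headers: { accept: "application/json" },
}).then((res) => res.json())

// Iterate over all elements of a class in pages, without ever
// downloading the whole binary instance
async function* elements(className: string, pageSize = 100) {
  for (let offset = 0; ; offset += pageSize) {
    const url = `${base}/elements/${className}?offset=${offset}&limit=${pageSize}`
    const page: unknown[] = await fetch(url).then((res) => res.json())
    yield* page
    if (page.length < pageSize) return
  }
}
```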
This is still pretty underdeveloped and I don’t understand the whole picture yet. It’ll probably make sense to let Underlay V1 run with a very minimal, naive, blob-oriented collection API, but the question still feels relevant to where tasl starts and ends.
I think our goal here should just be distinct names for each of the four parts. They could all be related - I’m not opposed to “Underlay Data Model” / “Underlay Schema Language” / etc. Super concretely, we need canonical file extensions for text schemas (right now that’s `.tasl`), binary schemas (right now that’s `.schema`), and binary instances (right now that’s `.instance`).