What's tasl and what's not tasl?

The technical foundation of the Underlay has four concrete pieces:

  1. an abstract data model (schemas/types/instances/elements/values)
  2. a binary serialization (how to encode and decode instances)
  3. a schema language (a text language for authoring schemas)
  4. an HTTP Collection API (interacting with a version history of instances)

I’m confident in these as such, but not as confident in what to call each of them.

Here’s a quick oral history of how we got here:

In the beginning (this time last year), the data model didn’t have a proper name, and I was using the umbrella label “APG” for all this. The data model was published as @underlay/apg, the codec for the binary serialization was published as @underlay/apg-format-binary, and all development was kept in the underlay/apg monorepo.

When I started thinking about a schema language, I thought we needed to name it (just the schema language, not the rest), which is what “tasl” was originally supposed to be: tiny algebraic schema language. tasl was a front-end to APG schemas.

After developing things a little more I began to feel a little uncomfortable with the “APG” label - I didn’t know if it was appropriate to take it so directly from the paper, and our data model has some crucial additional characteristics that weren’t in the paper. So I began transitioning everything to a consolidated world where the data model, binary serialization, and schema language were all “tasl”. The HTTP Collection API still felt completely separate - part of the Underlay, built on tasl.

Since the four pieces themselves feel conceptually stable to me, now could be a good time to have a conversation about names once and for all.

I’ll start with two observations that have been making me question the data model + binary serialization + schema language grouping:

1. many or most uses of the data model won’t need the schema language

I originally thought that basically every schema would be written in the text language, but when developing R0 I found that we weren’t actually using it at all. On R0 every schema is produced (as data / as an object) from pipeline blocks, each of which takes its own (very constrained and domain-specific) kind of user input. A general schema language just wasn’t part of the story. In the implementation, each block just used a data model library (@underlay/apg at the time) to assemble, manipulate, or reason about different schemas as objects.

I think this is going to be pretty common. The archetype is a little tool that publishes a CSV as a collection - the right way to do this is to generate the schema along with the data. I said a couple times that the only real role of the “schema” resource on R0 would be for collaborating on specifications, and I still think that’s about right. It’s important to have a general schema language, but we should expect in practice that tools will generally be generating their schemas as data.
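To make the "schema as data" idea concrete, here is a minimal sketch of the CSV case. Everything in it is an assumption for illustration: `Literal`, `Product`, and `schema_from_csv` are hypothetical stand-ins, not the @underlay/apg API, and treating every column as a string literal is just the simplest possible choice.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for schema types (not the @underlay/apg API).
@dataclass
class Literal:
    datatype: str  # a datatype URI

@dataclass
class Product:
    components: dict  # column name -> type

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

def schema_from_csv(text: str, class_name: str) -> dict:
    """Build a one-class schema: a product with one string literal per CSV column."""
    header = next(csv.reader(io.StringIO(text)))
    row_type = Product({column: Literal(XSD_STRING) for column in header})
    return {class_name: row_type}

# The schema is produced as an object alongside the data; no schema text is written.
schema = schema_from_csv("name,age\nAda,36\n", "http://example.com/Person")
```

The point is only that the tool never touches a text schema language: the schema falls out of the CSV header as a plain object.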

2. many or most uses of the data model won’t use the binary serialization

For a long time, I thought that the Collection API would be cleanly separated from tasl internals. For example, I thought that if you GET a version of a collection you’d just get the serialized binary instance in the HTTP response, and that the collection API wouldn’t expose anything internal to each instance. But the closer I get to actually implementing this, the more I’ve started to worry about practicality and size - “what about when people publish really large collections?”

This should be something we have a separate discussion about - picking a “maximum supported collection size”, not one that we actually enforce, but one that we use as a design guideline. If collections max out at a couple gigabytes, then downloading binary blobs is no problem. But we have to do something different if we want to interact with files that are a couple terabytes or larger.

Worrying about this made me reconsider the clean separation between tasl and the Collection API. Wouldn’t it make sense to let people GET a specific element within an instance? Wouldn’t it make sense to let people iterate over elements? Wouldn’t it make sense to just let people query the instance with JSON? And if that turns into the primary way that people interact with instances, then isn’t that tasl too?

This is still pretty underdeveloped and I don’t understand the whole picture yet. Probably it’ll make sense to let Underlay V1 run with a very minimal naive blob-oriented collection API, but it still feels relevant to considering where tasl starts and ends.

I think our goal here should just be distinct names for each of the four parts. They could all be related - I’m not opposed to “Underlay Data Model” / “Underlay Schema Language” / etc. Super concretely, we need canonical file extensions for text schemas (right now that’s .tasl), binary schemas (right now that’s .schema), and binary instances (right now that’s .instance).


and for the HTTP API we’ll want to pick MIME types for all three too - like application/x-ul-schema, application/x-ul-instance, …
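As a rough sketch of what pinning these down could look like, here is a lookup table using only the placeholder names mentioned so far. Pairing `application/x-ul-schema` with binary schemas is my assumption, the text schema has no proposed MIME type yet so it is left as `None`, and `content_type_for` is a hypothetical helper.

```python
# Placeholder names from this thread; none of them are settled.
FORMATS = {
    "text schema": {"extension": ".tasl", "mime": None},  # no MIME type proposed yet
    "binary schema": {"extension": ".schema", "mime": "application/x-ul-schema"},
    "binary instance": {"extension": ".instance", "mime": "application/x-ul-instance"},
}

def content_type_for(filename: str):
    """Return the provisional MIME type for a filename, or None if there isn't one."""
    for fmt in FORMATS.values():
        if filename.endswith(fmt["extension"]):
            return fmt["mime"]
    return None
```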

This feels obvious in retrospect but incredibly clarifying given the history and evolution of thinking.

The experience with R0 rings true on all the points you make, and I find it helpful to walk through how and why the thinking changed over time.

The further we get in building out R1 and its use case, the more the binary serialization starts to feel akin to a database dump.

  • It’s not the primary way anyone will interact with the data
  • It’s potentially useful as a canonical backup/transfer filetype (which you want to be as compact as possible)
  • For very large datasets, a single backup file likely isn’t practical and you’d have backup infrastructure rather than a backup file (i.e. you’d talk to the Collection API).

I agree that each of the four parts is important for a complete, mature system - but it feels like we can get really far with just (1) on its own. And after that, adding (4) will get us even further. (2) and (3) may then come in to add maturity to the system, but likely won’t lead to a shift in primary uses or applications.

Given this point, I could also see a future that has multiple schema languages that sit on top of the data model and produce a binary schema serialization. The fact that there could be multiple flavors is part of why it makes sense to me for tasl to be its own named thing. There could also be vasl - verbose algebraic schema language, or any number of approaches for different applications. Do you think there is a path where the concrete pieces are (1), (2), and (4), with the schema language treated more like the applications and registries? That is, clearly an important piece to making this stuff real, but not part of the “core”.

I’m all for direct naming of things:

  • UL data model
  • UL binary schema .ulbs or .ubs
  • UL binary instance .ulbi or .ubi

Two quick notes

I should have mentioned one other use of the binary serialization - we need it for hashing! Maybe this ends up being the primary / only real use, but it’s going to be a core part of the spec (the HTTP API is necessarily going to “depend” on it).
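A minimal sketch of why hashing leans on the serialization: two in-memory copies of the same instance only get the same hash if they encode to the same bytes. The `encode` function below is a stand-in (sorted-key JSON), not the actual binary format, and SHA-256 is an assumed choice of hash, not something we've decided.

```python
import hashlib
import json

def encode(instance: dict) -> bytes:
    """Stand-in for the binary serialization: any deterministic encoding works here."""
    return json.dumps(instance, sort_keys=True, separators=(",", ":")).encode("utf-8")

def instance_hash(instance: dict) -> str:
    """Hash the encoded bytes, never the in-memory object."""
    return hashlib.sha256(encode(instance)).hexdigest()

a = {"name": "Ada", "age": 36}
b = {"age": 36, "name": "Ada"}  # same content, different key order in memory
same = instance_hash(a) == instance_hash(b)
```

Here `same` is true only because `encode` is deterministic, which is the property the real binary serialization has to guarantee for the HTTP API to rely on hashes.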

Yeah, I think there will be a variety of ways of producing binary schemas, and maybe some of them will look and feel like languages. But I still expect that we’ll want one that is both a) general and b) minimal. If nothing else, I want it for documentation - there has to be “a way to write out schemas” that people are able to read.