Tasl schema language updates

Now that we’ve had a chance to write a few schemas, and now that I’ve had to write a CodeMirror lexer and a TextMate grammar, I think we’re in a good position to make some (informed) breaking changes to the tasl schema language. I want to do this now because it feels like we might be near the end of the just-us-using-this window, and a couple of issues came up while writing the grammars that should be addressed.

double-colons for class declarations

previously:

namespace ex http://example.com/

class ex:Person { }

now:

namespace ex http://example.com/

class ex:Person :: { }

reasoning: I want there to be more visual consistency among “statements that define a class” (ie edge and class declarations) that differentiate them from statements that only do things locally (ie namespace and type variable declarations). Pretend we added a list keyword similar to class that declares a linked list. Looking at this schema…

class ex:A <>
type Foo string
list ex:Bar dateTime
type bar {}
edge ex:Bar ==/ ex:Z /=> ex:A

… you can’t really scan this quickly and tell “what classes are declared by this schema?” But by adding a little more consistency to distinguish them…

class ex:A :: <>
type Foo string
list ex:Bar :: dateTime
type bar {}
edge ex:Z :: ex:Bar => ex:A

… now we know what to look for. Statements that declare a class will always follow the [keyword] [class uri key] :: [... special syntax] format.
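
As a rough illustration of why the uniform shape helps, here is a minimal Python sketch that lists the class keys a schema declares just by looking for `keyword key ::` at the start of a line. It assumes the proposed syntax, including the pretend `list` keyword from the example above; `declared_classes` is a hypothetical helper and not part of any tasl tooling.

```python
import re

# Hypothetical helper, not part of tasl: matches "<keyword> <key> :: ..." at line start.
# The keyword set includes the pretend `list` keyword used in the example above.
DECLARATION = re.compile(r"^\s*(class|edge|list)\s+(\S+)\s+::(\s|$)")

def declared_classes(schema: str) -> list[str]:
    """Return the class keys declared by a schema, in source order."""
    keys = []
    for line in schema.splitlines():
        match = DECLARATION.match(line)
        if match:
            keys.append(match.group(2))
    return keys

SCHEMA = """\
class ex:A :: <>
type Foo string
list ex:Bar :: dateTime
type bar {}
edge ex:Z :: ex:Bar => ex:A
"""

print(declared_classes(SCHEMA))  # ['ex:A', 'ex:Bar', 'ex:Z']
```

The same one-line pattern would not work on the old syntax, where the edge label sits in the middle of the statement.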

I picked the double-colon because to me it has “export this to a global scope” connotations. I’m open to other tokens.

potential objections:

  • makes tasl even more colon-dense than it already was
  • ???

new edge declaration syntax

the point of changing the class statement syntax is to make all the class declarations (including edge statements) more consistent, so here we are.

previously:

namespace ex http://example.com/

class ex:Person { }

edge ex:Person ==/ ex:Friendship /=> ex:Person

now:

namespace ex http://example.com/

class ex:Person :: { }

edge ex:Friendship :: ex:Person => ex:Person

Personally I think this is a huge improvement; I regret trying to put the edge label in the middle between the source and target. As always, this expands to

class ex:Friendship :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
}

potential objections:

  • should it be a double-length arrow ==> ?

new edge metadata syntax

previously (I don’t know if this was ever documented) you could also annotate edges with a type like this:

edge ex:Person ==/ ex:Friendship someType /=> ex:Person
edge ex:Person ==/ ex:Rivalry <> /=> ex:Person

which would add an additional ul:value component to the class:

class ex:Friendship :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
  ul:value -> someType
}

class ex:Rivalry :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
  ul:value -> <>
}

I like this feature a lot and think we should definitely keep it. For the new syntax, that looks like this:

edge ex:Friendship :: ex:Person =/ someType /=> ex:Person
edge ex:Rivalry :: ex:Person =/ <> /=> ex:Person

Notice that now I’m using just =/ for the first pipe segment instead of ==/ - this is something that I’d like feedback on… ==/ someType /=> looks a little more “balanced” because it has three characters in each segment, but it’s just more stuff to type and =/ someType /=> gets the same fundamental “message” across. I guess this is related to whether the unannotated edge statement should use => or ==>. Barring any compelling reason to use the longer ones I’m planning on just defaulting to the shorter => and =/. We could also potentially do something other than / but I’m pretty happy with it.

Just to clarify - this means there are two versions of the edge declaration syntax, one with a value annotation, and one without.

# valid; doesn't have a ul:value component
edge ex:Friendship :: ex:Person => ex:Person

# also valid; has a ul:value component
edge ex:Friendship :: ex:Person =/ someType /=> ex:Person
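
The expansion of both forms can be sketched in a few lines of Python. This is only an illustration of the desugaring described above, not the real tasl parser: `expand_edge` is a hypothetical name, and the regular expression assumes a single-line statement whose source, target, and key contain no whitespace.

```python
import re

# Hypothetical sketch of the desugaring; not the real tasl parser.
# Matches both "edge K :: S => T" and "edge K :: S =/ V /=> T".
EDGE = re.compile(
    r"^edge\s+(?P<key>\S+)\s+::\s+(?P<source>\S+)\s+"
    r"(?:=>|=/\s+(?P<value>.+?)\s+/=>)\s+(?P<target>\S+)\s*$"
)

def expand_edge(statement: str) -> str:
    """Expand an edge declaration into the equivalent class declaration."""
    match = EDGE.match(statement.strip())
    if match is None:
        raise ValueError(f"not an edge declaration: {statement!r}")
    lines = [
        f"class {match['key']} :: {{",
        f"  ul:source -> * {match['source']}",
        f"  ul:target -> * {match['target']}",
    ]
    # The ul:value component only exists for the annotated form.
    if match["value"] is not None:
        lines.append(f"  ul:value -> {match['value']}")
    lines.append("}")
    return "\n".join(lines)

print(expand_edge("edge ex:Friendship :: ex:Person => ex:Person"))
print(expand_edge("edge ex:Rivalry :: ex:Person =/ <> /=> ex:Person"))
```

The first call prints a class with only ul:source and ul:target; the second adds a `ul:value -> <>` line.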

Mandatory newlines; no more semicolons

This is the most significant change I’m proposing. Previously we used semicolons to delimit product components and coproduct options…

type foo { ex:a -> string; ex:b -> string; ... }
type bar [ ex:i >- string; ex:j >- string; ... ]

… and all type expressions were whitespace-insensitive; ie you could always write any arbitrarily complex type on one run-on line if you wanted. This was done roughly by analogy to JSON and to the TypeScript type = declaration syntax.

Unfortunately there are some problems with this related to parsing URIs, which I could have seen coming but didn’t notice until I started writing a TextMate grammar for syntax highlighting. Reference expressions like * ex:Person end in a URI, which means that often you’ll end up writing something like { ex:friend -> * ex:Person; ... }. But the semicolon is a valid character that can appear unescaped in URI path segments or fragments (it’s in the sub-delims group in RFC 3986 - “reserved” but still valid in most places), which means ex:Person; needs to get parsed as a single token (a URI that happens to end in a semicolon).
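
To make the tokenization problem concrete, here is a minimal Python sketch of a naive greedy lexer. The character class is one reading of RFC 3986 (unreserved, gen-delims, sub-delims, plus % for percent-encoding), and `uri_tokens` is a hypothetical helper, not code from the actual CodeMirror or TextMate grammars.

```python
import re

# Characters that may appear somewhere in a URI per RFC 3986: unreserved,
# gen-delims, sub-delims, and "%" for percent-encoding. ";" is a sub-delim.
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"
# A scheme-or-prefix, a colon, then any run of URI characters.
URI_TOKEN = re.compile(rf"[A-Za-z][A-Za-z0-9+.\-]*:[{URI_CHARS}]*")

def uri_tokens(source: str) -> list[str]:
    """Greedily pick out URI-shaped tokens, the way a naive lexer would."""
    return URI_TOKEN.findall(source)

print(uri_tokens("{ ex:friend -> * ex:Person; ex:age -> integer }"))
# ['ex:friend', 'ex:Person;', 'ex:age']
```

The semicolon is swallowed into the second token, so it can never act as a delimiter unless whitespace comes before it.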

I don’t think saying “you can only use X lexical subset of URIs in tasl” is on the table; if we’re using URIs we have to support URIs (I’m making a point of tokenizing the octets in an IPv6 hostname to demonstrate our commitment to this). “Any absolute URI is a valid token” is a fundamental organizing rule that I don’t want to break… which also rules out commas and just about every other reasonable delimiter. I don’t want to require whitespace before the delimiter (which is the style that most of the examples are written in right now) because it’s a really nonstandard requirement and doesn’t look very nice.

(and this isn’t just a problem with reference types; it also shows up with the compact coproduct-of-units enum syntax [ex:a; ex:b; ex:c])

After playing around with lots of options I came to the conclusion that the simplest thing to do is just remove the delimiter altogether and require newlines for every product component and coproduct option; this makes tasl into a more blocky language similar to the way types are declared in Go, C, Rust, etc. This is, by a huge margin, the “safest” way of handling URIs, since whitespace is the only common character class that is guaranteed to not appear in them.

previously:

type hello [ex:a; ex:b; ex:c >- ? { ex:i -> integer; ex:j -> * ex:Person }]

now:

type hello [
  ex:a
  ex:b
  ex:c <- ? {
    ex:i -> integer
    ex:j -> * ex:Person
  }
]

Only empty products {} (and empty coproducts [], which are valid but essentially useless) can appear on one line; otherwise you need a newline after the opening bracket and before the closing bracket. You can still have arbitrary non-newline whitespace (ie spaces and tabs) anywhere. The optional operator stays on the same line, as in the example above.

# these are all valid --------------------------------------
class ex:A :: { }
class ex:B ::      {

         }
class ex:C :: [

ex:option1

                          ex:option2

]

# these are all INVALID ------------------------------------
class ex:X :: { ex:foo -> string }
class ex:Y :: { ex:component1 -> string
}
class ex:Z :: {

  ex:component1 -> string }
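
The rule can be checked line by line. Here is a minimal Python sketch, assuming brackets are always separated from their neighbors by whitespace (which the proposal does not actually require); `check_newlines` is a hypothetical helper, not a real tasl validator, and it ignores bracket nesting entirely.

```python
PAIRS = {"{": "}", "[": "]"}
CLOSERS = set(PAIRS.values())

def check_newlines(source: str) -> list[int]:
    """Return the 1-based numbers of lines that break the newline rule."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        tokens = line.split()
        ok = True
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token in ("{}", "[]"):
                pass  # empty product or coproduct written without inner space
            elif token in PAIRS:
                if i + 1 < len(tokens) and tokens[i + 1] == PAIRS[token]:
                    i += 1  # empty pair like "{ }": skip the closer too
                elif i != len(tokens) - 1:
                    ok = False  # something follows the opener on this line
            elif token in CLOSERS and i != 0:
                ok = False  # something precedes the closer on this line
            i += 1
        if not ok:
            bad.append(number)
    return bad

VALID = """\
class ex:A :: { }
class ex:B ::      {

         }
class ex:C :: [

ex:option1

                          ex:option2

]
"""

INVALID = """\
class ex:X :: { ex:foo -> string }
class ex:Y :: { ex:component1 -> string
}
class ex:Z :: {

  ex:component1 -> string }
"""

print(check_newlines(VALID))    # []
print(check_newlines(INVALID))  # [1, 2, 6]
```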

potential objections:

  • enums are a little more verbose because you can’t list all the options on one line
  • dramatically changes the feel of the language
  • ???

use regular left arrows for coproducts

I want to reverse my previous decision about <- vs >-.

previously:

type hello [ ex:a >- string; ex:b >- integer ]

now:

type hello [
  ex:a <- string
  ex:b <- integer
]

The only strong reason for >- over <- is that in some situations editors will auto-match <> brackets, and typing <- would “accumulate” a matching > character to the right of the cursor. This only happens by default when the language isn’t known (whenever we write a language extension for any IDE we’re able to tell it explicitly which sets of brackets to auto-close).

After thinking about it more, it doesn’t seem like this is a big enough concern to outweigh the simple fact that “<-” is just easier to remember. Products point right; coproducts point left. That’s it.


Feedback on any of this is welcome. Over the next few weeks I’ll be focusing on writing copy for tasl.io, moving the contents of the underlay/apg repo into one master underlay/tasl repo, and releasing new minor versions of all the libraries, so there’s no hard deadline but I would like to finalize these fairly soon.

Syntactic sugar causes cancer of the semicolon

eliminate syntactic sugar altogether and you end up with LISP

what I would like to have is a

homoiconic Abstract Syntax Graph Processing Language capable of expressing semantics and intent

I found this paper interesting. I have the impression that there is a fair amount of commonality of intent here

Hypothesis

DSLs, Intentional Software, Knowledge Graphs
are converging in my mind

Here’s an alternative path that I thought about, decided against, and am now reconsidering:

The way that some RDF-universe languages handle URIs is to have two ways of writing them:

  1. a bare/prefixed/compacted form prefix:slug, where prefix has been previously declared as a prefix in the document
  2. a delimited/absolute/expanded form <http://some.full/uri/including/weird#charaters!##[]?/?;;=,>

since < and > are among the few characters that can’t appear in any part of a URI. In this pattern, you can’t use prefixes inside angle brackets, and you can’t write a full URI without them. Having the angle bracket syntax justifies restricting the character set allowed in the slug in the compact form, since if people have URIs with characters outside that set they can still use them, just with the angle brackets.

Another nice thing about this is that it explicitly differentiates between prefixed things and not-prefixed things, which isn’t clear otherwise. e.g. just looking at some URI s3:foo/bar, it’s not locally clear whether it’s supposed to mean s3 as an actual URI scheme (which is relatively common) or whether s3 was previously declared as a namespace that expands to e.g. http://my-s3-bucket/ (which might also be relatively common). This is something that kind of annoys me about JSON-LD etc. It feels a little sloppy.

This format (having both prefixed and absolute URIs) would be nice for us because we could just limit slugs to be the unreserved charset from RFC 3986, meaning we could keep semicolons and avoid requiring newlines.
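
A minimal Python sketch of how the two forms could be resolved, assuming slugs are limited to the RFC 3986 unreserved set. The prefix character set, the `resolve` name, and the error behavior are all assumptions made for illustration, not decisions from this proposal.

```python
import re

# Compact form: prefix ":" slug, with the slug limited to RFC 3986 "unreserved"
# characters (ALPHA / DIGIT / "-" / "." / "_" / "~"). The prefix set is assumed.
COMPACT = re.compile(r"([A-Za-z][A-Za-z0-9]*):([A-Za-z0-9\-._~]*)")
# Absolute form: anything non-empty between angle brackets.
ABSOLUTE = re.compile(r"<([^<>\s]+)>")

def resolve(term: str, namespaces: dict[str, str]) -> str:
    """Resolve a compact or angle-bracketed term to an absolute URI."""
    absolute = ABSOLUTE.fullmatch(term)
    if absolute:
        return absolute.group(1)
    compact = COMPACT.fullmatch(term)
    if compact is None:
        raise ValueError(f"not a compact or absolute URI: {term!r}")
    prefix, slug = compact.groups()
    if prefix not in namespaces:
        raise ValueError(f"undeclared prefix: {prefix!r}")
    return namespaces[prefix] + slug

NAMESPACES = {
    "ex": "http://example.com/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

print(resolve("ex:Person", NAMESPACES))    # http://example.com/Person
print(resolve("<s3:foo/bar>", NAMESPACES)) # s3:foo/bar
# The slug stops before ";", so the semicolon is free to act as a delimiter.
print(COMPACT.match("ex:Person; ex:age").group(0))  # ex:Person
```

Under these assumptions a bare `s3:foo/bar` is rejected (the slash is outside the slug set), so an actual s3 URI has to be written with angle brackets, which removes the ambiguity described above.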

This all seemed really great to me for a while but I couldn’t figure out how to reconcile it with literal types (which is what we use angle brackets for right now). This made sense to me at the time and felt consistent in a satisfying way - { } were products, [ ] were coproducts, and < > were the primitives (<> for the URI type and <some:uri> for literal types). I felt a little bad about it being inconsistent with the way other RDF specs use angle brackets (ie for absolute URIs) but not terribly.

If angle brackets mean absolute URIs, then there are two options for literals:

  1. we find a different syntax altogether
  2. we re-use the same syntax. literal types are just datatype URIs after all, so we just say that a URI (in either compact or expanded form) is a valid type expression, and it means “the literal type with that datatype”

So valid uses of literals would look like this:

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

# a literal with datatype http://www.w3.org/2001/XMLSchema#string
type literal1 xsd:string

# a literal with datatype http://my.custom.domain?term=location
type literal2 <http://my.custom.domain?term=location>

This is ultra-convenient and ultra-simple but my big concern is that it’s too easy to confuse with reference types.

for example…

namespace ex http://example.com/

class ex:Person :: {}

class ex:Book :: {
  ex:author -> * ex:Person
}

class ex:Store :: {
  ex:owner -> ex:Person
}

here, the type of a book’s author is a pointer to a person, but the type of a store’s owner is a literal type with datatype http://example.com/Person. In the current version of tasl (ie the one on R0 right now) the Store class declaration would be invalid syntax - bare URIs aren’t valid type expressions (if you want a literal type you have to explicitly say so with angle brackets, and if you want a reference type you have to start it with a * and then you can write a URI).
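
A tiny Python sketch of how these two component types would be read if bare URIs were accepted as literal types (option 2 above). `classify` is a hypothetical helper that only handles these two shapes, not a real tasl type parser.

```python
def classify(type_expression: str) -> tuple[str, str]:
    """Classify a type expression as a reference or a literal (option 2 only)."""
    parts = type_expression.split()
    if len(parts) == 2 and parts[0] == "*":
        return ("reference", parts[1])  # "* ex:Person": pointer to a class
    if len(parts) == 1:
        return ("literal", parts[0])    # "ex:Person": literal with that datatype
    raise ValueError(f"unsupported type expression: {type_expression!r}")

print(classify("* ex:Person"))  # ('reference', 'ex:Person')
print(classify("ex:Person"))    # ('literal', 'ex:Person')
```

Both inputs are accepted, and the only difference between them is the leading asterisk.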

I can’t really tell how big of a deal this is. I really don’t like how such a tiny and easily-overlooked typo can change the semantics of the whole type, and I especially don’t like that it’s a really hard error to catch since "a literal type with datatype http://example.com/Person" is a totally valid thing to say. If a person did make this error, it’s not really clear how or where they’d discover it. :confused:

Then again, maybe it’s not that big of a deal. In other languages, if you accidentally type string instead of *string then… that’s considered an “acceptable” error. idk.

I’d really like to know if other people think this is a concern or not.

Thinking about this more - I’m now leaning towards switching URIs to the two-format scheme mentioned above and keeping semicolons

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

class ex:Person :: {
  <http://schema.org/name> -> ? <http://www.w3.org/2001/XMLSchema#string>;
  ex:name -> xsd:string;
  ex:bar -> [ ex:option1; ex:option2 <- boolean ];
}

I think there is still a case for eliminating semicolons and requiring newlines on its own merits (ie independent of URI tokenization concerns); I guess it’s fundamentally about how the language should feel. Should it feel blocky or snakey?

For reference, here’s the same sample without semicolons

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

class ex:Person :: {
  <http://schema.org/name> -> ? <http://www.w3.org/2001/XMLSchema#string>
  ex:name -> xsd:string
  ex:bar -> [
    ex:option1
    ex:option2 <- boolean
  ]
}