Tasl schema language updates

Now that we’ve had a chance to write a few schemas, and now that I’ve had to write a CodeMirror lexer and a TextMate grammar, I think we’re in a good position to make some (informed) breaking changes to the tasl schema language. I want to do this now because it feels like we might be near the end of the just-us-using-this window, and a couple of issues came up while writing grammars that should be addressed.

double-colons for class declarations

previously:

namespace ex http://example.com/

class ex:Person { }

now:

namespace ex http://example.com/

class ex:Person :: { }

reasoning: I want there to be more visual consistency among “statements that define a class” (ie edge and class declarations) that differentiate them from statements that only do things locally (ie namespace and type variable declarations). Pretend we added a list keyword similar to class that declares a linked list. Looking at this schema…

class ex:A <>
type Foo string
list ex:Bar dateTime
type bar {}
edge ex:Bar ==/ ex:Z /=> ex:A

… you can’t really scan this quickly and tell “what classes are declared by this schema?” But by adding a little more consistency to distinguish them…

class ex:A :: <>
type Foo string
list ex:Bar :: dateTime
type bar {}
edge ex:Z :: ex:Bar => ex:A

… now we know what to look for. Statements that declare a class will always follow the [keyword] [class uri key] :: [... special syntax] format.
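To make the "now we know what to look for" point concrete, here is a minimal Python sketch (not part of any tasl tooling) that lists the classes a schema declares by looking for that shape. The regular expression and the function name `declared_classes` are assumptions for illustration, and `list` is the pretend keyword from the example above.

```python
import re

# Assumed shape of a class-declaring statement: keyword, URI key, then "::".
# "list" is the hypothetical keyword from the example, not a real tasl keyword.
DECLARATION = re.compile(r"^\s*(class|edge|list)\s+(\S+)\s+::(?:\s|$)")


def declared_classes(schema):
    """Return (keyword, key) pairs for every statement that declares a class."""
    found = []
    for line in schema.splitlines():
        match = DECLARATION.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


schema = """\
class ex:A :: <>
type Foo string
list ex:Bar :: dateTime
type bar {}
edge ex:Z :: ex:Bar => ex:A
"""

print(declared_classes(schema))
# [('class', 'ex:A'), ('list', 'ex:Bar'), ('edge', 'ex:Z')]
```

The `type` statements are skipped because they never contain the `::` token, which is the whole point of the convention.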

I picked the double-colon because to me it has “export this to a global scope” connotations. I’m open to other tokens.

potential objections:

  • makes tasl even more colon-dense than it already was
  • ???

new edge declaration syntax

the point of changing the class statement syntax is to make all the class declarations (including edge statements) more consistent, so here we are.

previously:

namespace ex http://example.com/

class ex:Person { }

edge ex:Person ==/ ex:Friendship /=> ex:Person

now:

namespace ex http://example.com/

class ex:Person :: { }

edge ex:Friendship :: ex:Person => ex:Person

Personally I think this is a huge improvement; I regret trying to put the edge label in the middle between the source and target. As always, this expands to

class ex:Friendship :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
}

potential objections:

  • should it be a double-length arrow ==> ?

new edge metadata syntax

previously (I don’t know if this was ever documented) you could also annotate edges with a type like this:

edge ex:Person ==/ ex:Friendship someType /=> ex:Person
edge ex:Person ==/ ex:Rivalry <> /=> ex:Person

which would add an additional ul:value component to the class:

class ex:Friendship :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
  ul:value -> someType
}

class ex:Rivalry :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
  ul:value -> <>
}

I like this feature a lot and think we should definitely keep it. For the new syntax, that looks like this:

edge ex:Friendship :: ex:Person =/ someType /=> ex:Person
edge ex:Rivalry :: ex:Person =/ <> /=> ex:Person

Notice that now I’m using just =/ for the first pipe segment instead of ==/ - this is something that I’d like feedback on… ==/ someType /=> looks a little more “balanced” because it has three characters in each segment, but it’s just more stuff to type and =/ someType /=> gets the same fundamental “message” across. I guess this is related to whether the unannotated edge statement should use => or ==>. Barring any compelling reason to use the longer ones I’m planning on just defaulting to the shorter => and =/. We could also potentially do something other than / but I’m pretty happy with it.

Just to clarify - this means there are two versions of the edge declaration syntax, one with a value annotation, and one without.

# valid; doesn't have a ul:value component
edge ex:Friendship :: ex:Person => ex:Person

# also valid; has a ul:value component
edge ex:Friendship :: ex:Person =/ someType /=> ex:Person
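A rough Python sketch of how the two forms could be expanded into the equivalent class declaration. The regular expression is an assumed reading of the proposed syntax (it only handles a value type written on one line), and `expand_edge` is a hypothetical helper name.

```python
import re

# Assumed grammar for the two proposed forms:
#   edge KEY :: SOURCE => TARGET
#   edge KEY :: SOURCE =/ VALUE /=> TARGET
EDGE = re.compile(
    r"^edge\s+(?P<key>\S+)\s+::\s+(?P<source>\S+)\s+"
    r"(?:=>|=/\s+(?P<value>.+?)\s+/=>)\s+(?P<target>\S+)\s*$"
)


def expand_edge(statement):
    """Expand an edge statement into the class declaration it stands for."""
    match = EDGE.match(statement.strip())
    if match is None:
        raise ValueError(f"not an edge statement: {statement!r}")
    lines = [
        f"class {match['key']} :: {{",
        f"  ul:source -> * {match['source']}",
        f"  ul:target -> * {match['target']}",
    ]
    # The ul:value component only exists for the annotated form.
    if match["value"] is not None:
        lines.append(f"  ul:value -> {match['value']}")
    lines.append("}")
    return "\n".join(lines)


print(expand_edge("edge ex:Friendship :: ex:Person => ex:Person"))
print(expand_edge("edge ex:Rivalry :: ex:Person =/ <> /=> ex:Person"))
```

The first call prints a class with only `ul:source` and `ul:target`; the second adds `ul:value -> <>`.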

Mandatory newlines; no more semicolons

This is the most significant change I’m proposing. Previously we used semicolons to delimit product components and coproduct options…

type foo { ex:a -> string; ex:b -> string; ... }
type bar [ ex:i >- string; ex:j >- string; ... ]

… and all type expressions were whitespace-insensitive; ie you could always write any arbitrarily complex type on one run-on line if you wanted. This was done roughly by analogy to JSON and to the TypeScript type = declaration syntax.

Unfortunately there are some problems with this related to parsing URIs, which I could have seen coming but didn’t notice until I started writing a TextMate grammar for syntax highlighting. Reference expressions like * ex:Person end in a URI, which means that often you’ll end up writing something like { ex:friend -> * ex:Person; ... }. But the semicolon is a valid character that can appear unescaped in URI path segments or fragments (it’s in the sub-delims group in RFC 3986 - “reserved” but still valid in most places), which means ex:Person; needs to get parsed as a single token (a URI that happens to end in a semicolon).
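A small Python illustration of the problem. The character class below is only a rough stand-in for "characters that may appear in a URI" (a real lexer would follow the full RFC 3986 grammar), and `URI_RUN` is a name made up for this sketch.

```python
import re

# sub-delims as listed in RFC 3986, section 2.2
SUB_DELIMS = "!$&'()*+,;="

# Rough stand-in: a scheme-like prefix, a colon, then any run of characters
# that RFC 3986 allows somewhere in a URI. Not a full URI grammar.
URI_RUN = re.compile(
    r"[A-Za-z][A-Za-z0-9+.\-]*:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*"
)

line = "{ ex:friend -> * ex:Person; ex:name -> string }"
tokens = URI_RUN.findall(line)
print(tokens)
# ['ex:friend', 'ex:Person;', 'ex:name']
```

The second token swallows the semicolon, so a lexer that accepts any URI has no way of telling a delimiter from the last character of the URI.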

I don’t think saying “you can only use X lexical subset of URIs in tasl” is on the table; if we’re using URIs we have to support URIs (I’m making a point of tokenizing the octets in an IPv6 hostname to demonstrate our commitment to this). “Any absolute URI is a valid token” is a fundamental organizing rule that I don’t want to break… which also rules out commas and just about every other reasonable delimiter. I don’t want to require whitespace before the delimiter (which is the style that most of the examples are written in right now) because it’s a really nonstandard requirement and doesn’t look very nice.

(and this isn’t just a problem with reference types; it also shows up with the compact coproduct-of-units enum syntax [ex:a; ex:b; ex:c])

After playing around with lots of options I came to the conclusion that the simplest thing to do is just remove the delimiter altogether and require newlines for every product component and coproduct option; this makes tasl into a more blocky language similar to the way types are declared in Go, C, Rust, etc. This is, by a huge margin, the “safest” way of handling URIs, since whitespace is the only common character class that is guaranteed to not appear in them.

previously:

type hello [ex:a; ex:b; ex:c >- ? { ex:i -> integer; ex:j -> * ex:Person }]

now:

type hello [
  ex:a
  ex:b
  ex:c <- ? {
    ex:i -> integer
    ex:j -> * ex:Person
  }
]

Only empty products {} (and empty coproducts [], which are valid but essentially useless) can appear on one line; otherwise you need a newline after the opening bracket and before the closing bracket. You can still have arbitrary non-newline whitespace (ie spaces and tabs) anywhere. The optional operator stays on the same line, as in the example above.

# these are all valid --------------------------------------
class ex:A :: { }
class ex:B ::      {

         }
class ex:C :: [

ex:option1

                          ex:option2

]

# these are all INVALID ------------------------------------
class ex:X :: { ex:foo -> string }
class ex:Y :: { ex:component1 -> string
}
class ex:Z :: {

  ex:component1 -> string }
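One way to read the rule as a line-by-line check, sketched in Python. This is an assumption-heavy simplification: it treats every { [ ] } character as a bracket (so it would misfire on URIs that contain square brackets), and `newline_violations` is a hypothetical name, not an existing tasl function.

```python
import re

# An empty product or coproduct written on one line, e.g. "{ }" or "[]".
EMPTY_PAIR = re.compile(r"\{[ \t]*\}|\[[ \t]*\]")


def newline_violations(source):
    """Return (line number, message) pairs for brackets that break the rule."""
    problems = []
    for number, raw in enumerate(source.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Empty pairs are the only brackets allowed to share a line.
        line = EMPTY_PAIR.sub("", line).strip()
        for index, char in enumerate(line):
            if char in "{[" and index != len(line) - 1:
                problems.append((number, "content after opening bracket"))
            elif char in "}]" and index != 0:
                problems.append((number, "content before closing bracket"))
    return problems


valid = """\
class ex:A :: { }
class ex:B ::      {

         }
class ex:C :: [

ex:option1

                          ex:option2

]
"""

invalid = """\
class ex:X :: { ex:foo -> string }
class ex:Y :: { ex:component1 -> string
}
class ex:Z :: {

  ex:component1 -> string }
"""

print(newline_violations(valid))
print(sorted({number for number, _ in newline_violations(invalid)}))
```

The valid sample produces no violations; the invalid sample is flagged on its first, second, and sixth lines.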

potential objections:

  • enums are a little more verbose because you can’t list all the options on one line
  • dramatically changes the feel of the language
  • ???

use regular left arrows for coproducts

I want to reverse my previous decision about <- vs >-.

previously:

type hello [ ex:a >- string; ex:b >- integer ]

now:

type hello [
  ex:a <- string
  ex:b <- integer
]

The only strong reason for >- over <- is that in some situations editors will auto-match <> brackets, and typing <- would “accumulate” a matching > character to the right of the cursor. This only happens by default when the language isn’t known (whenever we write a language extension for any IDE we’re able to tell it explicitly which sets of brackets to auto-close).

After thinking about it more, it doesn’t seem like this is a big enough concern to outweigh the simple fact that “<-” is just easier to remember. Products point right; coproducts point left. That’s it.


Feedback on any of this is welcome. Over the next few weeks I’ll be focusing on writing copy for tasl.io, moving the contents of the underlay/apg repo into one master underlay/tasl repo, and releasing new minor versions of all the libraries, so there’s no hard deadline but I would like to finalize these fairly soon.

Syntactic sugar causes cancer of the semicolon

eliminate syntactic sugar altogether and you end up with LISP

what I would like to have is a

homoiconic Abstract Syntax Graph Processing Language capable of expressing semantics and intent

I found this paper interesting. I have the impression that there is a fair amount of commonality of intent here

Hypothesis

DSLs, Intentional Software, Knowledge Graphs
are converging in my mind

Here’s an alternative path that I thought about, decided against, and am now reconsidering:

The way that some RDF-universe languages handle URIs is to have two ways of writing them:

  1. a bare/prefixed/compacted form prefix:slug, where prefix has been previously declared as a prefix in the document
  2. a delimited/absolute/expanded form <http://some.full/uri/including/weird#charaters!##[]?/?;;=,>

since < and > are among the few characters that can’t appear in any part of a URI. In this pattern, you can’t use prefixes inside angle brackets, and you can’t write a full URI without them. Having the angle bracket syntax justifies restricting the character set allowed in the slug in the compact form, since if people have URIs with characters outside that set they can still use them, just with the angle brackets.

Another nice thing about this is that it explicitly differentiates between prefixed things and not-prefixed things, which isn’t clear otherwise. e.g. just looking at some URI s3:foo/bar, it’s not locally clear whether it’s supposed to mean s3 as an actual URI scheme (which is relatively common) or whether s3 was previously declared as a namespace that expands to e.g. http://my-s3-bucket/ (which might also be relatively common). This is something that kind of annoys me about JSON-LD etc. It feels a little sloppy.

This format (having both prefixed and absolute URIs) would be nice for us because we could just limit slugs to be the unreserved charset from RFC 3986, meaning we could keep semicolons and avoid requiring newlines.

This all seemed really great to me for a while but I couldn’t figure out how to reconcile it with literal types (which is what we use angle brackets for right now). This made sense to me at the time and felt consistent in a satisfying way - { } were products, [ ] were coproducts, and < > were the primitives (<> for the URI type and <some:uri> for literal types). I felt a little bad about it being inconsistent with the way other RDF specs use angle brackets (ie for absolute URIs) but not terribly.

If angle brackets mean absolute URIs, then there are two options for literals:

  1. we find a different syntax altogether
  2. we re-use the same syntax. literal types are just datatype URIs after all, so we just say that a URI (in either compact or expanded form) is a valid type expression, and it means “the literal type with that datatype”

So valid uses of literals would look like this:

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

# a literal with datatype http://www.w3.org/2001/XMLSchema#string
type literal1 xsd:string

# a literal with datatype http://my.custom.domain?term=location
type literal2 <http://my.custom.domain?term=location>

This is ultra-convenient and ultra-simple but my big concern is that it’s too easy to confuse with reference types.

for example…

namespace ex http://example.com/

class ex:Person :: {}

class ex:Book :: {
  ex:author -> * ex:Person
}

class ex:Store :: {
  ex:owner -> ex:Person
}

here, the type of a book’s author is a pointer to a person, but the type of a store’s owner is a literal type with datatype http://example.com/Person. In the current version of tasl (ie the one on R0 right now) the Store class declaration would be invalid syntax - bare URIs aren’t valid type expressions (if you want a literal type you have to explicitly say so with angle brackets, and if you want a reference type you have to start it with a * and then you can write a URI).

I can’t really tell how big of a deal this is. I really don’t like how such a tiny and easily-overlooked typo can change the semantics of the whole type, and I especially don’t like that it’s a really hard error to catch since "a literal type with datatype http://example.com/Person" is a totally valid thing to say. If a person did make this error, it’s not really clear how or where they’d discover it. :confused:

Then again, maybe it’s not that big of a deal. In other languages, if you accidentally type string instead of *string then… that’s considered an “acceptable” error. idk.

I’d really like to know if other people think this is a concern or not.

Thinking about this more - I’m now leaning towards switching URIs to the two-format scheme mentioned above and keeping semicolons

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

class ex:Person :: {
  <http://schema.org/name> -> ? <http://www.w3.org/2001/XMLSchema#string>;
  ex:name -> xsd:string;
  ex:bar -> [ ex:option1; ex:option2 <- boolean ];
}

I think there is still a case for eliminating semicolons and requiring newlines on its own merits (ie independent of URI tokenization concerns); I guess it’s fundamentally about how the language should feel. Should it feel blocky or snakey?

For reference, here’s the same sample without semicolons

namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

class ex:Person :: {
  <http://schema.org/name> -> ? <http://www.w3.org/2001/XMLSchema#string>
  ex:name -> xsd:string
  ex:bar -> [
    ex:option1
    ex:option2 <- boolean
  ]
}

Having let this sit for a while, I feel more confident about

  1. incorporating the two different ways of writing URIs for class/component/option keys - an “absolute form” <http://some.full/uri> and a “compact form” namespacePrefix:suffix, where suffix can only be [A-Za-z0-9\-._~]+.
  2. requiring newlines and eliminating semicolons
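A minimal Python sketch of what the two forms could look like to a tokenizer. The suffix character class comes from item 1; the prefix pattern (`[A-Za-z][A-Za-z0-9]*`) is an assumption, since the thread doesn't pin it down, and `expand` is a hypothetical helper.

```python
import re

# Absolute form: anything between angle brackets that isn't whitespace or a bracket.
ABSOLUTE = r"<(?P<absolute>[^<>\s]+)>"
# Compact form: assumed prefix charset, then the unreserved-only suffix.
COMPACT = r"(?P<prefix>[A-Za-z][A-Za-z0-9]*):(?P<suffix>[A-Za-z0-9\-._~]+)"
URI_TERM = re.compile(f"{ABSOLUTE}|{COMPACT}")


def expand(term, namespaces):
    """Resolve a URI term (in either form) to a full URI string."""
    match = URI_TERM.fullmatch(term)
    if match is None:
        raise ValueError(f"not a URI term: {term!r}")
    if match["absolute"] is not None:
        return match["absolute"]
    return namespaces[match["prefix"]] + match["suffix"]


namespaces = {"ex": "http://example.com/"}
print(expand("ex:Person", namespaces))
print(expand("<http://schema.org/name>", namespaces))

# With the restricted suffix, a trailing semicolon is no longer part of the token.
print(URI_TERM.match("ex:Person;").group(0))
```

Under these assumptions something like s3:foo/bar is rejected as a compact term (the slash is outside the suffix charset), so it has to be written as <s3:foo/bar>, which also resolves the scheme-or-prefix ambiguity mentioned earlier.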

I still feel a little uncomfortable with literal types. The natural choice is to have them be “bare URIs, in either compact or expanded form”, but I keep thinking I’d rather have some kind of additional syntax around them. I don’t have any good ideas though.

I’m also rethinking the namespace declaration syntax. Should the “value” of a namespace be written using the same expanded URI syntax - ie wrapped in angle brackets? I can see a case for (“have a consistent way of writing URIs everywhere they appear”) and a case against (“namespaces are prefixes, which are not exactly URIs themselves; angle brackets should be used to point to specific terms, which the namespace itself doesn’t per se”).

Something that complicates the namespace question - again a thing I didn’t discover until writing a formal grammar - is that writing bare namespace URIs (without angle brackets) can get confused with comments. For example, how should this be parsed?

# this is definitely a comment
namespace foo http://example.com/foo/ # this is definitely another comment
namespace bar http://example.com/bar# is this a comment?
namespace baz http://example.com/baz/# is this a comment??
namespace xsd http://www.w3.org/2001/XMLSchema## and is this a comment??

We could definitely pick some exact rule for how to handle these cases. I think the right thing would be to say that the namespace URI token must end in exactly one of [/?#], so in the last example we would get a token http://www.w3.org/2001/XMLSchema# followed by a comment token # and is this a comment??

But it’s a weird situation and is liable to cause bugs if other people implement parsers, which is absolutely something we should be planning for. Most languages don’t let their line comment token appear outside of some delimited context (like strings).

The “safer” thing to do is to wrap namespace values in angle brackets, like this

# this is definitely a comment
namespace foo <http://example.com/foo/> # this is definitely another comment
namespace bar <http://example.com/bar#> is this a comment?
namespace baz <http://example.com/baz/># is this a comment??
namespace xsd <http://www.w3.org/2001/XMLSchema#># and is this a comment??

where now we can clearly tell where the token boundaries are supposed to be. For reference, ShEx does this and uses colons:

PREFIX school: <http://school.example/#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex: <http://ex.example/#>

… and Turtle does this, and uses colons (and line stoppers):

@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .

I don’t really know how this makes me feel though, since you could also say that if we use angle brackets that we should then also use colons to make it consistent with the other RDF languages. But then the namespace declaration syntax is just loaded with syntax, when we started out so pure! Compare

namespace rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

to

namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

the bare URI looks so nice! But I don’t know if it’s worth the danger of colliding with comments.
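For comparison, here is a Python sketch of parsing the angle-bracket form (without colons). With a closing > the end of the URI is never in doubt, so whatever follows is either a comment or an error. The pattern and the name `parse_namespace` are assumptions for illustration, not the grammar I actually have.

```python
import re

NAMESPACE = re.compile(
    r"^namespace\s+(?P<prefix>\S+)\s+<(?P<uri>[^<>\s]*)>(?P<rest>.*)$"
)


def parse_namespace(line):
    """Return (prefix, uri, comment) for an angle-bracketed namespace line."""
    match = NAMESPACE.match(line)
    if match is None:
        raise ValueError(f"not a namespace declaration: {line!r}")
    rest = match["rest"].strip()
    # Anything after the closing bracket must be a line comment.
    if rest and not rest.startswith("#"):
        raise ValueError(f"unexpected text after namespace URI: {rest!r}")
    return match["prefix"], match["uri"], rest or None


print(parse_namespace("namespace foo <http://example.com/foo/> # this is definitely another comment"))
print(parse_namespace("namespace xsd <http://www.w3.org/2001/XMLSchema#># and is this a comment??"))

try:
    parse_namespace("namespace bar <http://example.com/bar#> is this a comment?")
except ValueError as error:
    print("rejected:", error)
```

The trailing # on the xsd line is unambiguously a comment here, and the bar line is rejected instead of being silently misread.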

Some other ways out of this:

  1. change the line comment token to something else (bad idea i think)
  2. require that comments be on their own line with nothing else on it (maybe???)
  3. say that just namespace declarations can’t have comments (weird)

thoughts appreciated

Yet another self-consistent path would be to say this:

  • newlines required, eliminate semicolons
  • every URI must come from an explicitly declared namespace
    • there’s no “expanded form”
    • URI terms look like prefix:suffix, where now suffix can be any character in the pchar class from RFC 3986 ([A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}), plus / and ?
  • line comments # fjdkas... can only appear on their own line
  • literal types are written as a URI in angle brackets <prefix:suffix>

Some general comments about the situation:

  • I expect people to (intend to) write literal types very rarely
  • the angle brackets are the ugliest of the brackets and it would be nice to minimize them
  • tasl is fundamentally “all about URIs” and making them both easy and understandable should be the highest design goal. In other words, tasl’s biggest design challenge is grappling with the complexity of URIs.
  • URIs can include lots of characters - basically everything except whitespace, double-quotes, and angle brackets

Supporting both absolute and compact forms seems useful and welcome to those who might expect it. Supporting angle brackets doesn’t seem ugly to me; it’s a convenience for just this use case built into the URI spec.

I’m still ambivalent about requiring newlines; it’s a bit restrictive. Same-line comments help greatly w/ visibility (esp. if newlines are required!), so I would keep them.

The simplicity & compatibility of this approach appeals.
I’m not sure how important avoiding this particular confusion would be.

Why not just expect the URI token to end at the next non-URI char?

Along the above lines, I’d say in the first instance [no angle brackets] the first two are comments, the next three are not. In the second instance [w/ <>], which I prefer for its clarity, all but the middle example are comments.

yeah, me neither honestly. But I just know that somebody will accidentally write

namespace ex http://example.com/

class ex:Person :: {
  ex:name -> string
  ex:mother -> ex:Person
}

… since that’s actually the most natural thing to write / that’s what you’d guess. And it will be valid tasl, hard for them to figure out why their schema doesn’t work, etc.
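If bare URIs did become literal types, one mitigation would be a lint that warns when a bare URI used as a type happens to match a declared class. The Python sketch below is purely hypothetical (no such check exists anywhere); the line-based patterns assume one component per line and ignore nested types.

```python
import re

CLASS_DECL = re.compile(r"^\s*class\s+(\S+)")
# A component or option whose whole type is a single bare term (optionally "?").
COMPONENT = re.compile(r"^\s*\S+\s+(?:->|<-)\s+(?:\?\s+)?(\S+)\s*$")


def suspicious_literals(schema):
    """Return (line number, uri) pairs where a bare type names a declared class."""
    lines = schema.splitlines()
    classes = {m.group(1) for m in map(CLASS_DECL.match, lines) if m}
    warnings = []
    for number, line in enumerate(lines, start=1):
        match = COMPONENT.match(line)
        if match and match.group(1) in classes:
            warnings.append((number, match.group(1)))
    return warnings


schema = """\
namespace ex http://example.com/

class ex:Person :: {}

class ex:Book :: {
  ex:author -> * ex:Person
}

class ex:Store :: {
  ex:owner -> ex:Person
}
"""

print(suspicious_literals(schema))
# [(10, 'ex:Person')]
```

It only catches the case where the datatype URI collides with a class declared in the same schema, which is the typo I’m worried about; it says nothing about whether a literal type is otherwise sensible.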

in the grammars that I have now, I’m pretty strict about what namespace URIs are allowed to be. They’re specifically URIs that end with an empty query component ?, an empty fragment component #, or an empty path segment / - and not just “a URI that ends with one of the characters [/?#]”, but actually parsing the full URI structure (ie not allowing http://example.com#ns/).

Under these rules, there are two choices for parsing namespace http://example.com## hello world - either it doesn’t parse, or it parses as ["namespace", "http://example.com#", "# hello world"].

It feels wrong to have a language where appending the comment token to the end of a line might or might not actually make a comment. I can’t think of any examples of this in existing languages. Then again I don’t think there are examples of line-only comments either, but I’d rather err on the side of restricted consistency than permissive inconsistency. We can always add stuff back in later versions.


yet another galaxy-brain option is to not have a syntax for literal types at all, and make people declare their literal datatypes with a special kind of statement, similar to namespaces. So instead of this:

namespace ex http://example.com/
namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

type Person {
  ex:foo -> rdf:JSON
}

… to make people do this:

namespace ex http://example.com/
namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

# `literal` (or `datatype` or w/e)
# as a new top-level keyword statement *only* used 
# for binding literal datatypes to type variables
literal JSON rdf:JSON

type Person {
  ex:foo -> JSON
}

this is actually kind of cool because “what custom literals, if any, does this schema use?” is one of the most important/relevant questions people will have about a schema, and it would actually be awesome to have it highlighted like this. Probably not a good idea overall but worth stewing on.

i kind of like the galaxy brain version, namely because it forces you to understand a concept that might otherwise be quite easy to overlook + also makes the act of using a custom literal a lot more intentional. e.g. eliciting the question ‘is this really a good custom literal?’ (b/c using one requires explicit declaration)

I’m also liking the explicit statement definition for literal types; I think we should move forward with it.

namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
literal JSON rdf:JSON

it’s a little repetitive but I think it’s the right thing.
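A quick Python sketch of what resolving these statements might involve: each literal statement binds a type variable to a datatype URI, expanded through a previously declared namespace. The function name `resolve_literals` and the whitespace-splitting approach are assumptions for illustration only.

```python
def resolve_literals(schema):
    """Map each declared literal name to its fully expanded datatype URI."""
    namespaces = {}
    literals = {}
    for line in schema.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        keyword, name, value = parts
        if keyword == "namespace":
            namespaces[name] = value
        elif keyword == "literal":
            # Raises KeyError if the prefix has not been declared yet.
            prefix, _, suffix = value.partition(":")
            literals[name] = namespaces[prefix] + suffix
    return literals


schema = """\
namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
literal JSON rdf:JSON
"""

print(resolve_literals(schema))
# {'JSON': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#JSON'}
```

The returned mapping is also a direct answer to “what custom literals, if any, does this schema use?”.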


One thing I’ve been thinking about more is that it probably makes sense to include a “multi-valued property” statement.

We have the edge declaration syntax

edge ex:Friendship :: ex:Person => ex:Person

… which expands to

class ex:Friendship :: {
  ul:source -> * ex:Person
  ul:target -> * ex:Person
}

… but really this is just a degenerate case of a multi-valued property. e.g. if we wanted to model a person with zero or more names, we have to write out

class ex:Person :: {}
class ex:Person/name :: {
  ex:source -> * ex:Person
  ex:value -> string
}

… which is essentially the same as the edge case. In other words, “edges are just multi-valued properties where the value happens to be a reference to another class”. I still think we should have an edge statement but it’s strange to have that but not a syntax for general multi-valued properties, which are just as (if not more) common.

What keyword should this statement use? Unfortunately there’s no single word for “multi-valued property”. I think we could honestly just use property:

class ex:Person :: {}

property ex:Person/name :: ex:Person => string

# this expands to
# class ex:Person/name :: {
#   ul:source -> * ex:Person
#   ul:value -> string
# }

and just let the double-ness of the arrow => suggest the multi-valued-ness.
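The expansion in the comment above could be sketched in Python like this. The pattern is an assumed reading of the proposed statement (value type on a single line), and `expand_property` is a hypothetical name.

```python
import re

# Assumed grammar: property KEY :: SOURCE => VALUE_TYPE
PROPERTY = re.compile(
    r"^property\s+(?P<key>\S+)\s+::\s+(?P<source>\S+)\s+=>\s+(?P<value>.+?)\s*$"
)


def expand_property(statement):
    """Expand a property statement into the class declaration it stands for."""
    match = PROPERTY.match(statement.strip())
    if match is None:
        raise ValueError(f"not a property statement: {statement!r}")
    return "\n".join([
        f"class {match['key']} :: {{",
        # The source is always a reference to the owning class.
        f"  ul:source -> * {match['source']}",
        # The value is whatever type expression follows the arrow.
        f"  ul:value -> {match['value']}",
        "}",
    ])


print(expand_property("property ex:Person/name :: ex:Person => string"))
```

The only difference from the edge expansion is that the right-hand side is an arbitrary type that lands in `ul:value`, instead of a class key that becomes a `ul:target` reference.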