This doc leaves a few threads dangling and likely needs more polish, so I’ll consider it a draft for now and continue to revisit it.
When trying to build out a Collection MVP, we wind up revisiting many of the ideas and terms that have been with us since the early days of Underlay thinking. Things like Assertions and ‘transactions’ come up. One thing that has become increasingly evident is that creating a single Assertion (or the spec for a single assertion) is the relatively trivial part. The difficult thing is figuring out how to map and incorporate a set of assertions into a self-consistent lifecycle.
We could, theoretically, create a Collection right now using the older ideas we had for a collection.json file and an assertion-as-rdf-dataset. However, once it’s created, you’re rather stuck. We don’t know how to update it, change the schema, process it, or meaningfully inspect it. There is no lifecycle we can describe, and thus it’s not necessarily any better than a statically uploaded .zip file.
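For concreteness, here’s a minimal sketch of what that older approach might have looked like, written as a TypeScript shape. Every field name here is hypothetical; this is the idea, not a settled spec.

```typescript
// Hypothetical shape of the older collection.json idea.
// All field names are illustrative, not a settled spec.
interface CollectionManifest {
  name: string;         // e.g. "example-org/example-collection"
  version: string;      // a one-off, static version
  schema: string;       // path or URI to the collection's schema
  assertions: string[]; // paths/URIs to assertion files (RDF datasets)
}

const manifest: CollectionManifest = {
  name: "example-org/example-collection",
  version: "0.1.0",
  schema: "schema.tasl",
  assertions: ["assertions/initial.nq"],
};
```

This gets you a published artifact, but nothing about it tells you what can happen to it next, which is exactly the problem described above.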
For me, the exciting parts of the Underlay were the lifecycle capabilities it described. Trying to enumerate these, I came up with the following list:
- Collaborative contributions from multiple (potentially crowdsourced) authors
- Meta-collections formed from multiple ‘child’ collections
- Knowing where certain elements in a dataset came from (metadata about when it was added, by whom, and from what source)
- Permanent versioning
- Models for representing and handling disagreements within the data
These features are not only exciting to build with; they also feel necessary for getting to the more egalitarian, unsiloed data world we all want. Increasingly, the foundations that seem to make them possible are 1) strict typing and 2) explicit curation. Without these, merging and fusing data (across versions or from multiple sources) is not only painful and manual, but often totally intractable. Such barriers lead to the status quo, where the only reasonable way to get something done is to build your own private data silo, where you can dictate the decisions and structures that keep your data consistent.
Joel’s work on tasl gives us a pathway towards strict typing and mappings; however, we still have many questions about how ‘curation’ works. How are decisions made? How are they represented? How do they fit into a day-to-day process of curating a dataset? For this, we’ve found it helpful to think about the general pipeline of data on R1. Here’s how that looks in my mind:
At either extreme, you have sources and destinations. In the middle, you have a static, published Collection version. In between, you have a series of processes that can be picked and ordered depending on the author’s goals and the shape described by the Collection’s schema. Note that each Underlay schema may fit a number of data models (e.g. relational, property graph, RDF, column store), and which data models it fits can be statically determined. For example, a schema may describe a shape that maps naturally onto the relational data model. In such cases, table-based visualization and edit widgets could be presented and used through R1.
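To make the ‘statically determined’ claim concrete, here’s a hedged TypeScript sketch. The types and the check are mine, not tasl’s actual API (tasl’s data model is richer); the idea is that a schema whose properties are all literals or references maps naturally to tables and foreign keys, while anything else would need extra encoding.

```typescript
// Illustrative schema types, loosely inspired by tasl; not its actual API.
type Primitive = "string" | "integer" | "boolean" | "dateTime";

type Property =
  | { kind: "literal"; datatype: Primitive }    // a plain typed column
  | { kind: "reference"; target: string }       // a foreign key to another class
  | { kind: "coproduct"; options: Property[] }; // a sum type: no direct SQL column

interface Schema {
  classes: Record<string, Record<string, Property>>;
}

// A schema fits the relational model when every property maps directly to a
// column: literals become typed columns, references become foreign keys.
// Sum types (coproducts) would need encoding, so they disqualify the schema.
function fitsRelationalModel(schema: Schema): boolean {
  return Object.values(schema.classes).every((properties) =>
    Object.values(properties).every((p) => p.kind !== "coproduct")
  );
}

const example: Schema = {
  Person: {
    name: { kind: "literal", datatype: "string" },
    employer: { kind: "reference", target: "Organization" },
  },
  Organization: {
    name: { kind: "literal", datatype: "string" },
  },
};

// true, so R1 could surface table-based visualization and edit widgets.
console.log(fitsRelationalModel(example));
```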
I believe that everything outside of the middle circle is R1. That is, the circle in the middle is described by the more generic Underlay protocol, and all the blocks around it are the tooling that R1 provides. R1’s efforts are focused on giving people a set of tools and processes to produce (and work with) Underlay collections whose source, trustworthiness, and origin story are richly accessible and inspectable.
That said, I think we can begin by focusing on a few simple pipelines, knowing that other processes and UX can be built as we grow. Here are some example pipelines that map a few of the often-discussed workflows:
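As a stand-in for diagrams, here are two of those pipelines sketched as ordered lists of step names in TypeScript. Every step name is a placeholder for whatever tooling R1 ends up providing, not a committed interface.

```typescript
// Two hedged example pipelines, written as ordered lists of step names.
// Step names are placeholders, not committed R1 tooling.
const csvImportPipeline = [
  "upload:csv",        // source: a static file from one author
  "map:csv-to-schema", // align columns with the collection's schema
  "validate:schema",   // reject rows that don't type-check
  "publish:version",   // produce the static, published Collection version
];

const metaCollectionPipeline = [
  "import:child-collections", // pull in multiple published 'child' collections
  "merge:by-schema-mapping",  // fuse them using explicit schema mappings
  "review:curator-approval",  // an explicit curation step before anything ships
  "publish:version",
];
```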
Open Questions
There are many open questions this draft doesn’t address. Many of them we can talk through now, but for many I suspect we’ll have a better hunch once we start working with collections directly.
One area of questions focuses on how we capture pipeline information. I see huge value in having a process doc: essentially the audit log of everything that happened (what file was uploaded, using what tool, how it was then cleaned, and so on). However, I’m unclear on when this information should live in the graph itself as provenance, versus just living in R1’s database. Alternatively, such a process doc could live in the collection.json or collection files directly (the same question pops up for Discussions on R1: should they find their way into collection versions?).
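As a strawman for what such a process doc could contain (all field names here are hypothetical):

```typescript
// Strawman shape for a process doc: an append-only audit log of pipeline
// events. All field names are hypothetical, not a settled format.
interface ProcessEvent {
  timestamp: string;              // when the step ran
  actor: string;                  // who triggered it
  tool: string;                   // e.g. "csv-upload" or "dedupe"
  params: Record<string, string>; // tool-specific configuration
  outputVersion: string;          // identifier of the version produced
}

// Whether this log lives in the graph as provenance, in R1's database, or
// alongside collection.json is exactly the open question above.
const processDoc: ProcessEvent[] = [
  {
    timestamp: "2021-03-01T12:00:00Z",
    actor: "author@example.org",
    tool: "csv-upload",
    params: { file: "people.csv" },
    outputVersion: "v1",
  },
];
```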
Capturing the history of pipeline steps plus the collection versions that they produce is a really interesting prospect to me. It hints at dbdb again, where you have explicit snapshots of data connected by (in dbdb’s case) an explicit script.
I don’t think we need to have answers to these now. We’ll likely want to let usage dictate some of them.