[Draft] R1 Collections and Pipelines

This doc leaves a few threads dangling and likely needs more polish, so I’ll consider it a draft for now and continue to revisit it.

When trying to build out a Collection MVP, we wind up revisiting many of the ideas and terms that have been with us since the early days of Underlay thinking. Things like Assertions and ‘transactions’ come up. One thing that has become increasingly evident is that creating a single Assertion (or the spec for a single assertion) is the relatively trivial part. The difficult thing is figuring out how to map and incorporate a set of assertions into a self-consistent lifecycle.

We could, theoretically, create a Collection right now using the older ideas we had for a collection.json file and an assertion-as-rdf-dataset. However, once it’s created, you’re rather stuck. We don’t know how to update it, change the schema, process it, or meaningfully inspect it. There is no lifecycle we can describe, and thus it’s not necessarily any better than a statically uploaded .zip file.

For me, the exciting parts of the Underlay were the lifecycle capabilities it described. Trying to enumerate these, I came up with the following list:

  • Collaborative contributions from multiple (potentially crowdsourced) authors
  • Meta-collections formed from multiple ‘child’ collections
  • Knowing where certain elements in a dataset came from (metadata about when it was added, by whom, and from what source)
  • Permanent versioning
  • Models for representing and handling disagreements within the data

These are not only the features that sound exciting to build with; they also feel necessary for getting to the more egalitarian, unsiloed data world we all want. Increasingly, the foundations that seem to make these possible are 1) strict typing and 2) explicit curation. Without these, merging and fusing data (across versions or from multiple sources) is not only painful and manual, but often totally intractable. Such barriers lead to the status quo, where the only reasonable way to get something done is to build your own private data silo, where you know you can dictate the decisions and structures that keep your data consistent.

Joel’s work on tasl gives us a pathway towards strict typing and mappings; however, we still have many questions about how ‘curation’ works. How are decisions made? How are those decisions represented? How do they fit into a day-to-day process of curating a dataset? For this, we’ve found it helpful to think about the general pipeline of data on R1. Here’s how that looks in my mind:

At either extreme, you have sources and destinations. In the middle you have a static, published Collection version. In between, you have a series of processes that can be picked and ordered depending on the author’s goals and the shape described by the Collection’s schema. Note that each Underlay schema may fit a number of data models (e.g. Relational, Property Graph, RDF, Column Store, etc) - and which data models it fits can be statically determined. For example, a schema may describe a shape that can be naturally mapped onto the relational data model. In such cases, table-based visualization and edit widgets could be presented and used through R1.
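To make the “statically determined” claim concrete, here’s a minimal sketch of what such a check could look like. The schema representation and the `fitsRelational` function are invented for illustration — the real tasl/APG type definitions would replace them:

```typescript
// Toy schema representation, purely illustrative — not the actual tasl/APG types.
type Value =
  | { kind: "primitive"; name: string }
  | { kind: "reference"; class: string }
  | { kind: "coproduct"; options: Value[] };

// class name -> property name -> value type
type Schema = Record<string, Record<string, Value>>;

// A schema maps naturally onto the relational model when every property is a
// primitive column or a foreign-key reference to another class in the schema;
// sum types (coproducts) have no direct relational analogue.
function fitsRelational(schema: Schema): boolean {
  return Object.values(schema).every((props) =>
    Object.values(props).every(
      (v) =>
        v.kind === "primitive" ||
        (v.kind === "reference" && v.class in schema)
    )
  );
}

// Fits: two classes, primitives plus a foreign key.
const flat: Schema = {
  person: {
    name: { kind: "primitive", name: "string" },
    employer: { kind: "reference", class: "org" },
  },
  org: { name: { kind: "primitive", name: "string" } },
};

// Doesn't fit: a property whose value is a sum type.
const sum: Schema = {
  event: {
    payload: {
      kind: "coproduct",
      options: [{ kind: "primitive", name: "string" }],
    },
  },
};
```

A check like this is what would let R1 decide, at publish time, whether to offer table-based visualization and edit widgets for a given collection.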

I believe that everything outside of the middle circle is R1. That is, the circle in the middle is described by the more generic Underlay protocol, and all the blocks around it are the tooling that R1 provides. R1’s efforts are focused on giving people a set of tools and processes to produce (and work with) Underlay collections whose source, trustworthiness, and origin story are richly accessible and inspectable.

That said, I think we can begin by focusing on a few simple pipelines, knowing that other processes and UX can be built as we grow. Here are some example pipelines that map some of the often-discussed workflows:

Open Questions

There are many open questions this draft doesn’t address. Many of them we can talk through, but many I feel we’ll have a better hunch for once we start working with collections directly.

One area of questions focuses on how we capture pipeline information. I see huge value in having a process doc, essentially the audit log of everything that happened - what file was uploaded, using what tool, and then cleaned using X, etc. However, I’m unclear on when this information should live in the graph itself through provenance, versus just living in R1’s database. Alternatively, such a process doc could live in the collection.json or collection files directly (the same question pops up for whether we want Discussions on R1 to find their way into collection versions).
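Wherever it ends up living, the process doc itself could be quite simple. Here’s a hypothetical shape for one — every field name below is illustrative, not a spec:

```typescript
// Hypothetical shape for one entry in a collection's "process doc" — the audit
// log of everything that happened. Field names are illustrative, not a spec.
interface ProcessStep {
  timestamp: string; // ISO 8601
  actor: string;     // the R1 user (or service) that performed the step
  action: "upload" | "map" | "clean" | "edit" | "publish";
  tool?: string;     // e.g. which mapping or cleaning tool was used
  inputs: string[];  // content hashes / URIs of the step's inputs
  output: string;    // content hash of what the step produced
}

// A minimal log: a CSV upload followed by a schema mapping. Each step's
// inputs reference earlier outputs, so the chain is inspectable end to end.
const processDoc: ProcessStep[] = [
  {
    timestamp: "2021-01-05T10:00:00Z",
    actor: "alice",
    action: "upload",
    inputs: [],
    output: "hash-of-raw-csv",
  },
  {
    timestamp: "2021-01-05T10:05:00Z",
    actor: "alice",
    action: "map",
    tool: "csv-mapper",
    inputs: ["hash-of-raw-csv"],
    output: "hash-of-mapped-instance",
  },
];
```

Because each entry points at content hashes, the same record could live in R1’s database first and migrate into collection versions (or provenance) later without changing shape.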

Capturing the history of pipeline steps plus the collection versions that they produce is a really interesting prospect to me. It hints at dbdb again, where you have explicit snapshots of data connected by (in dbdb’s case) an explicit script.

I don’t think we need to have answers to these now. We’ll likely want to let usage dictate some of them.


This is great!

:100::100::100:

I like this a lot, although I’d add that some of the sources and destinations are tunnels to other registries, and those interregistry pipes are part of the Underlay. I think of that inter-registry federation network as the (current lol) goal of “the Underlay”; the data model and typed pipes are the foundation for that infrastructure.

The diagrams are awesome. Seeing this visually is so compelling - we’ve been talking forever about “making process explicit” and whatnot, and this feels like we’re taking it literally for the first time. Manual review is a real, literal node in a pipeline. You drag it around. You send its output to something else.

I think we start by building it just in R1’s database, and then slowly incorporate it into collections proper over time (the fact that no other registries exist buys us a lot of time for this 🥲). I want to be protective and conservative about the Underlay protocol part; if we get too far ahead of ourselves there, we’ll regret it later.

Some more general comments:

Fleshing out the pipeline abstraction is going to be tricky. Lots of these things that feel generally like steps might be difficult to fit into a common interface. Take “table editing” for example - it feels like you should be able to make a pipeline

[csv file source] -> [csv mapping] -> [table editor] -> [collection]

or maybe if the pipeline interface is that the pipes only carry schema’d data, we would have to combine some of them

[csv mapping from file] -> [table editor] -> [collection]

So to use it, we upload a file into the first node, and it populates the editor. Then we make our manual edits and click “publish version”. Does the table editor node save its state, so that we can go back, make additional edits, and publish more versions? I guess that seems like the natural thing. To me the aesthetic of “pipelines” has a general connotation of being reproducible, but I guess that will just be limited to certain pipes. Different pipes have different properties like “functional” (or maybe “real-time” in the future), and different arrangements of pipes will preserve some of those properties and not others.

I guess my point is that it’s going to be easy for “pipes” to metastasize into “arbitrary programs”, and the tighter the interface we can tie around them, the better.
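One way to keep that interface tight is to have each pipe declare its properties up front, and compute the composite’s properties mechanically — e.g. a pipeline is only “functional” (reproducible) if every stage is. A sketch, with names invented for illustration:

```typescript
// Sketch: each pipe declares whether it is "functional" (same input always
// yields the same output); a composed pipeline is functional only if every
// stage is. Names here are invented for illustration.
interface PipeMeta {
  name: string;
  functional: boolean; // false for e.g. a manual table editor
}

function composedMeta(stages: PipeMeta[]): PipeMeta {
  return {
    name: stages.map((s) => s.name).join(" -> "),
    // true only when all stages are functional (and for the empty pipeline)
    functional: stages.every((s) => s.functional),
  };
}

// A mapping step is reproducible; manual edits are not, so the whole
// pipeline loses the "functional" property.
const pipeline = composedMeta([
  { name: "csv mapping", functional: true },
  { name: "table editor", functional: false },
]);
```

The same trick could extend to other properties (“real-time”, “pure”, etc.) with different propagation rules per property.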

Literally, a typescript interface like

interface Pipe<
  Input extends null | APG.Schema,   // null for pipes with no schema'd input (sources)
  Output extends null | APG.Schema,  // null for pipes with no schema'd output (destinations)
  InternalState
> {
  initialState: InternalState;
  // consume an input instance, producing a new state and an output instance
  next(state: InternalState, input: APG.Instance<Input>): [InternalState, APG.Instance<Output>];
  // the UI widget rendered for this pipe node (e.g. the table editor)
  component: React.FC<{state: InternalState}>;
}

or whatever makes sense. Maybe picking a pretty minimal-but-also-somehow-representative set of initial pipes and designing a little interface for them should be our first priority?
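For what it’s worth, here’s a toy pipe conforming to a stripped-down version of that interface. The `Instance` type is a local stand-in (the real tasl/APG definitions would replace it), and the `component` field is omitted so the sketch stays self-contained:

```typescript
// Local stand-in for APG.Instance — illustrative only.
type Instance = Record<string, unknown>[];

// A stripped-down Pipe: state and instance in; new state and instance out.
interface Pipe<State> {
  initialState: State;
  next(state: State, input: Instance): [State, Instance];
}

// A "table editor" pipe whose internal state is the set of pending manual
// edits, keyed by row index. Re-running it on the same input replays the
// same edits, so the output is recoverable from (state, input).
const tableEditor: Pipe<Map<number, Record<string, unknown>>> = {
  initialState: new Map(),
  next(edits, input) {
    const output = input.map((row, i) => ({ ...row, ...(edits.get(i) ?? {}) }));
    return [edits, output];
  },
};

// One manual edit: fix the name in row 0, leave other columns alone.
const edits: Map<number, Record<string, unknown>> = new Map([
  [0, { name: "Ada Lovelace" }],
]);
const [, rows] = tableEditor.next(edits, [{ name: "A. Lovelace", born: 1815 }]);
```

Modeling the editor as (state, input) → output like this is also what would answer the “does the node save its state?” question above: publishing another version is just calling `next` again with the accumulated edits.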

I sort of meant to capture that in the single loop-back - but that was asking a lot from that one line. Here’s a more explicit representation of what I was trying to get at, which I think lines up with what you’re saying.

Great - this also feels much faster from a dev point of view as well.

The best-case example of something like this might be the model of connect middleware. If we had a simple contract (you can expect req, res, and next and we expect req and res back) based around a small number of primitive objects, that might be the right amount of flexibility and standardization.
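To illustrate the analogy, here’s the connect idea boiled down to its essentials — each step receives a shared context plus a `next` callback, mutates the context, and forwards. (This is a simplified analogue, not connect’s actual `req`/`res` API; all names are made up.)

```typescript
// A connect-style contract, simplified to one shared context object: each
// step receives the context plus `next`, mutates it, and forwards.
type Ctx = { data: Record<string, unknown>[] };
type Middleware = (ctx: Ctx, next: () => void) => void;

// Run steps in order; each step decides whether to call next().
function run(ctx: Ctx, steps: Middleware[]): Ctx {
  const dispatch = (i: number): void => {
    if (i < steps.length) steps[i](ctx, () => dispatch(i + 1));
  };
  dispatch(0);
  return ctx;
}

// Two tiny cleanup "pipes" sharing one context, connect-style.
const result = run({ data: [{ name: " ada " }, { name: "" }] }, [
  (ctx, next) => {
    // normalize: trim whitespace from every name
    ctx.data = ctx.data.map((r) => ({ ...r, name: String(r.name).trim() }));
    next();
  },
  (ctx, next) => {
    // drop rows with empty names
    ctx.data = ctx.data.filter((r) => r.name !== "");
    next();
  },
]);
```

The appeal is exactly the small contract: any step that respects the (ctx, next) shape can be inserted anywhere, which is the amount of flexibility-plus-standardization being suggested here.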

I like this a lot.

LinkedPipes ETL offers one visual vocabulary for pipe steps, where sources and destinations (including any side-effect steps that produce an output or connect to an external interface) can appear anywhere in a pipeline network: https://demo.etl.linkedpipes.com/

They have a changelog for changes to any part of a named pipenet. We should implement something like this, so there can be discussions about steps in the changelog.