Provenance example: a scraped recipe collection

Context for this example:

  • Provenance is open ended and recursive
  • It should be easy to state provenance, and to find + traverse such statements
  • A realistic example helps visualize what is happening as multiple people interact with a collection
  • Uses:
    • Curators may use this to verify and correct collections
    • Others may use this to attempt to replicate a process or conclusion
    • Overlays may use this to apply their own source analysis in deciding what to show or rely on

At a minimum, this should include:
: publisher - who published this to the registry (signatory)
: authors - the proximal source for the collection as a whole, and for its elements
: sources - where the author isn’t the primary source, what other sources were used†
: place + time - where possible, capture this context (e.g. within R1: timestamps, network addresses)
: revision history - where possible, starting w/ other revisions of the same collection

† Or a positive indicator that the author is the primary source.

The Recipe Magnet collection

  • Bo has been scraping recipes from two websites (Wi) into a collection. (v1 published as CL1.1 on Bo’s site at URx)
  • Bo discovers a neat ‘Recipe Magnet’ scraper script (S, URs) maintained by Aya, and uses it to generate a new revision (CL2.1, URy).
  • Cas, working with Bo, uploads CL2.1 to R1.

Some elements of data- and process-provenance we might care about in the future, for the entire collection:

  1. R1 automatically captures some registry-specific publication data:

    • Who published the collection to R1, when, where, past versions. (+ creation date?)
    • (User:Cas, now, from 10.0.0.42 using r1-update, link to CL1?)
  2. A simple cite-web schema could include data on where something was found on the web.

    • Author, source URL, source publication date, retrieved-on date.
      (Bo, URy, last month, now)
    • There are many citation schemas to draw on.
  3. A cite-process schema could include data on how it was produced.

  4. Aya could define a custom S-evocation schema for capturing their evocation of S (which can be used/referenced in the cite-process template).

    • Script URL, version, config; target websites [(name, URL)]
      (S, 0.8.12, param-string, W[])
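To make the four layers concrete, here is a rough Python sketch of how they might nest for CL2.1. The class and field names are my own illustrative assumptions, not a proposed schema. The values are the placeholders used above, and the target names and URLs (W1, W2, ...) are purely hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class RegistryPublication:
    """Layer 1: captured automatically by the registry on upload."""
    publisher: str
    published_at: str
    network_address: str
    client: str
    previous_version: Optional[str] = None


@dataclass
class CiteWeb:
    """Layer 2: where something was found on the web."""
    author: str
    source_url: str
    published_on: Optional[str] = None
    retrieved_on: Optional[str] = None


@dataclass
class SEvocation:
    """Layer 4: a custom record of one run of the scraper S."""
    script_url: str
    version: str
    config: str
    targets: List[Tuple[str, str]] = field(default_factory=list)  # [(name, URL)]


@dataclass
class CiteProcess:
    """Layer 3: how the collection was produced; may embed an S-evocation."""
    description: str
    evocation: Optional[SEvocation] = None


# Placeholder values from the example; none of these are real URLs or dates.
provenance = {
    "registry": RegistryPublication("User:Cas", "now", "10.0.0.42", "r1-update", "CL1"),
    "cite_web": CiteWeb("Bo", "URy", "last month", "now"),
    "cite_process": CiteProcess(
        "Scraped with Recipe Magnet",
        SEvocation("URs", "0.8.12", "param-string",
                   [("W1", "URL-of-W1"), ("W2", "URL-of-W2")]),  # hypothetical targets
    ),
}

# Flatten to plain dicts, e.g. for serialising alongside the collection.
flat = {layer: asdict(record) for layer, record in provenance.items()}
print(flat["cite_process"]["evocation"]["version"])
```

The nesting mirrors the text: the cite-process record references the S-evocation rather than duplicating its fields.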

Enriching provenance over time

I imagine this happening in stages, particularly when deciding what schemas apply and which parameters to fill out. One way this could proceed:

  1. The registry publication-schema is captured on upload.

  2. Bo + Cas discuss whether to source the upload to a) the existing URL, b) a description of the process, or both. They update the collection to include partial data for both. (They can’t remember the retrieved-on date or the exact source URLs used. In fact the cite-process style guide is fuzzy on when to include a source in the list of source URLs. But they take a stab.)

  3. After getting feedback from an interested replicator, Bo rerenders the collection and captures the exact configuration in an S-evocation, adding that as a new version (CL2.2). This brings minor changes to the data and revised process provenance.

  4. A bug is discovered in the scraper that affects 10% of recipes. It is fixed and the scraper is rerun on that subset, producing a new version of S and a revised collection (CL2.3).
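One way to picture the stages above is as an ordered log of provenance events, where each revision inherits what earlier ones recorded. This is only a sketch: the layer names and status strings are my own shorthand, and I am assuming the partial cites in stage 2 are attached to CL2.1 rather than to a new revision.

```python
from typing import Dict, List, Tuple

# (revision, provenance layer, status), in the order the stages happen.
EVENTS: List[Tuple[str, str, str]] = [
    ("CL2.1", "registry-publication", "captured on upload"),
    ("CL2.1", "cite-web", "partial"),
    ("CL2.1", "cite-process", "partial"),
    ("CL2.2", "s-evocation", "exact configuration"),
    ("CL2.2", "cite-process", "revised"),
    ("CL2.3", "s-evocation", "new version of S, rerun on affected subset"),
]


def provenance_as_of(revision: str) -> Dict[str, str]:
    """Replay events up to and including `revision`; later events overwrite earlier ones."""
    state: Dict[str, str] = {}
    seen = False
    for rev, layer, status in EVENTS:
        if seen and rev != revision:
            break  # we have moved past the requested revision
        state[layer] = status
        seen = seen or rev == revision
    if not seen:
        raise KeyError(revision)
    return state


for rev in ("CL2.1", "CL2.2", "CL2.3"):
    print(rev, provenance_as_of(rev))
```

A replicator looking at CL2.2 would see the exact S configuration, while someone holding CL2.1 would only see the partial cites.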

Provenancing individual recipes

We also want to know where each recipe comes from. Some elements of context we may care about:

  1. collection and registry context: Collections it is in, when it was created, other versions (as the details of that recipe are refined).

  2. cite-web details, just for that recipe: Author, source URL, publication date, retrieved-on date
    (Different for each recipe; parts of this data are not available for some recipes.)

  3. cite-process details, for the recipe, including the details of S-evocation, and which steps in that process applied to this recipe. Sometimes a blanket collection-level process summary suffices – a single evocation produced the entire output. But if S handles extraction, reconciliation, deduplication, and merging, or more complex modeling, you might want it to track which of these (using which sources + rules) produced each recipe in the result.

What else?

Is this entire collection a single assertion? When would you have separate assertions for each recipe?
Does some of this provenance vary by recipe, and need to be stored at that level of granularity?

  • Part 1 and some of part 3 are the same across the collection, and don’t need repetition.
  • The data schema for a recipe may include key parts of 2: author, source URL, publication date.
    A hard-core recipe site’s schema might include recipe versioning + variants.

This leaves details (retrieved-on date, archive-url; description of applicable S steps; long-tail fields in cite templates [which can often be autogenerated from the URL]) that are recipe-specific context, which you might want to know when referring to the recipe, and are not captured in the default recipe data-schema. But these are more optional: the common use cases noted at the start do not rely on these.
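The split between collection-wide and recipe-specific provenance can be sketched as a lookup that falls back from the recipe to the collection. The recipe ids, field names, and values below are hypothetical placeholders, not part of any schema discussed here.

```python
from collections import ChainMap
from typing import Dict, Optional

# Shared by every recipe in the collection (part 1 and some of part 3).
collection_prov: Dict[str, Optional[str]] = {
    "publisher": "User:Cas",
    "process": "S 0.8.12",
    "retrieved_on": None,  # unknown at collection level
}

# Recipe-specific context not covered by the recipe data-schema.
recipe_prov: Dict[str, Dict[str, Optional[str]]] = {
    "recipe-a": {"retrieved_on": "last month", "archive_url": "URa-archive"},
    "recipe-b": {},  # nothing recorded beyond the collection defaults
}


def effective_provenance(recipe_id: str) -> Dict[str, Optional[str]]:
    """Recipe-level entries take precedence; everything else is inherited."""
    return dict(ChainMap(recipe_prov.get(recipe_id, {}), collection_prov))


print(effective_provenance("recipe-a"))
print(effective_provenance("recipe-b"))
```

Only the differences are stored per recipe, so the collection-level record is never repeated.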

Let’s build up to formally describing a collection with multiple recipes from multiple sources.
1: a one-recipe collection.
2: two recipes from one source.
3: two recipes from two sources.
4: two one-recipe collections with different processes.

Here’s a draft (abusing file formats a bit for concision, will revise):

1. A one-recipe collection: Okonomiyaki

okonomiyaki_collection.toml

(riffed from this)

format = "ul://pseudo-coll.1"
namespace = "ul://r1/recipes"

name = "recipeMagnet/okonomiyaki"
id   = "ab36a43e-53e9-46b8-a8f0-2bcff1909bdc"

collection-version = "0.1"
schema = "recipe-v4.toml"
provenanceSchema = "sp_prov-v5.toml"

assertions = "okonomiyaki.nq"
files  = "images/okonomiyaki.jpg"

recipe-v4.toml

format = "ul://pseudo.1"
namespace = "ul://recipes"
version = "4"

[classes.Recipe]
name = { kind = "literal", datatype = "string", cardinality = "optional"}
url  = { kind = "literal", datatype = "string", cardinality = "optional"}
description = { kind = "literal", datatype = "string", cardinality = "optional"}
recipeText  = { kind = "literal", datatype = "string", cardinality = "optional"}
ingredient  = { kind = "reference", label = "Ingredient", cardinality = "any"}
step = { kind = "reference", label = "Step", cardinality = "any"}

A recipe may come in many formats, and should not be entirely empty, but it’s hard to say any one of these fields is required.

[classes.Ingredient]
id = { kind = "uri", cardinality = "required"}
description = { kind = "literal", datatype = "string", cardinality = "optional"}

[classes.Step]    
description = { kind = "literal", datatype = "string"} # has internal structure, not captured here
stepOrder   = { kind = "literal", datatype = "integer", cardinality = "optional"}

Steps may reference environment, tools, ingredients, and intermediate combinations.
Many steps can share the same step number (the more sous chefs, the better).
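As a rough illustration of what validating against this Recipe class could look like, here is a small Python check. It assumes "optional" means zero or one value and "any" means a list of zero or more, which is my reading rather than something the format defines. The "not entirely empty" rule is added as a separate check, since no single field is required.

```python
from typing import Any, Dict, List

# Cardinalities copied from [classes.Recipe] above.
RECIPE_FIELDS: Dict[str, str] = {
    "name": "optional",
    "url": "optional",
    "description": "optional",
    "recipeText": "optional",
    "ingredient": "any",
    "step": "any",
}


def check_recipe(recipe: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the recipe passes."""
    problems: List[str] = []
    for key, value in recipe.items():
        cardinality = RECIPE_FIELDS.get(key)
        if cardinality is None:
            problems.append(f"unknown field: {key}")
        elif cardinality == "optional" and isinstance(value, list):
            problems.append(f"{key}: expected at most one value")
        elif cardinality == "any" and not isinstance(value, list):
            problems.append(f"{key}: expected a list")
    # No field is required, but a recipe with nothing in it is not useful.
    if not any(recipe.get(key) for key in RECIPE_FIELDS):
        problems.append("recipe is entirely empty")
    return problems


print(check_recipe({"name": "Okonomiyaki", "step": []}))
print(check_recipe({}))
```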

sp_prov-v5.toml

A simple schema for source and process provenance.
For this first pass: let’s just accept any dict, to ensure we validate. For instance, the scraper might change what source-cite template/schema it uses based on the filetype; and it might change how it self-reports its process from version to version.
Names are optional shorthand for referring to source + process, and can be used to indicate similarity across different collections.

format = "ul://pseudo.1"
namespace = "ul://provenance"

[classes.SourceCite]
name = { kind = "literal", datatype = "string", cardinality = "optional" }
cite = "json" 

SourceCite could be any citation template. For a source website, this might include:
author, source URL (scraped by this process), source publication date, previous version, archive-url

[classes.ProcessCite] 
name =  { kind = "literal", datatype = "string", cardinality = "optional" }
cite = "json" 

ProcessCite could be any process-provenance template. For a scraper this might include:
description, name, version, author, source URL, parameters,
secondary URLs (relied on by this process),
name of collection curator, publication date
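"Accept any dict" could be enforced with a check as small as the one below. This is a sketch under my own assumptions: that `cite` arrives either as a JSON string or as an already-parsed object, and that `name` must be a string when present. The sample field names inside the cite are hypothetical.

```python
import json
from typing import Any, Dict


def check_cite(record: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a SourceCite or ProcessCite record and return its parsed cite."""
    name = record.get("name")
    if name is not None and not isinstance(name, str):
        raise TypeError("name must be a string when present")
    cite = record["cite"]
    # Accept either a JSON-encoded string or an already-parsed object.
    parsed = json.loads(cite) if isinstance(cite, str) else cite
    if not isinstance(parsed, dict):
        raise ValueError("cite must be a JSON object")
    return parsed


source = {"name": "bo-site", "cite": '{"author": "Bo", "source_url": "URy"}'}
process = {"cite": {"name": "S", "version": "0.8.12", "parameters": "param-string"}}

print(check_cite(source))
print(check_cite(process))
```

Nothing inside the cite is inspected, so the scraper is free to change templates between file types or versions without failing validation.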

Other files

okonomiyaki.nq – the assertion itself, including:

  • the ingredients and steps of the recipe
  • how + when + from where it was extracted
  • how + when + from where it was published as a collection
  • the authors of the recipe, the script, + this compiled .nq & .ulc

okonomiyaki.jpg – an image to represent the collection

2. Two two-recipe collections: Okonomini(2) and Okonomihi(1+1)

2a: Shrouded in Okonomini

Here we take two different Okonomiyaki recipes from the same site, scrape them in similar ways, and bundle them together in a single one-assertion collection.

okonomini_collection.toml

format = "ul://pseudo-coll.1"
namespace = "ul://r1/recipes"

name = "recipeMagnet/okonomini"
id   = "yotta-yotta"

collection-version = "0.1"
schema = "okonomi_recipe-v1.toml"
provenanceSchema = "sp_prov-v5.toml"

assertions = "okonomini.nq"
files  = ""

okonomi_recipe-v1.toml

format = "ul://pseudo.1"
namespace = "ul://recipes"
version = "1"
import = { url = "http://undr.ly/s/recipes",  version = "1.1 or higher" }     # does this break?

[classes.Recipe]
name = { kind = "literal", datatype = "string", cardinality = "optional"}
url  = { kind = "literal", datatype = "string", cardinality = "optional"}
description = { kind = "literal", datatype = "string", cardinality = "optional"}
recipeText  = { kind = "literal", datatype = "string", cardinality = "optional"}
ingredient  = { kind = "reference", label = "Ingredient", cardinality = "any"}
step = { kind = "reference", label = "Step", cardinality = "any"}
topping     = { kind = "reference", label = "Ingredient", cardinality = "any"}

Steps and Ingredients are now imported from a common Recipe schema.
Calling out the topping separately.
How do I say “one or more” Ingredients?
This is a variant of recipe.toml, and could just inherit from it and add the last line. How do I indicate that?
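I don't know how the schema format would express inheritance, but the intended result can be sketched in Python as a merge: start from the base Recipe fields and add `topping`. The `extend` helper is hypothetical and only shows what the effective class should contain.

```python
from typing import Dict

Field = Dict[str, str]

LITERAL: Field = {"kind": "literal", "datatype": "string", "cardinality": "optional"}

# Fields of the base Recipe class, as in recipe-v4.toml.
BASE_RECIPE: Dict[str, Field] = {
    "name": dict(LITERAL),
    "url": dict(LITERAL),
    "description": dict(LITERAL),
    "recipeText": dict(LITERAL),
    "ingredient": {"kind": "reference", "label": "Ingredient", "cardinality": "any"},
    "step": {"kind": "reference", "label": "Step", "cardinality": "any"},
}


def extend(base: Dict[str, Field], extra: Dict[str, Field]) -> Dict[str, Field]:
    """Return a new class definition: the base fields plus (or overridden by) extra."""
    merged = dict(base)
    merged.update(extra)
    return merged


OKONOMI_RECIPE = extend(
    BASE_RECIPE,
    {"topping": {"kind": "reference", "label": "Ingredient", "cardinality": "any"}},
)

print(list(OKONOMI_RECIPE))
```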

sp_prov-v5.toml

Still leaving this generic for now, rather than having a custom schema.
But including it here with different commentary.

format = "ul://pseudo.1"
namespace = "ul://provenance"

[classes.SourceCite]
name = { kind = "literal", datatype = "string", cardinality = "optional" }
cite = "json" 

We’re following the “web cite” citation template. This can include:
author, source URL (scraped by this process), source publisher, source publication date, previous version, archive-url
Most of these are different for each recipe. Only “source publisher” is the same for both recipes from a single site.

[classes.ProcessCite] 
name =  { kind = "literal", datatype = "string", cardinality = "optional" }
cite = "json" 

ProcessCite for our Recipe Magnet scraper, compiling + posting to R1, includes:
description, name, version, author, source URL, parameters,
secondary URLs (relied on by this process),
name of curator, registry publication date

Other files

okonomini.nq – the assertion of 2 recipes from one site, including:

  • the ingredients and steps of the two recipes
  • how + when + from where they were extracted
  • how + when + from where this was published as a collection
  • the authors of the recipes, the scraper, + this compiled .nq & .ulc

okonomini.pdf – an illustrated walkthrough of those recipes

2b: Okonomihi

Here we take two different recipes from different sites, using different parameters of the same scraper, and bundle them together as a one-assertion collection.

Future questions:
~ How could we have made 2a or 2b as two-assertion collections?
~ How would this change if the sources were two one-recipe collections compiled into this format from the source websites, rather than sites directly?

(*coming soon*)

okonomihi_collection.toml

Provenance considerations

Other files

okonomihi.nq – the assertion of two recipes from two different sites, including:

  • the ingredients and steps of the two recipes
  • how + when + from where they were extracted
  • how + when + from where this was published as a collection
  • the authors of the recipes, the scraper, + this compiled .nq & .ulc

Two quick notes about provenance modeling (that I don’t think I’ve been very clear about) are that

  • a collection specifies a provenance key (one of the labels from the provenance schema)
  • that key is not interpreted as “this is the provenance of the data in the graphs it’s attached to” but actually instead “this is the graph it’s attached to”

In other words, there’s no implicit “is attributed to” in the provenance relation. A typical provenance key won’t be a “primary source” class (like something that has a “url” property) or a “derivation process” class (like something that has a script link) but rather an “entity” class, where that entity represents the collection of triples in that named graph as a digital resource. From there, you might have properties like entity.wasDerivedFrom or entity.wasAttributedTo.

(this may have already been clear, I couldn’t really tell)
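A small Python sketch of that reading, using plain tuples as quads. The graph names and recipe id are hypothetical; `wasDerivedFrom` and `wasAttributedTo` are the property names mentioned above, and URy / Bo are placeholders from the example. The point is that the named graph itself appears as the subject of the provenance statements.

```python
from typing import Dict, List, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)

GRAPH = "graph:okonomiyaki"            # hypothetical name of the data graph
PROV_GRAPH = "graph:okonomiyaki-prov"  # hypothetical graph holding provenance

quads: List[Quad] = [
    # The data: triples inside the named graph.
    ("recipe:1", "name", "Okonomiyaki", GRAPH),
    # The provenance: the entity *is* the named graph, described like any other resource.
    (GRAPH, "wasDerivedFrom", "URy", PROV_GRAPH),
    (GRAPH, "wasAttributedTo", "Bo", PROV_GRAPH),
]


def describe_entity(entity: str, all_quads: List[Quad]) -> Dict[str, List[str]]:
    """Collect every property stated about `entity`, regardless of graph."""
    description: Dict[str, List[str]] = {}
    for subject, predicate, obj, _graph in all_quads:
        if subject == entity:
            description.setdefault(predicate, []).append(obj)
    return description


print(describe_entity(GRAPH, quads))
```

There is no implicit attribution here: the source and author only show up because explicit properties were stated on the graph entity.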
