Data and Model sharing

Questions to answer to capture provenance and context. What else should we include? Currently:

0. Data structure

0.1 Schemas
0.11 What ontologies, schemas and shapes are used?
0.12 Are these defined in an overall spec?
0.13 Are the shapes + their specs versioned?

0.2 Formats
0.21 What file and data formats are used?
0.22 What database structures and formats are used?
0.23 Can these be easily converted to common shared formats?
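
One way to capture answers to the questions above is as a machine-readable metadata record. A minimal sketch in Python (all field names and values are illustrative, not drawn from any existing standard):

```python
import json

# Hypothetical structure-metadata record: which schemas/shapes are used,
# at what versions, and in which file and database formats.
dataset_structure = {
    "schemas": [
        {"name": "observation-shape", "spec": "project-spec", "version": "2.1.0"},
    ],
    "formats": {
        "files": ["csv", "parquet"],
        "database": "postgresql",
        "convertible_to": ["json", "rdf"],
    },
}

print(json.dumps(dataset_structure, indent=2))
```

Keeping such a record under version control alongside the data answers the "are these versioned?" questions for free.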

1. Provenance

1.1 Sources
1.11 What sources are used; drawn from what set?
1.12 Are the sources versioned?
1.13 How often is each source pulled / pushed / otherwise updated?

1.2 Credit
1.21 Who was involved in producing and sharing data?
1.22 Is there CRediT-style attribution for different roles?

2. Process provenance

2.1 Toolchains
2.11 What toolchains and pipelines are involved?
2.12 What upstreams contribute to this work?
2.13 How are changes to these workflows recorded?
2.14 When do changes trigger a recompilation?

2.2 Dependencies
2.21 Is there an explicit process- or workflow-dependency tree?
2.22 When does a stale dependency trigger a recompilation?
2.23 Are there any push options for updates, or flagging of critical updates?
2.24 Are there social, institutional, or environmental dependencies?
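
An explicit dependency tree makes the recompilation questions answerable mechanically: given a stale step, everything transitively downstream of it needs rebuilding. A sketch with a hypothetical tree (step names are illustrative):

```python
# Each step lists the steps it consumes.
deps = {
    "raw_pull": [],
    "external_source": [],
    "cleaned": ["raw_pull"],
    "merged": ["cleaned", "external_source"],
    "report": ["merged"],
}

def downstream_of(stale, deps):
    """Return every step that transitively depends on `stale`."""
    out = set()
    changed = True
    while changed:
        changed = False
        for step, inputs in deps.items():
            if step not in out and (stale in inputs or out & set(inputs)):
                out.add(step)
                changed = True
    return out

print(sorted(downstream_of("cleaned", deps)))  # → ['merged', 'report']
```

A push or flagging mechanism (2.23) would then amount to notifying the owners of exactly this downstream set.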

3. Reuse

3.1 Licensing and Embargoes
3.11 Is the data public?
3.12 If not yet public, is there an embargo period for public sharing?
3.13 Once public, under what license is it available?
3.14 Is the license compatible with the licenses of major contributing datasets, or other popular knowledge bases (such as Wikidata)?
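
The compatibility question can be pre-checked with a simple one-way compatibility map, though any real answer needs legal review. A deliberately simplified sketch (the matrix below is illustrative, not authoritative):

```python
# Can data under the key license be incorporated into a dataset
# released under the listed licenses? Simplified and illustrative only.
CAN_FLOW_INTO = {
    "CC0": {"CC0", "CC-BY-4.0", "CC-BY-SA-4.0", "ODbL"},
    "CC-BY-4.0": {"CC-BY-4.0", "CC-BY-SA-4.0"},
    "CC-BY-SA-4.0": {"CC-BY-SA-4.0"},
}

def compatible(source_license, target_license):
    return target_license in CAN_FLOW_INTO.get(source_license, set())

# Wikidata publishes under CC0, so CC0 sources flow into most targets:
print(compatible("CC0", "CC-BY-SA-4.0"))        # True
print(compatible("CC-BY-SA-4.0", "CC-BY-4.0"))  # False
```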

3.2 Dumps
3.21 How are dumps provided: name, format, versioning?
3.22 Is there a feed of updates to dumps?

3.3 Logging use
3.31 What downstreams are using this work?
3.32 Is a visible log of this use kept, and at what level of detail?
3.33 Is this usage visible to other reusers, via pingbacks or similar mechanisms?

3.4 Remixing
3.41 Is this used, or planned for use, in any metastudies?
3.42 What processing (schema mappings, fuzzing or anonymization, other) is applied for each metastudy that includes it?
3.43 Is the mapping for use in any metastudy encoded in a named package or configuration file that others could use?
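
A mapping "encoded in a named package or configuration file" could be as small as a versioned dict that others reuse. A sketch (the mapping name, fields, and dropped columns are all hypothetical):

```python
# A named, shareable schema mapping: source field -> target field.
MAPPING_V1 = {
    "name": "local-to-metastudy",
    "version": "1.0.0",
    "fields": {"subj_id": "subject", "dob_year": "birth_year"},
    "drop": ["internal_notes"],  # anonymization: fields never exported
}

def apply_mapping(record, mapping):
    """Project a record into the metastudy schema, dropping unmapped fields."""
    return {
        target: record[source]
        for source, target in mapping["fields"].items()
        if source in record
    }

row = {"subj_id": "a17", "dob_year": 1990, "internal_notes": "private"}
print(apply_mapping(row, MAPPING_V1))  # → {'subject': 'a17', 'birth_year': 1990}
```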

4. Data selection

4.1 Filters
4.11 How was data chosen for measurement/inclusion?
4.12 How is it noted when this changes?

4.2 Data cleaning
4.21 What data cleaning or noise correction was used in compiling the data?
4.22 What other workflows were applied to the raw data?
4.23 How were these workflows registered before the raw data was gathered?
4.24 How are these workflows and pipelines named and versioned?
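
One way to name and version a workflow is to fingerprint a canonical description of its steps, so any change to the pipeline changes its version identifier. A sketch with illustrative step names:

```python
import hashlib
import json

# A cleaning workflow described as data, then hashed for a stable version.
workflow = {
    "name": "clean-and-denoise",
    "steps": ["drop_duplicates", "winsorize_p99", "impute_median"],
}

def workflow_fingerprint(wf):
    """Short content hash of a canonical JSON encoding of the workflow."""
    canonical = json.dumps(wf, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

print(workflow["name"], workflow_fingerprint(workflow))
```

Because the encoding sorts keys, the fingerprint depends only on content, not on dict ordering; logging it with each output answers 4.23-4.24.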

4.3 Parallels
4.31 What similar efforts or alternatives exist?

5. Replication

5.1 Replicability
5.11 What is the whole tale of your work – what environment and setup are needed to replicate it?
5.12 Is this articulated in a [Whole Tale] file?
5.13 Does this file include workflow + usage notes?

5.2 Replicatedness
5.21 Has your process been replicated in practice?
5.22 By how many independent parties has it been replicated?

6. Prediction

6.1 Registration
6.11 Which of the above were pre-registered with a registration service?
6.12 Which were registered or announced during the research, before its final analysis and conclusions?

6.2 Change logs
6.21 Are there logs kept of changes to protocols, processes, and data cleaning?
6.22 Are there lab notebooks kept of the development of the research?

6.3 Confirmation target
6.31 Was there a public target the research was intended to confirm or verify?
6.32 How was confirmation bias avoided in the design and analysis?
6.33 How was the garden-of-forking-paths problem avoided?

7. Error

7.1 Estimation
7.11 How are errors and noise estimated for each process or observation?

7.2 Propagation
7.21 How are errors propagated through processes and combinations?
7.22 How are resultant errors in conclusions described or characterized?
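
For uncorrelated errors, the standard first-order answer to 7.21 is propagation in quadrature: absolute errors add in quadrature for sums, relative errors for products. A minimal sketch with illustrative values:

```python
import math

def sum_error(*abs_errs):
    """Absolute error on a sum of independent quantities."""
    return math.sqrt(sum(e * e for e in abs_errs))

def product_rel_error(*rel_errs):
    """Relative error on a product of independent quantities."""
    return math.sqrt(sum(r * r for r in rel_errs))

# Example: x = 10.0 +/- 0.3, y = 5.0 +/- 0.4
x, y = 10.0, 5.0
print("x + y error:", sum_error(0.3, 0.4))  # → 0.5
rel = product_rel_error(0.3 / x, 0.4 / y)
print("x * y:", x * y, "+/-", x * y * rel)
```

Reporting the resultant error on each conclusion (7.22) then reduces to carrying these combined terms through the final computation.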