Computational Advances in Data-Consistent Inversion

Measure-Theoretic Methods for Improving Predictions

Motivation

  • Ideas beget prototype notebooks/scripts
  • Spawned other branches of development
  • Eventually would attempt to scale code
    • school servers, office workstation
    • cloud (??)



FAILURE TO RUN

A Pattern

  • Prototype script for idea: ~10-20 lines
  • Production-grade class/module: ~200-500 lines

or,

  • Figure generates on macOS, crashes on Linux
  • Figure generates on Linux, looks weird on macOS

etc.

Motivation

Result: Began studying Software Development

  • Dependencies, Packaging
  • Environment, Reproducibility
    • Images, Containers, Docker
    • Registries, Cloud Solutions
  • Unit Testing
  • Continuous Integration/Deployment
  • Organization of Git-Based Workflows

What if…?

One applied best practices for software development to all aspects of the dissertation process?

What if…?

  • Document (PDF) itself can be re-created

    • resolve LaTeX dependencies
    • package into Docker container
  • Examples/Figures/Tables Completely Reproducible

    • resolve Python dependencies
    • re-usable command-line Python scripts
    • provide convenience bash scripts

What if…?

  • Novel math is implemented as open-source code

    • update, extend, and enrich BET
    • enables thesis examples
    • ensures robustness (unit testing)
  • Package an All-in-One Docker Image

    • host in cloud-based registries (DockerHub)
    • original build files publicly hosted (GitHub)
    • thesis repository contains full instructions

Order of Operations

  1. Resolve environment for thesis repository
  2. Get BET up-to-date
  3. Add features to BET implementing new approach
  4. Make more user-friendly, capable of automation
  5. Write scripts for each example in thesis
  6. Assured of results, write them up
  7. Ensure thesis still compiles, generates figures

A Reproducible Thesis

  • One-click launch in-browser via Binder
  • Executable scripts: #!/bin/bash, chmod +x
  • Minimal tool-set in Dockerfile, options available
    • include command-line tools (e.g. image resizing)
  • Dockerfile for LaTeX (standalone), no figure-creation
  • Dockerfile for Python+LaTeX (reproducibility)
  • Dockerfile for full-stack environment (JupyterLab)

BET

[Butler, Estep, Tavener] Method

+ Mattis, Graham, Walsh, Pilosov

github.com/UT-CHG/BET

About BET

  • Purpose: Implement Measure-Theoretic Methods
    • Data-Consistent Inversion
  • GNU General Public License
  • First released 2014
  • Python 2.7, upgraded to 3.6+
    • Tested weekly on 2.7, 3.6, 3.7
  • Fully documented

Documentation

BET is auto-documented using a tool called Sphinx

r"""
This is a description of the method.
All sorts of formatting options are understood by Sphinx.
(kind of something to learn on its own, but usually you
can make sense of syntax by looking at existing ones)
"""
  • Triple-quoted docstrings like this one are converted into web-page documentation
  • Updating docs through command-line
  • GitHub pages used to host the resulting output
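
As a small illustration of the docstring style Sphinx understands, here is a sketch of a function documented with reStructuredText field lists. The function `normalize` is hypothetical and is not part of BET; it only shows the `:param:`, `:returns:`, and `:raises:` fields and inline math that Sphinx (with the `sphinx.ext.autodoc` extension) can render.

```python
def normalize(weights):
    r"""
    Rescale non-negative weights so that they sum to one.

    :param weights: non-negative numbers, at least one of them positive
    :type weights: list of float
    :rtype: list of float
    :returns: the weights divided by their sum, :math:`w_i / \sum_j w_j`
    :raises ValueError: if the weights sum to zero
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]
```

The raw-string prefix `r` keeps backslashes in the math markup from being read as escape sequences.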

Unit-Testing, Coverage, Versioning

  • nose is a Python framework for unit testing
    • desirable: compatible with unittest
  • codecov works alongside it as a tracker
  • setup.py contains versioning information
    • pip install ., or
    • python setup.py install
    • Major/Minor releases ~ extent of changes
    • The third number in v1.2.3 is for incremental changes, such as bug-fixes, typos, patches, etc.
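
The major/minor/patch convention above can be sketched in a few lines. The helper `bump_version` is hypothetical (it is not part of BET or of setuptools) and assumes a plain three-number version string with an optional leading `v`.

```python
def bump_version(version, part):
    """Hypothetical helper: increment one part of a 'major.minor.patch' string."""
    major, minor, patch = (int(piece) for piece in version.lstrip("v").split("."))
    if part == "major":
        # Large or breaking changes reset the two lower numbers.
        return f"{major + 1}.0.0"
    if part == "minor":
        # New features reset only the patch number.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Bug-fixes, typos, and patches touch only the third number.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor', or 'patch'")
```

For example, `bump_version("v1.2.3", "patch")` gives `"1.2.4"`, while `bump_version("v1.2.3", "minor")` gives `"1.3.0"`.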

Unit Testing with Nose

  • One-to-one file structure
    • one test (class) for each sub-module method
    • multiple “tests” within each
  • Class consists of several methods
    • setup and teardown required
    • test for each function within module
  • Each test should anticipate mixtures of arguments that could be passed to each function
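
A minimal sketch of the class layout described above, written against the standard-library `unittest` module (which nose can also discover and run). The function `rescale` and the test class are hypothetical stand-ins for a module function and its tests; they are not taken from BET.

```python
import unittest


def rescale(values, lower=0.0, upper=1.0):
    """Hypothetical module function: linearly map values onto [lower, upper]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    return [lower + (upper - lower) * (v - lo) / (hi - lo) for v in values]


class TestRescale(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: build shared inputs.
        self.values = [2.0, 4.0, 6.0]

    def tearDown(self):
        # Runs after every test method: release shared inputs.
        del self.values

    def test_default_bounds(self):
        self.assertEqual(rescale(self.values), [0.0, 0.5, 1.0])

    def test_custom_bounds(self):
        self.assertEqual(rescale(self.values, lower=-1.0, upper=1.0),
                         [-1.0, 0.0, 1.0])

    def test_constant_input_rejected(self):
        # Anticipate a degenerate argument a caller could pass.
        with self.assertRaises(ValueError):
            rescale([3.0, 3.0])


def run_tests():
    """Load and run the test class, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRescale)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method covers a different mixture of arguments: defaults, keyword overrides, and an invalid input.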

Unit Testing

from unnecessary_math import multiply

def test_numbers_3_4():
    assert multiply(3, 4) == 12

def test_strings_a_3():
    assert multiply('a', 3) == 'aaa'

nosetests -v test_um_nose.py

Source

Continuous Integration

  • Cloud instance carries out instructions
  • Travis runs when you submit a PR to check that everything works
  • GitHub checks for ability to merge automatically
  • Passing does not ensure a PR is merged
    • Ultimately up to the administrators of the repo
  • Helpful for contributors to debug before admins take a look
  • Used to prevent broken master branches

Upgrading BET

  • Python 2 support ending by 2020
  • Used a tool called 2to3
    • Takes care of most major changes
    • Two weeks fixing tests
  • Fixed tests for CI pipeline (Travis)
    • only handled numprocs=1,2
  • Addressed matplotlib upgrades, warnings
    • Released via PR as v2.1.0

Enhancing BET

  1. Ability to measure accuracy of solutions.
    • ensure it will be future-compatible
  2. Sampling-based approach
    • ensure ability to switch methods
  3. Handle data-driven methods
    • be capable of loading/transforming data
  4. Automate some decisions, defaults for users (WIP)
  5. Update documentation, installation options (WIP)
  6. Publish to PyPI, Anaconda

Novel Theoretical Advancements

  • New framework based on Bayes’ Rule
  • New framework for “parameter identification”
  • Motivates different user experience with code
    • define “initial” assumptions
  • “Consistent Bayes” -> “Data-Consistent Inversion”

Data-Consistent Inversion

  • In directions informed by data,
    • “turn off” regularization
  • Use “initial” distribution to regularize in the nullspace of the QoI map
  • Existence, Uniqueness, Stability given by Disintegration Theorem

Connection to Deterministic Optimization

$$\pi^{up}(\lambda) = \pi^{in}(\lambda) \frac{\pi^{ob}(Q(\lambda))}{\pi^{pr}(Q(\lambda))}$$

  • Given a linear map, full rank, and
    • Gaussian prior/initial
    • Gaussian likelihood (1 datum)

There exists a connection to Tikhonov regularization.
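
The update formula on this slide can be approximated with samples. The following is a minimal sketch under stated assumptions, not BET's implementation: the map `qoi_map`, the uniform initial density, the Gaussian observed density, the kernel bandwidth, and all function names are hypothetical choices made for illustration. Here $\pi^{pr}$ is read as the predicted density, i.e. the push-forward of the initial density through $Q$, estimated with a Gaussian kernel density estimate.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    n = len(points)

    def density(x):
        return sum(gaussian_pdf(x, p, bandwidth) for p in points) / n

    return density


def qoi_map(lam):
    # Hypothetical quantity-of-interest map Q (linear, for illustration only).
    return 2.0 * lam


rng = random.Random(0)
num_samples = 1500

# 1. Sample the initial density pi_in (assumed uniform on [-1, 1]).
initial_samples = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

# 2. Push the samples through Q and estimate the predicted density pi_pr.
outputs = [qoi_map(lam) for lam in initial_samples]
predicted = kde(outputs, bandwidth=0.2)

# 3. Evaluate the ratio pi_ob(Q(lam)) / pi_pr(Q(lam)) at every sample;
#    the observed density pi_ob is assumed to be N(0.5, 0.2^2).
ratios = [gaussian_pdf(q, 0.5, 0.2) / predicted(q) for q in outputs]

# 4. Rejection sampling: keep each initial sample with probability ratio / max.
max_ratio = max(ratios)
updated_samples = [lam for lam, r in zip(initial_samples, ratios)
                   if rng.random() < r / max_ratio]

# Diagnostic: the sample mean of the ratio should be close to 1.
mean_ratio = sum(ratios) / num_samples
```

For this linear example the updated density is known in closed form, a Gaussian in $\lambda$ with mean 0.25 and standard deviation 0.1, so the accepted samples can be checked against it.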


New Developments

  • Parameter Identification
    • “data-driven maps”
  • Closed-form solution for linear maps
  • Iterative Algorithm
    • sequential projections onto solution manifolds
    • gradient-free optimization
  • Classification (variant of Naive Bayes)
    • handles unequal class-representation
    • comparison still unclear
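
For linear maps, "sequential projections onto solution manifolds" has a classical counterpart: Kaczmarz's method, which cyclically projects an estimate onto the hyperplane defined by each equation. The sketch below illustrates that classical method only. It is not the iterative algorithm developed in the thesis, it uses the rows of the map explicitly, and the function name and example system are assumptions.

```python
def kaczmarz(rows, targets, start, sweeps=50):
    """Cyclically project `start` onto the hyperplanes rows[i] . x = targets[i]."""
    x = list(start)
    for _ in range(sweeps):
        for row, target in zip(rows, targets):
            # Signed distance to the hyperplane, scaled by the squared row norm.
            residual = target - sum(a * xi for a, xi in zip(row, x))
            scale = residual / sum(a * a for a in row)
            # Orthogonal projection onto the solution set of this one equation.
            x = [xi + scale * a for xi, a in zip(x, row)]
    return x


# Example system: 2x + y = 5 and x + 3y = 10, whose solution is (1, 3).
solution = kaczmarz([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0], start=[0.0, 0.0])
```

Each inner step satisfies one equation exactly; repeating the sweep moves the estimate toward the point that satisfies all of them when the system is consistent.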

Challenges with Implementation

  • Implementation vs. Theory: many nuances
    • various packages, approaches, optimizations
  • How to be efficient with parallel processors?
  • Where are the “memory bottlenecks”?
    • Can we trade off approximation accuracy?
  • Choice of QoI (columns) impacts solution quality
    • Allow flexibility to do feature-selection
  • Working on a project with wide scope
    • Work often siloed in development
    • Sensitivity/robustness analysis before MVP

Structure of Thesis (Pt 1)

1. Introduction & Motivations
      i. Preliminaries
     ii. Framework
    iii. Software Contributions

2. Background on DCI
      i. Notation
     ii. Set-Valued
    iii. Sample-Based
     iv. Software Contributions
      v. Illustrative Examples

3. Impact on Accuracy
    - sections mirror Chapter 2.

Structure of Thesis (Pt 2)

4. Data-Driven Maps & Consistent Inversion
      i. Stochastic Map Framework
     ii. Data-Driven Maps
    iii. Software Contributions
     iv. Numerical Results & Analysis

5. Research Directions
    - Fit in extensions
    - Work-in-progress, draft ideas
    - Approaches to approximation

6. References, Appendix

Status

  • Dissertation repository largely in place
  • Docker experience level improving
    • still missing minimal versions
    • one-click launch “working”
  • Architecture/Structure of thesis mostly settled
  • Some bash/python scripts completed for examples
    • Larger examples in notebooks, unconverted
  • Still a lot of content to be written up

Status

  • Software going through final rounds of testing
  • Documentation for new features still missing
  • Some (new) tests failing in parallel
  • Haven’t fully integrated with mybinder.org yet
  • No progress on demonstration/example repository
    • plan: use thesis examples as basis for content

Fin

Links:

Homepage

Slideshows

BET