JupyterHub

The End-User Experience

You go to a web address and are greeted by the JupyterHub login portal. Once you log in, you see a drop-down menu that lets you choose which “stack” of software dependencies you would like to use. After you make your selection, you are redirected to an in-browser interface where all your files are available. Jupyter notebooks, R scripts, anything you want can now be run directly through this interface.

The work is performed “in the cloud”: a connection is established between your browser window (the client) and a server (the host) elsewhere that is capable of performing the calculations you need.

Benefits

You do not have to worry about installing software dependencies, since the administrator has already loaded them into the environment you selected. The machine you are working on does nothing except send commands to the host and display the results, which means your computer consumes less energy.

Moreover, since all you need is a functional internet browser, you can do all of your scientific work with a budget laptop (or, in a pinch, a phone or tablet). For academic institutions, this means that entire laboratories of computers can be repurposed for better uses, since students will not be limited by the technology they can afford.

If you are an administrative user, you can add other users through a control panel available in this interface, and even see when these users are logged in. Depending on how the system administrator configured things, you may also be able to start an environment as any of these users and look around their folders and files, perhaps as a teacher helping debug some code during remotely held office hours.

Sounds neat, doesn’t it? This lowering of the friction of programming is especially appealing for academics: the hardest parts of setting up a class with the resources needed to start learning how to code have been abstracted away, for both the students and the teachers who serve as administrators. All of this is already possible, and there are many ways to set this exact thing up, which makes navigating how to do so a little difficult. We will discuss this later, after building up the proper motivation.

Application Architecture

This separation between where work is performed and where its results are shown is essential to understanding how most modern “applications” are built. By not burdening end-users with installation dependencies that vary with the technology they own, application developers avoid worrying about cross-platform support. Many applications are now simply re-skinned versions of a browser called Chromium[1], which direct you to the company’s servers where all the work is performed.

In this spirit, the Jupyter Project has done the same thing for scientists and programmers. The open-source community that powers most of the work done by Jupyter is largely concentrated at UC Berkeley, and adoption of their work is spreading fast across the world. However, due to the rapid pace of development, a cohesive narrative explaining how all the technologies developed by the Jupyter Project are connected is hard to come by. We attempt to address this problem in this document, stepping through the layers of abstraction and building up a story that helps develop an understanding of how these underlying technologies interface and the many ways in which they can be configured[2].

JupyterHub

The management of users and of the environments they are able to use is done through the interface known as JupyterHub (the “Hub” for short). The Hub controls who is allowed to access a notebook server through an Authenticator. This can be a username/password combination, or a one-click login through a third party such as GitHub, GitLab, Google, Azure, or your university’s login page.

JupyterHub passes login information to the Authenticator, which returns a “successful” (or not) message to the Hub, which (if successful) then performs some action. In the example above, the action performed was to first display a drop-down menu that allows a user to select an environment. (This can be bypassed if you want every end-user to have the same environment.) JupyterHub then “spawns” a “single-user (notebook) server” to establish the connection between the computational resources and the user-facing interface in the browser[3]. The Hub launches such a server for each user who successfully logs in.

JupyterHub is thus made up of:

  • An Authenticator (that lets people in or keeps them out),
  • a Spawner (which defines individual environments),
  • a Database (to store users), and
  • a Server (to connect all of it).

[schematic of JupyterHub’s components]
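
Each of these components corresponds to a setting in the Hub’s configuration file, jupyterhub_config.py. Here is a minimal sketch wiring the four pieces together; the class names are JupyterHub’s built-in defaults, and the values shown are illustrative rather than a production setup:

# jupyterhub_config.py
c = get_config()  # provided by JupyterHub when it loads this file

# Authenticator: decides who gets in (PAM checks the host's system users).
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Spawner: starts each user's single-user notebook server.
c.JupyterHub.spawner_class = 'jupyterhub.spawner.LocalProcessSpawner'

# Database: where the Hub keeps track of users and their servers.
c.JupyterHub.db_url = 'sqlite:///jupyterhub.sqlite'

# Server: the address and port the Hub itself listens on.
c.JupyterHub.bind_url = 'http://:8000'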

Notebook Server

At the heart of the system is the Jupyter notebook server. The notebook server communicates with the relevant processes on the host machine to carry out the computations that you (the client) have asked it to perform through your browser interface. The machine running the notebook server has all the necessary dependencies to perform the work, and your end-user experience is managed through the Jupyter notebook server.

The native file format for the notebook server’s interface is the “interactive notebook” (which has the .ipynb extension). A notebook is nothing more than a collection of key-value pairs (dictionaries, maps, associative arrays; many names exist for the same idea) whose main entry is a list of “cells,” each of which carries input/output information along with formatting specifications. Since it is best to see this by example, here is an abstraction of what it looks like:

{"cell_1": {info} }
{"cell_2": {info} }
{"cell_3": {info} }

where each one of these cell entries is itself a dictionary, which might look something like:

{"input": <>, 
 "output": <>,
 "execution_count": <>, 
 "kernel": <>, ...
}

All of this information is used to format an HTML page that is then rendered by a browser of your choice, since at this point it is simply a web page.
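
Because the file is plain JSON, you can inspect a notebook with nothing but Python’s standard library. A minimal sketch (example.ipynb is a placeholder for any notebook you have on disk):

import json

with open("example.ipynb") as f:
    nb = json.load(f)

# Print each cell's type and the first line of its source.
for cell in nb["cells"]:
    source = "".join(cell["source"])
    print(cell["cell_type"], "|", source.splitlines()[0] if source else "")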

Now, it is important to note that if you are using a computer that already has all the required software dependencies, then you can become the “host” yourself and run the notebook server locally. In that event, the address you visit to see the notebook server’s file-browsing interface and work with notebooks will look something like http://localhost:8000, where localhost is telling your browser that … well, the host is local, not somewhere remote[4]. The :8000 (the number can differ) is the port being opened on the machine to allow for input/output connections, which is what the server handles. (Remember this… it’ll come up later.)

An Analogy

If the website is considered the “address” of the machine(s), the port is the “apartment number.” Many “tenants” (applications) can “live” (be hosted) at the same “address” (website). Some buildings choose to have a doorman, so you can ask for directions to the right place by name instead of apartment number. So if you do not see a port number (:XXXX) anywhere in the browser address, you can assume that the configuration implemented has a “doorman” (proxy) directing traffic to the right places.
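
To make the analogy concrete, here is a minimal sketch using Python’s standard library. A “server” is nothing more than a process that claims a port at an address and listens for connections:

import socket

# Claim "apartment" 8000 at the "address" localhost on this machine.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 8000))
server.listen()
print("Anything sent to localhost:8000 now reaches this process.")
server.close()  # give the "apartment" back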

This about sums up what you need to know about the notebook server. Somewhere there exists a machine that can do what you want. By running a notebook server on this machine, you can connect to it through a web-browser.

Spawning

There are many ways to start the (single-user) notebook servers. They can be started as individual processes on the same machine the Hub is running on; this is the default behavior, and the implementation used in The Littlest JupyterHub.

However, in the spirit of modularity, they can also be started on machines elsewhere in the world. These machines may be set up with all the dependencies, waiting for authenticated connections from the Hub, or they can be created (and destroyed) on demand. They can be physical machines, or “virtual machines,” which (as the name suggests) are emulations of computers that run as processes on top of some existing architecture.

Containerization

Virtual machines allow for something called “containerization,” which isolates applications in (usually Linux) virtual environments that contain little more than the bare necessities for running a given application.

To scale to millions of users, services exist that allow for the creation and destruction of virtual machines on demand, with low latency and pricing computed by the second. Such services sit atop a software platform called Kubernetes and require a decent amount of technical experience to configure. JupyterHub can be configured to “spawn” single-user notebook servers through such a service using an extension called “Kubespawner.” However, we do not provide any more information about this here, because the JupyterHub wiki is written specifically for that implementation case.

We focus instead on a solution that is capable of scaling to hundreds of users and can be built with existing, underutilized server resources that a department may already have. It is a little more technically challenging to implement than The Littlest JupyterHub, which functions essentially as a one-click install and provides its own set of configuration scripts.

Docker

A simpler solution, though, can be achieved through Docker, which (although it has a learning curve) is capable of scaling to very large workloads and can be configured to connect multiple servers and balance traffic among them.

We choose to run JupyterHub in one Docker container, the database in another, and to spawn the single-user notebook servers each in an isolated container as well. The latter containers are ephemeral: once a user shuts down their single-user server (or it is shut down after a predetermined period of inactivity), the container is destroyed. Containerizing every aspect of this project allows for a really simple deployment scenario, which we walk through on the Deployment page.
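
In configuration terms, this amounts to swapping out the Spawner. A sketch of the relevant lines of jupyterhub_config.py, assuming the dockerspawner package is installed (the image and network names are placeholders):

c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

# The image each single-user container is created from.
c.DockerSpawner.image = 'jupyter/scipy-notebook'

# The Hub and the single-user containers must share a Docker network.
c.DockerSpawner.network_name = 'jupyterhub-network'

# Ephemeral containers: destroy a container once its server stops.
c.DockerSpawner.remove = True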

We can “containerize” the environments we want our users to experience in their single-user notebook servers through a “Dockerfile,” which is simply a list of instructions for configuring a brand-new Ubuntu (a Linux distribution) machine with all the dependencies.

The single-user notebook servers can be launched from a choice of “images,” which define the containers that are created and destroyed on demand.

In this way, our applications are “stateless”: all of the information (files, data, configuration) is connected to the application virtually (Docker handles these connections through something called “volumes”). So, if properly configured, a student can launch any environment they want and still see the same set of files.
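
A sketch of how both ideas look in jupyterhub_config.py, again assuming DockerSpawner (the image names and mount point are illustrative; in recent dockerspawner versions the image menu is configured through allowed_images):

# The drop-down menu of environments shown at login.
c.DockerSpawner.allowed_images = {
    'Python (SciPy stack)': 'jupyter/scipy-notebook',
    'R':                    'jupyter/r-notebook',
}

# {username} is expanded per user, so every user gets their own named
# volume, mounted at the same path no matter which image they chose.
c.DockerSpawner.volumes = {
    'jupyterhub-user-{username}': '/home/jovyan/work',
}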

Database

A basic SQL database comes prepackaged and configured with JupyterHub, but for “production” purposes the developers of the Jupyter Project advise using a more resilient database such as PostgreSQL, which is the solution we implemented, following the Deployment Example provided by Jupyter.

This Postgres database will exist inside its own container and be connected to an external volume (managed by Docker), so that we can create and destroy the Hub without losing any of our user data.
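
The only change the Hub itself needs is its db_url. A sketch, where the host, credentials, and database name are placeholders (reading the password from the environment keeps secrets out of the config file):

import os

c.JupyterHub.db_url = 'postgresql://jupyterhub:{}@postgres-db:5432/jupyterhub'.format(
    os.environ['POSTGRES_PASSWORD']
)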

Authenticator

The Wiki has a list of authenticator options. The default behavior is to check against the system users on the machine running JupyterHub. In the jupyterhub_config.py file used to set up the Hub, a whitelist and an admins attribute are defined to control who is allowed in. If the Hub runs inside a container, this means that users must be created inside that container! The nice thing about using an authentication service such as OAuth is that it simplifies login entirely, since accounts that exist on other platforms are used to verify users. The downside is that you must own a “fully qualified domain name” to do this, which means you must own the website (this can cost as little as $2-10/year, e.g., via Namecheap).
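
A sketch of both approaches in jupyterhub_config.py (note that in JupyterHub 1.2+ whitelist was renamed allowed_users; the OAuth values are placeholders you would obtain by registering an application with GitHub):

# Default: check against system users, restricted to a whitelist.
c.Authenticator.whitelist = {'alice', 'bob'}
c.Authenticator.admin_users = {'alice'}

# Or: swap in one-click login through the oauthenticator package.
c.JupyterHub.authenticator_class = 'oauthenticator.GitHubOAuthenticator'
c.GitHubOAuthenticator.oauth_callback_url = 'https://hub.example.org/hub/oauth_callback'
c.GitHubOAuthenticator.client_id = 'YOUR_CLIENT_ID'
c.GitHubOAuthenticator.client_secret = 'YOUR_CLIENT_SECRET'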

Proxy

As mentioned in the analogy in an earlier section, the “doorman” that directs traffic around is the component of this application known as the “proxy,” and it is the last piece left in securing and configuring your server.

Volumes

Files and folders, as mentioned, persist on the machine that is hosting the application, and can even be configured (e.g., with symbolic links) to connect to file servers elsewhere. My university, for example, has a system that lets all of us store files on a remote server, and we can “mount” these connections appropriately by configuring JupyterHub so that students can more readily access such resources.
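
With DockerSpawner, such a mount can be exposed to every user alongside their personal volume. A sketch, where the host path and mount points are placeholders for whatever your institution provides:

c.DockerSpawner.volumes = {
    'jupyterhub-user-{username}': '/home/jovyan/work',    # personal files
    '/mnt/department-share':      '/home/jovyan/shared',  # shared file server
}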


  1. which serves as the backbone for Google Chrome.
  2. We note that there are so many possible ways to set up such a system because the developers ensured the modularity of all the application components. There is a lot of segmentation, which allows this technology to scale to thousands of users. The lessons learned by mobile developers about scaling in popularity have all been implemented here.
  3. The distinction is made for (single-user) servers since JupyterHub itself is a server communicating with other applications/servers.
  4. I suppose this would be a good time to mention that any website you visit is simply an alias for an “address” of some other computer elsewhere in the world. The addressing system used on the internet is known as the “IP address,” so localhost tells the browser to look for a connection on the same machine.