Skip to main content

S3 Contents Manager for Jupyter

Project description

S3Contents - Jupyter Notebooks in S3

A transparent, drop-in replacement for Jupyter standard filesystem-backed storage system. With this implementation of a Jupyter Contents Manager you can save all your notebooks, files and directory structure directly to a S3/GCS bucket on AWS/GCP or a self hosted S3 API compatible like MinIO.

Installation

pip install s3contents

Install with GCS dependencies:

pip install s3contents[gcs]

s3contents vs X

While there are some implementations of an S3 Jupyter Content Manager such as s3nb or s3drive s3contents is the only one tested against new versions of Jupyter. It also supports more authentication methods and Google Cloud Storage.

This aims to be a fully tested implementation and it's based on PGContents.

Configuration

Create a jupyter_notebook_config.py file in one of the Jupyter config directories for example: ~/.jupyter/jupyter_notebook_config.py.

Jupyter Notebook Classic: If you plan to use the Classic Jupyter Notebook interface you need to change ServerApp to NotebookApp for all the examples on this page.

AWS S3

from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "<S3 bucket name>"

# Fix JupyterLab dialog issues
c.ServerApp.root_dir = ""

Authentication

Additionally you can configure multiple authentication methods:

Access and secret keys:

c.S3ContentsManager.access_key_id = "<AWS Access Key ID / IAM Access Key ID>"
c.S3ContentsManager.secret_access_key = "<AWS Secret Access Key / IAM Secret Access Key>"

Session token:

c.S3ContentsManager.session_token = "<AWS Session Token / IAM Session Token>"

AWS EC2 role auth setup

It also possible to use IAM Role-based access to the S3 bucket from an Amazon EC2 instance or AWS resource.

To do that just leave any authentication options (access_key_id, secret_access_key) to their default of None and ensure that the EC2 instance has an IAM role which provides sufficient permissions (read and write) for the bucket.

Optional settings

# A prefix in the S3 buckets to use as the root of the Jupyter file system
c.S3ContentsManager.prefix = "this/is/a/prefix/on/the/s3/bucket"

# Server-Side Encryption
c.S3ContentsManager.sse = "AES256"

# Authentication signature version
c.S3ContentsManager.signature_version = "s3v4"

# See AWS key refresh
c.S3ContentsManager.init_s3_hook = init_function

AWS key refresh

The optional init_s3_hook configuration can be used to enable AWS key rotation (described here and here) as follows:

import boto3
import botocore
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session
from configparser import ConfigParser

from s3contents import S3ContentsManager

def refresh_external_credentials():
    config = ConfigParser()
    config.read('/home/jovyan/.aws/credentials')
    return {
        "access_key": config['default']['aws_access_key_id'],
        "secret_key": config['default']['aws_secret_access_key'],
        "token": config['default']['aws_session_token'],
        "expiry_time": config['default']['aws_expiration']
    }

session_credentials = RefreshableCredentials.create_from_metadata(
        metadata = refresh_external_credentials(),
        refresh_using = refresh_external_credentials,
        method = 'custom-refreshing-key-file-reader'
)

def make_key_refresh_boto3(this_s3contents_instance):
    refresh_session =  get_session() # from botocore.session
    refresh_session._credentials = session_credentials
    my_s3_session =  boto3.Session(botocore_session=refresh_session)
    this_s3contents_instance.boto3_session = my_s3_session

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager

c.S3ContentsManager.init_s3_hook = make_key_refresh_boto3

MinIO playground example

You can test this using the play.minio.io:9000 playground:

Just be sure to create the bucket first.

from s3contents import S3ContentsManager

c = get_config()

# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "Q3AM3UQ867SPQQA43P2F"
c.S3ContentsManager.secret_access_key = "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG"
c.S3ContentsManager.endpoint_url = "https://play.minio.io:9000"
c.S3ContentsManager.bucket = "s3contents-demo"
c.S3ContentsManager.prefix = "notebooks/test"

Access local files

To access local file as well as remote files in S3 you can use hybridcontents.

Install it:

pip install hybridcontents

Use a configuration similar to this:

from s3contents import S3ContentsManager
from hybridcontents import HybridContentsManager
from notebook.services.contents.largefilemanager import LargeFileManager

c = get_config()

c.ServerApp.contents_manager_class = HybridContentsManager

c.HybridContentsManager.manager_classes = {
    # Associate the root directory with an S3ContentsManager.
    # This manager will receive all requests that don"t fall under any of the
    # other managers.
    "": S3ContentsManager,
    # Associate /local_directory with a LargeFileManager.
    "local_directory": LargeFileManager,
}

c.HybridContentsManager.manager_kwargs = {
    # Args for root S3ContentsManager.
    "": {
        "access_key_id": "<AWS Access Key ID / IAM Access Key ID>",
        "secret_access_key": "<AWS Secret Access Key / IAM Secret Access Key>",
        "bucket": "<S3 bucket name>",
    },
    # Args for the LargeFileManager mapped to /local_directory
    "local_directory": {
        "root_dir": "/Users/danielfrg/Downloads",
    },
}

GCP - Google Cloud Storage

Install the extra dependencies with:

pip install s3contents[gcs]
from s3contents.gcs import GCSContentsManager

c = get_config(

c.ServerApp.contents_manager_class = GCSContentsManager
c.GCSContentsManager.project = "<your-project>"
c.GCSContentsManager.token = "~/.config/gcloud/application_default_credentials.json"
c.GCSContentsManager.bucket = "<GCP bucket name>"

Note that the file ~/.config/gcloud/application_default_credentials.json assumes a POSIX system when you did gcloud init.

Other configuration

File Save Hooks

If you want to use pre/post file save hooks here are some examples.

A pre_save_hook is written in the exact same way as normal, operating on the file in local storage before committing it to the object store.

def scrub_output_pre_save(model, **kwargs):
    """
    Scrub output before saving notebooks
    """

    # only run on notebooks
    if model["type"] != "notebook":
        return

    # only run on nbformat v4
    if model["content"]["nbformat"] != 4:
        return

    for cell in model["content"]["cells"]:
        if cell["cell_type"] != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None

c.S3ContentsManager.pre_save_hook = scrub_output_pre_save

A post_save_hook instead operates on the file in object storage, because of this it is useful to use the file methods on the contents_manager for data manipulation. In addition, one must use the following function signature (unique to s3contents):

def make_html_post_save(model, s3_path, contents_manager, **kwargs):
    """
    Convert notebooks to HTML after saving via nbconvert
    """
    from nbconvert import HTMLExporter

    if model["type"] != "notebook":
        return

    content, _format = contents_manager.fs.read(s3_path, format="text")
    my_notebook = nbformat.reads(content, as_version=4)

    html_exporter = HTMLExporter()
    html_exporter.template_name = "classic"

    (body, resources) = html_exporter.from_notebook_node(my_notebook)

    base, ext = os.path.splitext(s3_path)
    contents_manager.fs.write(path=(base + ".html"), content=body, format=_format)

c.S3ContentsManager.post_save_hook = make_html_post_save

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

s3contents-0.10.0.tar.gz (26.3 kB view details)

Uploaded Source

Built Distribution

s3contents-0.10.0-py3-none-any.whl (30.5 kB view details)

Uploaded Python 3

File details

Details for the file s3contents-0.10.0.tar.gz.

File metadata

  • Download URL: s3contents-0.10.0.tar.gz
  • Upload date:
  • Size: 26.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.23.0

File hashes

Hashes for s3contents-0.10.0.tar.gz
Algorithm Hash digest
SHA256 10179f0b7ce7facf02b58c7c9f1cd54176552798b59d2ece8794f198b1ddd75a
MD5 99e00a8ad08ca859fcd3cbffdac35d83
BLAKE2b-256 b68e0101ee5fa570edeab5d0236c9e8a8e7281771cc63c1a42c93a59df94449a

See more details on using hashes here.

File details

Details for the file s3contents-0.10.0-py3-none-any.whl.

File metadata

File hashes

Hashes for s3contents-0.10.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ce7914ba81fe8ece4f9bd7bee2b536bc34c787b7e2159552817df22e14b55054
MD5 f5f916ea0de94b668074e4b90cbec6e9
BLAKE2b-256 9329e236feef78c940c09c8dac2161344ff95be39a4fa140992b68e1bb399622

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page