Join the community: chat at https://gitter.im/rikaidev/community

Note: This repository is still experimental. API compatibility is not guaranteed.

Rikai

Rikai is a Parquet-based ML data format built for working with unstructured data at scale. Processing large amounts of data for ML is never trivial, and it is especially hard for the images and videos that are often at the core of deep learning applications. We are building Rikai with two main goals:

  1. Enable ML engineers/researchers to have a seamless workflow from Feature Engineering (Spark) to Training (PyTorch/Tensorflow), from notebook to production.
  2. Enable advanced analytics capabilities to support much faster active learning, model debugging, and monitoring in production pipelines.

Current (v0.0.11) main features:

  1. Native support in Jupyter, Scikit-learn, Spark and PyTorch for images, videos and annotations: reduce ad-hoc type conversions and boilerplate when moving between ETL and training.
  2. Custom functionality for working with images and videos at scale: high-level APIs for processing, filtering, sampling, and more.
  3. Run ML models via SQL. Forget Smart Homes, build a Smart Data Warehouse.

Roadmap:

  1. TensorFlow integration
  2. Versioning support built into the dataset
  3. Even richer video capabilities (ffmpeg-python integration)
  4. Declarative annotation API (think vega-lite for annotating images/videos)

Example

from pyspark.sql import Row
from pyspark.ml.linalg import DenseMatrix
from rikai.types import Image, Box2d
from rikai.numpy import wrap
import numpy as np

df = spark.createDataFrame(
    [
        {
            "id": 1,
            "mat": DenseMatrix(2, 2, range(4)),
            "image": Image("s3://foo/bar/1.png"),
            "annotations": [
                Row(
                    label="cat",
                    mask=wrap(np.random.rand(256, 256)),
                    bbox=Box2d(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0),
                )
            ],
        }
    ]
)

df.write.format("rikai").save("s3://path/to/features")

Train on the dataset in PyTorch

from rikai.torch.vision import Dataset
from rikai.torch import DataLoader  # Not needed with PyTorch 1.8+
from torchvision import transforms as T

transform = T.Compose([
   T.Resize(640),
   T.ToTensor(),
   T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

dataset = Dataset(
   "s3://path/to/features",
   columns=["image"],
   transform=transform
)
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,
)
# `model` is assumed to be a torch.nn.Module already moved to the GPU
for batch in loader:
    predicts = model(batch.to("cuda"))
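
To make the T.Normalize step in the transform above concrete, here is a small sketch in plain Python of the per-channel arithmetic it performs: each channel value is shifted by that channel's mean and divided by its standard deviation. The helper name normalize_pixel is hypothetical and is not part of Rikai or torchvision; it handles a single RGB pixel with values in [0, 1], which is the range T.ToTensor produces.

```python
# Per-channel statistics copied from the T.Normalize call above.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to each channel of one RGB pixel.

    Hypothetical helper for illustration only; torchvision applies the
    same formula to every pixel of a whole tensor at once.
    """
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(MEAN)
# A pure white pixel ends up a little over two standard deviations above zero.
white = normalize_pixel((1.0, 1.0, 1.0))
```

The mean and standard deviation values are the ones used in the example; whether they suit a given dataset depends on how the model was pretrained.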

Using an ML model in Spark SQL (experimental)

CREATE MODEL yolo5
OPTIONS (min_confidence=0.3, device="gpu", batch_size=32)
USING "s3://bucket/to/yolo5_spec.yaml";

SELECT id, ML_PREDICT(yolo5, image) FROM my_dataset
WHERE split = "train" LIMIT 100;

Rikai can use MLflow as its model registry. If you use the MLflow model registry, Rikai can automatically pick up the latest model version. The supported model flavors are:

  • PyTorch (pytorch)
  • Scikit-learn (sklearn)

CREATE MODEL yolo5
OPTIONS (min_confidence=0.3, device="gpu", batch_size=32)
USING "mlflow:///yolo5_model/";

SELECT id, ML_PREDICT(yolo5, image) FROM my_dataset
WHERE split = "train" LIMIT 100;
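
The two CREATE MODEL statements above differ only in the URI that follows USING: one points at a YAML spec file on S3, the other uses the mlflow scheme. The sketch below shows one way to tell the two apart with the standard library. The function describe_model_uri is hypothetical, and reading the path of an mlflow:/// URI as the registered model name is an assumption made for illustration, not a description of how Rikai resolves models internally.

```python
from urllib.parse import urlparse


def describe_model_uri(uri):
    """Classify a model URI as used after USING in CREATE MODEL.

    Hypothetical helper: this only inspects the URI string and does
    not contact S3 or an MLflow server.
    """
    parsed = urlparse(uri)
    if parsed.scheme == "mlflow":
        # Assumption: the path component names the registered model.
        return {"source": "mlflow", "name": parsed.path.strip("/")}
    # Anything else is treated as the location of a model spec file.
    return {"source": "spec_file", "location": uri}


registry_model = describe_model_uri("mlflow:///yolo5_model/")
spec_model = describe_model_uri("s3://bucket/to/yolo5_spec.yaml")
```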

Getting Started

Currently Rikai is maintained for Scala 2.12 and Python 3.7 and 3.8.
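
If you want to check an environment before installing, a minimal sketch like the following compares the running interpreter against the Python versions listed above. The function name is_supported_python and the SUPPORTED set are illustrative; Rikai does not ship such a helper, and the version list should be updated if the supported versions change.

```python
import sys

# Python (major, minor) versions named in this README.
SUPPORTED = {(3, 7), (3, 8)}


def is_supported_python(version_info=None):
    """Return True if the given (or current) interpreter is 3.7 or 3.8."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in SUPPORTED


if not is_supported_python():
    print("Warning: Rikai is maintained for Python 3.7 and 3.8 only.")
```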

There are multiple ways to install Rikai:

  1. Try it using the included Dockerfile.
  2. OR install it via pip: pip install rikai, with extras for gcp, pytorch/tf, and others.
  3. OR install it from source

Note: if you want to use Rikai with your own PySpark installation, consult the Rikai documentation for tips.

Docker

The included Dockerfile creates a standalone demo image with Jupyter, PyTorch, Spark, and Rikai preinstalled, along with notebooks for exploring the capabilities of the Rikai feature store.

To build and run the docker image from the current directory:

# Clone the repo
git clone git@github.com:eto-ai/rikai rikai
cd rikai
# Build the docker image
docker build --tag rikai --network host .
# Run the image
docker run -p 0.0.0.0:8888:8888/tcp rikai:latest jupyter lab --ip 0.0.0.0 --port 8888

If successful, the console should then print out a clickable link to JupyterLab. You can also open a browser tab and go to localhost:8888.

Install from pypi

The base rikai library can be installed with just pip install rikai. Dependencies for PyTorch support (pytorch and torchvision) and Jupyter support (matplotlib and jupyterlab) are optional extras. Many open-source datasets use YouTube videos, so pafy and youtube-dl are available as optional extras too.

For example, if you want to use PyTorch in Jupyter to train models on Rikai datasets in S3 containing YouTube videos, you would run:

pip install rikai[pytorch,jupyter,youtube]

If you're not sure what you need and don't mind installing some extra dependencies, you can simply install everything:

pip install rikai[all]
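
After installing, you may want to confirm which optional dependencies are actually importable. The sketch below uses only the standard library. The EXTRA_MODULES mapping is an assumption built from the packages named in this section (using their usual import names); it is not read from Rikai's packaging metadata, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level modules it should provide.
EXTRA_MODULES = {
    "pytorch": ["torch", "torchvision"],
    "jupyter": ["matplotlib", "jupyterlab"],
    "youtube": ["pafy", "youtube_dl"],
}


def missing_modules(extra, modules=EXTRA_MODULES):
    """List the modules for one extra that cannot be found for import."""
    return [name for name in modules[extra] if find_spec(name) is None]


for extra in EXTRA_MODULES:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{extra}: {status}")
```

find_spec only locates a module without importing it, so the check stays fast even for heavy packages such as torch.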

Install from source

To build from source you'll need python as well as Scala with sbt installed:

# Clone the repo
git clone git@github.com:eto-ai/rikai rikai
cd rikai
# Build the jar
sbt publishLocal
# Install python package
cd python
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
