
Pipeline Profiler tool. Enables the exploration of D3M pipelines in Jupyter Notebooks

PipelineProfiler

AutoML pipeline exploration tool compatible with Jupyter Notebooks. Supports the auto-sklearn and D3M pipeline formats.

Screenshot: the PipelineProfiler system screen (Shift-click to select multiple pipelines).

Paper: https://arxiv.org/abs/2005.00160

Video: https://youtu.be/2WSYoaxLLJ8

Blog: Medium post

Demo

Live demo (Google Colab):

In Jupyter Notebook:

import PipelineProfiler
data = PipelineProfiler.get_heartstatlog_data()
PipelineProfiler.plot_pipeline_matrix(data)

Install

Option 1: install via pip:

pip install pipelineprofiler

Option 2: build and run the Docker image (from the directory containing the project's Dockerfile):

docker build -t pipelineprofiler .
docker run -p 9999:8888 pipelineprofiler

Then copy the access token printed by the container and log in to Jupyter in your browser at:

localhost:9999

Data preprocessing

PipelineProfiler reads data from the D3M Metalearning database. You can download this data from: https://metalearning.datadrivendiscovery.org/dumps/2020/03/04/metalearningdb_dump_20200304.tar.gz

To explore the pipelines, you first need to merge two files, pipelines.json and pipeline_runs.json. To do so, run:

python -m PipelineProfiler.pipeline_merge [-n NUMBER_PIPELINES] pipeline_runs_file pipelines_file output_file
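Conceptually, the merge step joins each pipeline run to the pipeline definition it executed. The sketch below illustrates that idea on toy records only; it is not the implementation of PipelineProfiler.pipeline_merge. The function name `merge_runs_with_pipelines`, the field names `id`, `pipeline_id`, `steps`, and `score`, and the `limit` parameter are all hypothetical and do not reflect the schema of the Metalearning database dump.

```python
import json

def merge_runs_with_pipelines(runs, pipelines, limit=None):
    """Attach each run to the pipeline it executed (a toy inner join)."""
    # Index pipeline definitions by id so each run is matched in O(1).
    by_id = {p["id"]: p for p in pipelines}
    merged = []
    for run in runs:
        if limit is not None and len(merged) >= limit:
            break  # stop once the requested number of records is reached
        pipeline = by_id.get(run["pipeline_id"])
        if pipeline is None:
            continue  # the run refers to a pipeline that is not in the dump
        # Copy the pipeline fields and nest the run under a "run" key.
        merged.append({**pipeline, "run": run})
    return merged

# Toy records; every field name here is a hypothetical stand-in.
pipelines = [
    {"id": "p1", "steps": ["imputer", "svm"]},
    {"id": "p2", "steps": ["scaler", "forest"]},
]
runs = [
    {"pipeline_id": "p1", "score": 0.81},
    {"pipeline_id": "p3", "score": 0.40},  # no matching pipeline, dropped
    {"pipeline_id": "p2", "score": 0.77},
]

merged = merge_runs_with_pipelines(runs, pipelines)
print(json.dumps(merged, indent=2))
```

For real data, use the pipeline_merge command above, which knows the actual file layout.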

Pipeline exploration

In a Jupyter notebook, load the output file produced by the merge step:

import PipelineProfiler
import json

with open("output_file.json", "r") as f:
    pipelines = json.load(f)

and then plot it using:

PipelineProfiler.plot_pipeline_matrix(pipelines[:10])
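Slicing with pipelines[:10] simply takes the first ten records, which may mix pipelines from unrelated problems. A minimal sketch of a more targeted selection follows. It assumes each record carries the `problem` → `id` and `scores[0]['value']` fields used by the postprocessing code in this README, and that a higher score value is better. The helper names `count_by_problem` and `select_problem` and the toy records are illustrative and not part of the PipelineProfiler API.

```python
from collections import Counter

def count_by_problem(pipelines):
    """Count how many pipelines were recorded for each problem id."""
    return Counter(p["problem"]["id"] for p in pipelines)

def select_problem(pipelines, problem_id, n=10):
    """Return the n highest-scoring pipelines for one problem."""
    subset = [p for p in pipelines if p["problem"]["id"] == problem_id]
    # Assumption: larger score values are better.
    subset.sort(key=lambda p: p["scores"][0]["value"], reverse=True)
    return subset[:n]

# Toy records that mimic only the fields used above.
toy = [
    {"problem": {"id": "heart"}, "scores": [{"value": 0.7}]},
    {"problem": {"id": "heart"}, "scores": [{"value": 0.9}]},
    {"problem": {"id": "iris"}, "scores": [{"value": 0.5}]},
]

counts = count_by_problem(toy)
best = select_problem(toy, "heart", n=1)
print(counts.most_common())
print(best)
```

With the real data, the list returned by `select_problem(pipelines, some_problem_id)` can be passed to PipelineProfiler.plot_pipeline_matrix in place of pipelines[:10].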

Data postprocessing

You might want to group pipelines by problem and keep only the top k pipelines from each team. To do so, use the following code:

from collections import defaultdict

def get_top_k_pipelines_team(pipelines, k):
    # Group pipelines by the team (source) that produced them.
    team_pipelines = defaultdict(list)
    for pipeline in pipelines:
        source = pipeline['pipeline_source']['name']
        team_pipelines[source].append(pipeline)
    # Keep each team's k best pipelines, ranked by normalized score.
    for team in team_pipelines.keys():
        team_pipelines[team] = sorted(team_pipelines[team], key=lambda x: x['scores'][0]['normalized'], reverse=True)
        team_pipelines[team] = team_pipelines[team][:k]
    # Flatten the per-team lists back into a single list.
    new_pipelines = []
    for team in team_pipelines.keys():
        new_pipelines.extend(team_pipelines[team])
    return new_pipelines

def sort_pipeline_scores(pipelines):
    # Sort pipelines from highest to lowest score value.
    return sorted(pipelines, key=lambda x: x['scores'][0]['value'], reverse=True)

# Group pipelines by problem id.
pipelines_problem = {}
for pipeline in pipelines:
    problem_id = pipeline['problem']['id']
    if problem_id not in pipelines_problem:
        pipelines_problem[problem_id] = []
    pipelines_problem[problem_id].append(pipeline)
# For each problem, keep the top 100 pipelines per team, sorted by score.
for problem in pipelines_problem.keys():
    pipelines_problem[problem] = sort_pipeline_scores(get_top_k_pipelines_team(pipelines_problem[problem], k=100))