Manage and automate datasets for data science projects.

Dataset Manager

Manage and automate the datasets for your project with YAML files.

Current Support: Python 3.5, Python 3.6, Python 3.7, Python 3.8

How it Works

This project creates a file called identifier.yaml in your dataset directory with these fields:

source: https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv

description: this dataset is a test dataset

identifier: the name used to reference the dataset; the file is named after it, with a .yaml extension.

source: the location of the dataset.

description: a description of your dataset, to help you remember it later.

Each dataset is a YAML file inside the dataset directory.
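
To make the layout concrete, the sketch below writes one such file and reads it back using only the Python standard library. It is not how dataset_manager is implemented: the helper names write_dataset_file and read_dataset_files and the identifier "titanic" are hypothetical, and the parser handles only flat "key: value" lines, so a real project would use a proper YAML library instead.

```python
import tempfile
from pathlib import Path


def write_dataset_file(dataset_dir, identifier, source, description):
    """Write <identifier>.yaml with the two documented fields (hypothetical helper)."""
    path = Path(dataset_dir) / f"{identifier}.yaml"
    path.write_text(
        f"source: {source}\ndescription: {description}\n", encoding="utf-8"
    )
    return path


def read_dataset_files(dataset_dir):
    """Return {identifier: fields} for every .yaml file in dataset_dir.

    Simplified parser: flat `key: value` lines only, not full YAML.
    """
    datasets = {}
    for path in sorted(Path(dataset_dir).glob("*.yaml")):
        fields = {}
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip() or line.lstrip().startswith("#"):
                continue  # skip blank lines and comments
            # Split on the first colon only, so URLs in the value stay intact.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        # The identifier is the file name without the .yaml extension.
        datasets[path.stem] = fields
    return datasets


with tempfile.TemporaryDirectory() as dataset_dir:
    write_dataset_file(
        dataset_dir,
        "titanic",  # hypothetical identifier
        "https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv",
        "this dataset is a test dataset",
    )
    datasets = read_dataset_files(dataset_dir)
    print(datasets["titanic"]["source"])
```

Keeping one file per dataset means a dataset can be added or removed by creating or deleting a single file, which is presumably what the manager functions below do on your behalf.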

Installing

With pip:

pip install dataset_manager

With conda:

conda install dataset_manager

Using

You can manage your datasets with a set of commands and integrate them with pandas or other data analysis tools.

Manager functions

Show all Datasets

Returns a table of all datasets found in the dataset path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.show_datasets()

Create a Dataset

Creates a dataset, with any information you want, inside the configured dataset_path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.create_dataset(identifier, source, description, **kwargs)

Remove a Dataset

Removes a dataset from dataset_path.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.remove_dataset(identifier)

Prepare Datasets

Downloads and unzips all datasets.

from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

manager.prepare_datasets()

Using Multiple Filesystems

This manager is integrated with PyFilesystem2, so you can use any of its built-in filesystems, a third-party extension, or an extension of your own.

With PyFilesystem2, you can download, extract, and manage datasets anywhere.

from fs.tempfs import TempFS
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download, TempFS())

manager.prepare_datasets()  # all datasets are downloaded and extracted into temporary files, respecting your local_path_to_download hierarchy

Get one Dataset

Returns the entry for a single dataset as a dict.

import pandas as pd
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)

dataset = manager.get_dataset(identifier)

df = pd.read_csv(dataset.uri)

Dataset functions

Download Dataset

Downloads the dataset from its source. The download happens only once, because the cache is checked first. It works with the HTTP, HTTPS, and FTP protocols.

dataset = manager.get_dataset(identifier)

dataset.download()
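
The download-once behaviour can be illustrated with a small standard-library sketch. This is an assumption about the general technique, not the library's actual code: the function name download_once, its return value, and the ".part" temporary file are all hypothetical, and the cache check here is simply "does the local file already exist".

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_once(source, target_dir):
    """Download source into target_dir unless a copy is already there.

    Returns (local_path, downloaded); downloaded is False on a cache hit.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last segment of the URL path.
    filename = Path(urlparse(source).path).name or "dataset"
    local_path = target_dir / filename
    if local_path.exists():
        return local_path, False  # cache hit: nothing is fetched
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete cached file.
    partial_path = target_dir / (filename + ".part")
    # urlopen accepts http, https and ftp URLs (and file:// for local tests).
    with urllib.request.urlopen(source) as response, open(partial_path, "wb") as out:
        shutil.copyfileobj(response, out)
    partial_path.replace(local_path)
    return local_path, True


# Demonstration with a local file:// URL, so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    origin = Path(tmp) / "train.csv"
    origin.write_text("id,survived\n1,0\n", encoding="utf-8")
    cache_dir = Path(tmp) / "cache"
    first = download_once(origin.as_uri(), cache_dir)
    second = download_once(origin.as_uri(), cache_dir)
    print(first[1], second[1])  # True False
```

The second call returns immediately because the file is already in the cache directory.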

Unzip Dataset

Unzips the dataset located at its uri. It works with zip files and the other formats supported by the fs.archive library.

dataset = manager.get_dataset(identifier)

dataset.unzip()
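
For readers who want to see what an extraction step involves, here is a minimal sketch that uses the standard library's shutil.unpack_archive in place of fs.archive. The function name unpack and the archive name titanic.zip are hypothetical, and the library's own implementation may differ.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unpack(archive_path, extract_dir):
    """Extract archive_path into extract_dir and return the extracted file paths."""
    extract_dir = Path(extract_dir)
    extract_dir.mkdir(parents=True, exist_ok=True)
    # unpack_archive picks the format (zip, tar, gztar, ...) from the file extension.
    shutil.unpack_archive(str(archive_path), str(extract_dir))
    return sorted(p for p in extract_dir.rglob("*") if p.is_file())


# Demonstration: build a small zip file, then extract it next to itself.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "titanic.zip"  # hypothetical archive name
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train.csv", "id,survived\n1,0\n")
    extracted = unpack(archive, Path(tmp) / "titanic")
    names = [p.name for p in extracted]
    print(names)  # ['train.csv']
```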

Prepare Dataset

Prepares the dataset by combining the two previous steps: download, then unzip.

dataset = manager.get_dataset(identifier)

dataset.prepare()

Contributing

Just make a pull request and be happy!

Let's grow together ;)
