Manage and automate datasets for data science projects.
Project description
Dataset Manager
Manage and automate the datasets for your project with YAML files.
How it Works
This project creates a file called identifier.yaml in your dataset directory with these fields:
source: https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv
description: this dataset is a test dataset
identifier: the name used to reference the dataset; it corresponds to the file name with the .yaml extension.
source: the location the dataset is downloaded from.
description: a description of the dataset so you can remember it later.
Each dataset is a YAML file inside the dataset directory.
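For example, a dataset directory containing a single dataset referenced as titanic (a hypothetical identifier used only for illustration) would look like this:
datasets/
    titanic.yaml    # holds the source and description fields shown above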
Installing
With pip just:
pip install dataset_manager
With conda:
conda install dataset_manager
Using
You can manage your datasets with a set of commands and integrate them with pandas or other data analysis tools.
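As a quick overview, here is a minimal end-to-end sketch. The ./datasets and ./data paths are hypothetical placeholders; the calls are the ones documented in the sections below.
import pandas as pd
from dataset_manager import DatasetManager

# hypothetical locations for the YAML files and the downloaded data
manager = DatasetManager("./datasets", "./data")

# register a dataset (writes a titanic.yaml entry in ./datasets)
manager.create_dataset(
    "titanic",
    "https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv",
    "this dataset is a test dataset",
)

manager.prepare_datasets()  # download (and unzip, if needed) every registered dataset

# load the prepared file with pandas
titanic = manager.get_dataset("titanic")
df = pd.read_csv(titanic.uri)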
Manager functions
Show all Datasets
Returns a table with all datasets in the dataset path.
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download)
manager.show_datasets()
Create a Dataset
Creates a dataset, with any extra information you want, inside the defined dataset_path.
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download)
manager.create_dataset(identifier, source, description, **kwargs)
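For instance, the call below registers the Titanic training set shown earlier; the maintainer keyword is a hypothetical example of the extra information you can attach through **kwargs.
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download)
manager.create_dataset(
    "titanic",
    "https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv",
    "this dataset is a test dataset",
    maintainer="data team",  # hypothetical extra field stored with the dataset
)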
Remove a Dataset
Removes a dataset from dataset_path.
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download)
manager.remove_dataset(identifier)
Prepare Datasets
Downloads and unzips all datasets.
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download)
manager.prepare_datasets()
Using Multiple Filesystems
This manager is integrated with Pyfilesystem2, so you can use any of its built-in filesystems, a third-party extension, or an extension of your own.
With Pyfilesystem2, you can download, extract and manage datasets anywhere.
from fs.tempfs import TempFS
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download, TempFS())
manager.prepare_datasets() # all datasets will be downloaded and extracted to temporary files, respecting your local_path_to_download hierarchy
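The same pattern works with any other Pyfilesystem2 filesystem. As a sketch, assuming an in-memory filesystem is acceptable wherever TempFS is:
from fs.memoryfs import MemoryFS
from dataset_manager import DatasetManager

manager = DatasetManager(dataset_path, local_path_to_download, MemoryFS())
manager.prepare_datasets()  # datasets are downloaded and extracted into memory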
Get one Dataset
Returns a single dataset entry, given its identifier.
import pandas as pd
from dataset_manager import DatasetManager
manager = DatasetManager(dataset_path, local_path_to_download)
dataset = manager.get_dataset(identifier)
df = pd.read_csv(dataset.uri)
Dataset functions
Download Dataset
Downloads the dataset from its source. It only downloads once, because the cache is validated first. It works with the HTTP, HTTPS and FTP protocols.
dataset = manager.get_dataset(identifier)
dataset.download()
Unzip Dataset
Unzips the dataset based on its URI. It works with zip files and the other archive formats supported by the fs.archive library.
dataset = manager.get_dataset(identifier)
dataset.unzip()
Prepare Dataset
Prepare combines the two functions above: it downloads and then unzips the dataset.
dataset = manager.get_dataset(identifier)
dataset.prepare()
Contributing
Just make a pull request and be happy!
Let's grow together ;)
Project details
Release history
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file dataset_manager-0.1.0.tar.gz.
File metadata
- Download URL: dataset_manager-0.1.0.tar.gz
- Upload date:
- Size: 8.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1f742e3cd398b715eb07dec2edbe0fbc2d1b0ac571aa5a00648726913e452f55
MD5 | 8df467334c1945846e3db37ebe318211
BLAKE2b-256 | 6b8e0308c7a3bbefb777da88a22c4a507ed99348cb662077cc2fec48f9e1ef1c
File details
Details for the file dataset_manager-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dataset_manager-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | a32b8535f8b5a34569e3f94f895c3071533f709f3f8261a8860569e57aaf95ea
MD5 | 83ceaf705f421a4b8776b70d72e26aa1
BLAKE2b-256 | cec0935cab3b1b7932892c9579e00b147c58103b4b0363437967782cd3374318