# Efficient-Apriori [![Build Status](https://travis-ci.com/tommyod/Efficient-Apriori.svg?branch=master)](https://travis-ci.com/tommyod/Efficient-Apriori)
An efficient pure Python implementation of the Apriori algorithm.
The Apriori algorithm uncovers hidden structures in categorical data.
The classical example is a database containing purchases from a supermarket.
Every purchase has a number of items associated with it.
We would like to uncover association rules such as `{bread, eggs} -> {bacon}` from the data.
This is the goal of [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning), and the [Apriori algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) is arguably the most famous algorithm for this problem.
This repository contains an efficient, well-tested implementation of the Apriori algorithm as described in the [original paper](https://www.macs.hw.ac.uk/~dwcorne/Teaching/agrawal94fast.pdf) by Agrawal et al., published in 1994.
## Example
Here's a minimal working example.
Notice that in every transaction with `eggs` present, `bacon` is present too.
Therefore, the rule `{eggs} -> {bacon}` is returned with 100 % confidence.
```python
from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)
print(rules)  # [{eggs} -> {bacon}, {soup} -> {bacon}]
```
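To see where the 100 % figure comes from, the support and confidence of a rule can be checked by hand. The sketch below uses the standard textbook definitions and plain Python only; the helper names `support` and `confidence` are hypothetical and are not part of the library's API.

```python
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


# `eggs` occurs in 2 of 3 transactions, and `bacon` is present in both of them.
print(support({'eggs'}, transactions))
print(confidence({'eggs'}, {'bacon'}, transactions))  # 1.0

# The reverse rule is weaker: `bacon` also occurs without `eggs`.
print(confidence({'bacon'}, {'eggs'}, transactions))
```

With `min_confidence=1`, only rules whose confidence reaches 1.0 survive, which is why `{bacon} -> {eggs}` is absent from the output above.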
More examples are included below.
## Installation
Here's how to install from GitHub.
```bash
git clone https://github.com/tommyod/Efficient-Apriori.git
cd Efficient-Apriori
pip install .
```
## Contributing
You are very welcome to scrutinize the code and make pull requests if you have suggestions for improvements.
Your submitted code must be PEP 8 compliant, and all tests must pass.
## More examples
### Filtering and sorting association rules
It's possible to filter and sort the returned list of association rules.
```python
from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

itemsets, rules = apriori(transactions, min_support=0.2, min_confidence=1)

# Print out every rule with 2 items on the left hand side,
# 1 item on the right hand side, sorted by lift
rules_rhs = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 1, rules)
for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # Prints the rule and its confidence, support, lift, ...
```
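The example above sorts by lift. As a rough guide to what that number means, the sketch below computes lift by hand using the standard textbook definition (the confidence of the rule divided by the support of its right hand side). It is assumed, not confirmed here, that `rule.lift` follows the same definition; the helper names are hypothetical.

```python
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)


def lift(lhs, rhs, transactions):
    """Textbook lift: support(lhs and rhs) / (support(lhs) * support(rhs))."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / (support(lhs, transactions) * support(rhs, transactions))


# `bacon` is in every transaction, so knowing the left hand side tells us
# nothing new about it: the lift is about 1.
lift_to_bacon = lift({'eggs', 'soup'}, {'bacon'}, transactions)

# `eggs` is in only 2 of 3 transactions, but always accompanies `apple`,
# so this rule has a lift of roughly 1.5.
lift_to_eggs = lift({'apple', 'bacon'}, {'eggs'}, transactions)

print(lift_to_bacon, lift_to_eggs)
```

A lift near 1 suggests the two sides are roughly independent, while larger values suggest the left hand side makes the right hand side more likely than its baseline frequency.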
### Working with large datasets
If you have data that is too large to fit into memory, you may pass a function returning a generator instead of a list.
The `min_support` will most likely have to be a large value, or the algorithm will take a very long time to terminate.
If you have massive amounts of data, this Python implementation is likely not fast enough, and you should consult more specialized implementations.
```python
from efficient_apriori import apriori


def data_generator(filename):
    """
    Return a function that creates a fresh generator over the file,
    so that the data can be read several times.
    """
    def data_gen():
        with open(filename) as file:
            for line in file:
                # One transaction per line, items separated by commas
                yield tuple(k.strip() for k in line.split(','))

    return data_gen


transactions = data_generator('dataset.csv')
itemsets, rules = apriori(transactions, min_support=0.9, min_confidence=0.6)
```
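The reason for passing a function rather than a generator object is that a generator can only be consumed once, while the data has to be read several times (as the docstring above notes). The plain-Python sketch below illustrates the difference; `make_reader` and the in-memory `rows` list are hypothetical stand-ins for the file-based generator above.

```python
def make_reader(rows):
    """Return a function that creates a fresh generator on every call."""
    def read():
        for row in rows:
            yield tuple(row)
    return read


rows = [['eggs', 'bacon'], ['soup', 'bacon']]

# A generator object is exhausted after a single pass over the data.
gen = make_reader(rows)()
first_pass = list(gen)
second_pass = list(gen)  # empty, nothing is left to yield

# Calling the function again starts over from the beginning.
reader = make_reader(rows)
fresh_first = list(reader())
fresh_second = list(reader())

print(len(first_pass), len(second_pass))    # 2 0
print(len(fresh_first), len(fresh_second))  # 2 2
```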