Skip to main content

Machine Learning, Statistics and Utilities around Developer Productivity, Company Productivity and Project Productivity

Project description

CircleCI Codacy Badge

devml

Machine Learning, Statistics and Utilities around Developer Productivity

A few handy bits of functionality:

  • Can checkout all repositories in Github

  • Converts a tree of checked out repositories on disk into a pandas dataframe

  • Statistics on combined DataFrames

Installation

pip install devml

Get environment setup

Code is written to support Python 3.6 or greater. You can get that here: https://www.python.org/downloads/release/python-360/.

An easy way to run the project locally is to check the repo out and in the root of the repo run:

make setup

This create a virtualenv in ~/.devml

Next, source that virtualenv:

source ~/.devml/bin/activate

Run Make All (installs, lints and tests)

make all

# #Example output
#(.devml) ➜  devml git:(master) make all
#pip install -r requirements.txt
#Requirement already satisfied: pytest in /Users/noahgift/.devml/lib/python3.6/site-packages (from -r requirements.txt (line #1)
---------- coverage: platform darwin, python 3.6.2-final-0 -----------
Name                       Stmts   Miss  Cover
----------------------------------------------
devml/__init__.py              1      0   100%
devml/author_stats.py          6      6     0%
devml/fetch_repo.py           54     42    22%
devml/mkdata.py               84     21    75%
devml/org_stats.py            76     55    28%
devml/post_processing.py      50     35    30%
devml/state.py                29      9    69%
devml/stats.py                55     43    22%
devml/ts.py                   29     14    52%
devml/util.py                 12      4    67%
dml.py                       111     66    41%
----------------------------------------------
TOTAL                        507    295    42%
....

You don’t use virtualenv or don’t want to use it. No problem, just run make all it should probably work if you have python 3.6 installed.

make all

Explore Jupyter Notebooks on Github Organizations

You can explore combined datasets here using this example as a starter:

https://github.com/noahgift/devml/blob/master/notebooks/github_data_exploration.ipynb

Pallets Project

Pallets Project

Explore Jupyter Notebooks on Repository Churn

You can explore File Metadata exploration example here:

https://github.com/noahgift/devml/blob/master/notebooks/repo_file_exploration.ipynb

All Files Churned by type:

Pallets Project Relative Churn by file type

Pallets Project Relative Churn by file type

Summary Churn Statistics by type:

Pallets Project by file type Churn statistics

Pallets Project by file type Churn statistics

Expected Configuration

The command-line tools expects for you to create a project directory with a config.json file. Inside the config.json file, you will need to provide an oath token. You can find information about how to do that here: https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/.

Alternately, you can pass these values in via the python API or via the command-line as options. They stand for the following:

  • org: Github Organization (To clone entire tree of repos)

  • checkout_dir: place to checkout

  • oath: personal oath token generated from Github

➜  devml git:(master) ✗ cat project/config.json
{
    "project" :
        {
            "org":"pallets",
            "checkout_dir": "/tmp/checkout",
            "oath": "<keygenerated from Github>"
        }

}

Basic command-line Usage

You can find out stats for a checkout or a directory full of checkout as follows

python dml.py gstats author --path ~/src/mycompanyrepo(s)
Top Commits By Author:                     author_name  commits
0                     John Smith     3059
1                      Sally Joe     2995
2                   Greg Mathews     2194
3                 Jim Mayflower      1448

Basic API Usage (Converting a tree of repo(s) into a pandas DataFrame)

In [1]: from devml import (mkdata, stats)

In [2]: org_df = mkdata.create_org_df(path=/src/mycompanyrepo(s)")
In [3]: author_counts = stats.author_commit_count(org_df)

In [4]: author_counts.head()
Out[4]:
      author_name  commits
0       John Smith     3059
1        Sally Joe     2995
2     Greg Mathews     2194
3    Jim Mayflower     1448
4   Truck Pritter      1441

Clone all repos in Github using API

In [1]: from devml import (mkdata, stats, state, fetch_repo)

In [2]: dest, token, org = state.get_project_metadata("../project/config.json")
In [3]: fetch_repo.clone_org_repos(token, org,
        dest, branch="master")
017-10-14 17:11:36,590 - devml - INFO - Creating Checkout Root:  /tmp/checkout
2017-10-14 17:11:37,346 - devml - INFO - Found Repo # 1 REPO NAME: flask , URL: git@github.com:pallets/flask.git
2017-10-14 17:11:37,347 - devml - INFO - Found Repo # 2 REPO NAME: pallets-sphinx-themes , URL: git@github.com:pallets/pallets-sphinx-themes.git
2017-10-14 17:11:37,347 - devml - INFO - Found Repo # 3 REPO NAME: markupsafe , URL: git@github.com:pallets/markupsafe.git
2017-10-14 17:11:37,348 - devml - INFO - Found Repo # 4 REPO NAME: jinja , URL: git@github.com:pallets/jinja.git
2017-10-14 17:11:37,349 - devml - INFO - Found Repo # 5 REPO NAME: werkzeug , URL: git@githu
In [4]: !ls -l /tmp/checkout
total 0
drwxr-xr-x  21 noahgift  wheel  672 Oct 14 17:11 click
drwxr-xr-x  25 noahgift  wheel  800 Oct 14 17:11 flask
drwxr-xr-x  11 noahgift  wheel  352 Oct 14 17:11 flask-docs
drwxr-xr-x  12 noahgift  wheel  384 Oct 14 17:11 flask-ext-migrate
drwxr-xr-x   8 noahgift  wheel  256 Oct 14 17:11 flask-snippets
drwxr-xr-x  14 noahgift  wheel  448 Oct 14 17:11 flask-website
drwxr-xr-x  18 noahgift  wheel  576 Oct 14 17:11 itsdangerous
drwxr-xr-x  23 noahgift  wheel  736 Oct 14 17:11 jinja
drwxr-xr-x  18 noahgift  wheel  576 Oct 14 17:11 markupsafe
drwxr-xr-x   4 noahgift  wheel  128 Oct 14 17:11 meta
drwxr-xr-x  10 noahgift  wheel  320 Oct 14 17:11 pallets-sphinx-themes
drwxr-xr-x   9 noahgift  wheel  288 Oct 14 17:11 pocoo-sphinx-themes
drwxr-xr-x  15 noahgift  wheel  480 Oct 14 17:11 website
drwxr-xr-x  25 noahgift  wheel  800 Oct 14 17:11 werkzeug

Advanced CLI-Author: Get Activity Statistics for a Tree of Checkouts or a Checkout and sort

 ➜  devml git:(master) ✗ python dml.py gstats activity --path /tmp/checkout --sort active_days

Top Unique Active Days:               author_name  active_days active_duration  active_ratio
86         Armin Ronacher          989       3817 days      0.260000
501  Markus Unterwaditzer          342       1820 days      0.190000
216            David Lord          129        712 days      0.180000
664           Ron DuPlain           78        854 days      0.090000
444         Kenneth Reitz           68       2566 days      0.030000
197      Daniel Neuhäuser           42       1457 days      0.030000
297          Georg Brandl           41       1337 days      0.030000
196     Daniel Neuhäuser           36        435 days      0.080000
450      Keyan Pishdadian           28        885 days      0.030000
169     Christopher Grebs           28       1515 days      0.020000
666    Ronny Pfannschmidt           27       3060 days      0.010000
712           Simon Sapin           22        793 days      0.030000
372           Jeff Widman           19        840 days      0.020000
427    Julen Ruiz Aizpuru           16         36 days      0.440000
21                 Adrian           16       1935 days      0.010000
569        Nicholas Wiles           14        197 days      0.070000
912                lord63           14        692 days      0.020000
756           ThiefMaster           12       1287 days      0.010000
763       Thomas Waldmann           11       1560 days      0.010000
628            Priit Laes           10       1567 days      0.010000
23        Adrian Moennich           10        521 days      0.020000
391  Jochen Kupperschmidt           10       3060 days      0.000000

Advanced CLI-Churn: Get churn by file type

Get the top ten files sorted by churn count with the extension .py:

✗ python dml.py gstats churn --path /Users/noahgift/src/flask --limit 10 --ext .py
2017-10-15 12:10:55,783 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/flask]
                       files  churn_count  line_count extension  \
1            b'flask/app.py'          316      2183.0       .py
3        b'flask/helpers.py'          176      1019.0       .py
5    b'tests/flask_tests.py'          127         NaN       .py
7                b'flask.py'          104         NaN       .py
8                b'setup.py'           80       112.0       .py
10           b'flask/cli.py'           75       759.0       .py
11      b'flask/wrappers.py'           70       194.0       .py
12      b'flask/__init__.py'           65        49.0       .py
13           b'flask/ctx.py'           62       415.0       .py
14  b'tests/test_helpers.py'           62       888.0       .py

    relative_churn
1             0.14
3             0.17
5              NaN
7              NaN
8             0.71
10            0.10
11            0.36
12            1.33
13            0.15
14            0.07

Get descriptive statistics for extension .py and compare to another repository

In this example, flask, this repo and cpython are all compared to see how the median churn is.

(.devml) ➜  devml git:(master) python dml.py gstats metachurn --path /Users/noahgift/src/flask --ext .py --statistic median
2017-10-15 12:39:44,781 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/flask]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  2        85.0            0.13
(.devml) ➜  devml git:(master) python dml.py gstats metachurn --path /Users/noahgift/src/devml --ext .py --statistic median
2017-10-15 12:40:10,999 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/devml]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  1        62.5            0.02

(.devml) ➜  devml git:(master) python dml.py gstats metachurn --path /Users/noahgift/src/cpython --ext .py --statistic median
2017-10-15 12:42:19,260 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/cpython]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  7       169.5             0.1

Get Relative Churn for an Author

python dml.py gstats authorchurnmeta --author "Armin Ronacher" --path /tmp/checkout/flask --ext .py

#He has 6.5% median relative churn...very good.

count    193.000000
mean       0.331860
std        0.625431
min        0.001000
25%        0.030000
50%        0.065000
75%        0.250000
max        3.000000
Name: author_rel_churn, dtype: float64

Deletion Statistics

Find all delete files from repository

DELETION STATISTICS

                                                 files          ext
0                        b'tests/test_deprecations.py'          .py
1                       b'scripts/flask-07-upgrade.py'          .py
2                             b'flask/ext/__init__.py'          .py
3                                  b'flask/exthook.py'          .py
4                        b'scripts/flaskext_compat.py'          .py
5                                 b'tests/test_ext.py'          .py

FAQ

What is Churn and Why Do I Care?

Code churn is the amount of times a file has been modified. Relative churn is the amount of times it has been modified relative to lines of code. Research into defects in software has shown that relative code churn is highly predictive of defects, i.e., the greater the relative churn number the higher the amount of defects.

“Increase in relative code churn measures is accompanied by an increase in system defect density; “

You can read the entire study here: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/icse05churn.pdf

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

devml-0.4.tar.gz (16.4 kB view details)

Uploaded Source

File details

Details for the file devml-0.4.tar.gz.

File metadata

  • Download URL: devml-0.4.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for devml-0.4.tar.gz
Algorithm Hash digest
SHA256 b31eaade63ecc4c000c190d6cb60dbf316643abe7ec286b0f40c5c9ccbdacba6
MD5 0dc0fdfb5ca964a49cfc8cbd864e1ff1
BLAKE2b-256 ed0860ca5fe92b4cf1c421e9c7095770b3d02db06604a361c75ce490c4ae3087

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page