Kazu's Log

Kazuyoshi Kato is a Japanese software developer in Seattle, Washington. He is interested in programming languages, data visualization and prototyping.

I recently finished Udacity’s Intro to Machine Learning, although I haven’t done the final assignment yet. The next step may be learning deep neural networks, but before that, I’d like to see what I can do with what I’ve learned.

The course used the Enron email dataset a lot, and I want to use something similar, but not the same, to recap what I learned. The dataset should have messages, authors, … How about Git repositories?

But before starting the machine learning part, the first step is loading a Git repository into Python.

GitPython

There are multiple Python libraries that can interact with Git. I’m unsure which would be the best, but GitPython is good enough for me.

First, I convert a Git repository (I used facebook/react) into a JSON file.

import git
import json
...

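def commit_summary(c):
    """Flatten one commit into a JSON-serializable dict."""
    result = {}
    # c.stats.files maps each changed path to its per-file counters.
    # values()/items() replace the Python 2-only iteritems().
    for stats in c.stats.files.values():
        for k, v in stats.items():
            if isinstance(v, int):  # skip any non-numeric fields
                result[k] = result.get(k, 0) + v
    result['file_count'] = len(c.stats.files)
    result['committed_date'] = c.committed_date  # seconds since the Unix epoch
    result['hexsha'] = c.hexsha
    result['message'] = c.message
    result['email'] = c.author.email
    return result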

react = git.Repo('../react')

with open('react-commits.json', 'w') as out:
    out.write('[\n')

    # Separate the objects with commas so the file is one valid JSON array.
    for index, c in enumerate(react.iter_commits('master')):
        if index != 0:
            out.write(',\n')
        json.dump(commit_summary(c), out)

    out.write(']\n')
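
To make the aggregation in commit_summary concrete, here is a minimal sketch that runs without a repository. It assumes c.stats.files is a dict that maps each path to a dict of integer counters; the paths and numbers are made up for illustration, and summarize_files is a hypothetical helper, not part of GitPython.

```python
# Hand-written stand-in for c.stats.files (hypothetical paths and counts).
files = {
    'src/a.js': {'insertions': 10, 'deletions': 2, 'lines': 12},
    'docs/b.md': {'insertions': 0, 'deletions': 5, 'lines': 5},
}


def summarize_files(files):
    """Sum the per-file counters into one per-commit total."""
    totals = {}
    for stats in files.values():
        for key, value in stats.items():
            totals[key] = totals.get(key, 0) + value
    totals['file_count'] = len(files)
    return totals


summary = summarize_files(files)
print(summary)
```

Each commit ends up as one flat record, which is the shape Pandas can load as a row.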

Then load the JSON file into Pandas.

import pandas as pd
...

commits = pd.read_json('react-commits.json')
commits['committed'] = pd.to_datetime(commits['committed_date'], unit='s')
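
committed_date is stored as seconds since the Unix epoch, which is why the unit is given as 's'. As a sanity check that doesn't need Pandas, the standard library can do the same conversion for a single value. The timestamp below is an arbitrary example, not one taken from the React history.

```python
from datetime import datetime, timezone

# Arbitrary example value: seconds since 1970-01-01T00:00:00Z.
committed_date = 1500000000

# Interpret the value as UTC so the result doesn't depend on the local time zone.
committed = datetime.fromtimestamp(committed_date, tz=timezone.utc)
print(committed.isoformat())
```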

React Commits over Time

The Y-axis shows insertions (positive) and deletions (negative).

Figure: Commits on facebook/react

ggplot(aes('committed', 'insertions'), commits) + \
  geom_line(aes(color = 1)) + \
  geom_line(aes('committed', '-deletions', color = 2)) + \
  ylab('Added/Deleted')  + xlab('Committed Date') + \
  guides(color=False) + scale_color_gradient()

There are a few spikes in deletions (newest to oldest):

  1. Delete documentation and website source (#11137)
  2. [site] Load libraries from unpkg (#9499)
  3. New Documentation
  4. Merge remote-tracking branch ‘facebook/master’
  5. remove likebutton from docs for now

Most of them were administrative changes, but the last (oldest) commit was a bit funny:

it has some facebook-ism in there and it’s probably shouldn’t be on the site.

I would agree :)
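
One way to locate such spikes, sketched here with the standard library rather than Pandas, is to rank the commit records by deletions. The records below are invented placeholders in the same shape as the entries in react-commits.json, and largest_deletions is a hypothetical helper name.

```python
import heapq

# Invented placeholder records, shaped like entries in react-commits.json.
records = [
    {'hexsha': 'aaa', 'message': 'small fix', 'deletions': 3},
    {'hexsha': 'bbb', 'message': 'remove old docs', 'deletions': 4200},
    {'hexsha': 'ccc', 'message': 'refactor', 'deletions': 150},
]


def largest_deletions(records, n):
    """Return the n records with the most deleted lines, largest first."""
    return heapq.nlargest(n, records, key=lambda r: r['deletions'])


for r in largest_deletions(records, 2):
    # Show only the first line of the commit message.
    print(r['deletions'], r['message'].splitlines()[0])
```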

React Committers

This plot, with React committers on the Y-axis, is aesthetically pleasing. It would be interesting to compare it with similar projects, such as Vue or Angular.

Figure: Commits on facebook/react, by author

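# regex=False matches '@fb.com' literally; by default the '.' would match any character.
commits['is_fb_com'] = commits['email'].str.contains('@fb.com', regex=False)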

ggplot(aes('committed', 'factor(email)', color='is_fb_com'), commits) + \
  geom_point(aes(size = 'lines')) + \
  ylab('Author')  + xlab('Committed Date') + \
  scale_y_discrete(labels = [])

The number of commits from non-fb.com email addresses is surprising. I wanted to use the email domain as an indicator of Facebook employees, but that may be too naive.
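
One small part of that naivety is the matching itself: a substring test for '@fb.com' also accepts an address like 'x@fb.com.example.org'. A stricter sketch compares the domain part exactly. is_fb_address is a hypothetical helper, and the addresses are made up.

```python
def is_fb_address(email):
    """True only when the part after the last '@' is exactly fb.com."""
    domain = email.rsplit('@', 1)[-1]
    return domain.lower() == 'fb.com'


# Made-up addresses for illustration.
samples = [
    'alice@fb.com',
    'bob@FB.com',
    'carol@fb.com.example.org',
    'dave@example.com',
]
print([is_fb_address(e) for e in samples])
```

Even with exact matching, this only says which address was used for the commit, so it probably still isn't a reliable indicator of who works at Facebook.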