Kazu's Log

A software developer in Seattle

Compared to React, Angular’s commit message guidelines are very detailed:

We have very precise rules over how our git commit messages can be formatted. This leads to more readable messages that are easy to follow when looking through the project history. But also, we use the git commit messages to generate the Angular change log.

and they have been adopted by Vue’s Commit Message Convention as well.

Because of the guidelines, most commit messages on these repos start with a “type” that indicates the kind of change.

The below chart shows the types of changes on Angular over time.

Angular: Types of Changes

Compared to Angular, Vue’s history is much smaller and more sporadic.

Vue: Types of Changes
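Charts like these can be built by extracting the leading type token from each commit message. Here is a minimal sketch; the regex and sample messages are my own illustration, not the exact code behind the charts:

```python
import re

# Illustrative sample messages; Angular's convention is "type(scope): subject"
messages = [
    'fix(router): handle empty URL segments',
    'docs: update contributing guide',
    'feat(core): add new lifecycle hook',
    'Merge pull request #123 from some/branch',
]

def commit_type(message):
    # Match a leading lowercase type, optionally followed by a (scope)
    m = re.match(r'^([a-z]+)(\([^)]*\))?:', message)
    return m.group(1) if m else None

print([commit_type(m) for m in messages])
# → ['fix', 'docs', 'feat', None]
```

Messages that don’t follow the convention (like merge commits) simply map to `None` and can be dropped or bucketed separately.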

I recently finished Udacity’s Intro to Machine Learning, though I haven’t finished the final assignment yet. The next step may be learning deep neural networks, but before that, I’d like to see what I can do with what I’ve learned.

The course used the Enron email dataset a lot, and I want to use something similar, but not the same, to recap things. The dataset should have messages, authors, … How about Git repositories?

But before starting the machine learning part, the first step is loading a Git repository into Python.


There are multiple Python libraries that can interact with Git. I’m unsure which is best, but GitPython is good enough for me.

First, I convert a Git repository (I used facebook/react) into a JSON file.

import git
import json

def commit_summary(c):
    # Aggregate per-file stats (insertions, deletions, lines) into totals
    result = {}
    for path, stats in c.stats.files.items():
        for k in stats:
            result[k] = result.get(k, 0) + stats[k]
    result['file_count'] = len(c.stats.files)
    result['committed_date'] = c.committed_date
    result['hexsha'] = c.hexsha
    result['message'] = c.message
    result['email'] = c.author.email
    return result

react = git.Repo('../react')

with open('react-commits.json', 'w') as out:
    # Write the commits as one JSON array so pd.read_json()
    # can load the file directly
    out.write('[')
    for index, c in enumerate(react.iter_commits('master')):
        if index != 0:
            out.write(',\n')
        json.dump(commit_summary(c), out)
    out.write(']')


Then load the JSON file into Pandas.

import pandas as pd

commits = pd.read_json('react-commits.json')
commits['committed'] = pd.to_datetime(commits['committed_date'], unit = 's')
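Before plotting, it can help to aggregate the commits by month so the time series is less noisy. A minimal sketch with made-up sample data (the real DataFrame comes from react-commits.json):

```python
import pandas as pd

# Made-up sample standing in for the real react-commits.json data
commits = pd.DataFrame({
    'committed_date': [1500000000, 1500086400, 1502764800],
    'insertions': [10, 5, 300],
    'deletions': [2, 1, 4000],
})
commits['committed'] = pd.to_datetime(commits['committed_date'], unit='s')

# Sum insertions/deletions per month (month-start frequency)
monthly = (commits.set_index('committed')[['insertions', 'deletions']]
           .resample('MS').sum())
print(monthly)
```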

React Commits over Time

The Y-axis shows insertions (positive) and deletions (negative).

Commits on facebook/react
ggplot(aes('committed', 'insertions'), commits) + \
  geom_line(aes(color = 1)) + \
  geom_line(aes('committed', '-deletions', color = 2)) + \
  ylab('Added/Deleted')  + xlab('Committed Date') + \
  guides(color=False) + scale_color_gradient()

There are a few spikes in deletions (newest to oldest):

  1. Delete documentation and website source (#11137)
  2. [site] Load libraries from unpkg (#9499)
  3. New Documentation
  4. Merge remote-tracking branch ‘facebook/master’
  5. remove likebutton from docs for now
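Spikes like these can be located with pandas’ `nlargest`. A minimal sketch; the column names follow `commit_summary` above, but the data is made up:

```python
import pandas as pd

# Made-up data with the same columns as the real commits DataFrame
commits = pd.DataFrame({
    'hexsha': ['aaa', 'bbb', 'ccc', 'ddd'],
    'deletions': [3, 52000, 14, 9000],
    'message': ['fix typo', 'Delete documentation and website source',
                'update deps', 'New Documentation'],
})

# The five largest deletion commits (here only four rows exist)
top = commits.nlargest(5, 'deletions')[['hexsha', 'deletions', 'message']]
print(top)
```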

While most of them were administrative changes, the last, oldest commit was a bit funny:

it has some facebook-ism in there and it’s probably shouldn’t be on the site.

I would agree so :)


Just a small tip that I didn’t know in the beginning.

plotnine’s save method takes dpi as a parameter.

from plotnine import * # import ggplot(), aes(), ...
g = ggplot(...)
g.save('plot.png', width = 10, height = 10, dpi = 100)
g.save('plot-2x.png', width = 10, height = 10, dpi = 200)

Then you can use srcset to specify the images from HTML.

<img alt="..." src=".../companies.png" srcset=".../companies-2x.png 2x"/>

Also, during exploration I occasionally set plotnine.options.figure_size to get relatively large images.

import plotnine
plotnine.options.figure_size = (20, 20)

Hacker News Trends

Jan 15, 2018

Previously, I compared Hacker News’ commonly upvoted words in 2018 with Deedy Das’s nine-year analysis from 2006 to 2015. In this post, I will take a look at annual trends from 2006 to 2017.

The below charts are based on the top 1,000 upvoted words of each year. Here is the snippet; I will publish the full notebook later.

import google.datalab.bigquery as bq
from nltk.corpus import stopwords  # assuming NLTK for the stopword list

stopwords_set = set(stopwords.words('english'))

SQL = '''
SELECT word, SUM(score) AS score, SUM(1) AS count FROM
  (SELECT word, score FROM
    (SELECT SPLIT(LOWER(title), ' ') AS words, score FROM
      `bigquery-public-data.hacker_news.full`
     WHERE EXTRACT(year FROM timestamp) = %d) AS word_list_and_score
   CROSS JOIN UNNEST(word_list_and_score.words) AS word) AS words
GROUP BY word
ORDER BY score DESC
LIMIT 1000
'''

def popular_words(year):
  query = bq.Query(SQL % (year))
  df = query.execute(output_options=bq.QueryOutput.dataframe()).result()
  df['is_stopword'] = df['word'].apply(lambda x: x in stopwords_set)
  return df[df['is_stopword'] == False]

annual_rankings = {}

for year in range(2006, 2018):
  annual_rankings[year] = popular_words(year)

Please note that zeros in the charts below just mean “not in the top 1,000 words”.

Mega corps vs. “startup”

Hacker News is backed by Y Combinator, a seed accelerator in Silicon Valley. Hacker News’ Wikipedia article explains the site as

Hacker News is a social news website focusing on computer science and entrepreneurship.

However, “startup” (technically the sum of “startup” and “startups”) has been in a downtrend since 2011. I’m unsure whether this is because the startup era has ended, or because Hacker News is more mainstream than before.

Mega corps vs. startup
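The “startup” + “startups” series can be computed from the annual rankings. A minimal sketch with made-up ranking data (the real data comes from popular_words above):

```python
import pandas as pd

# Made-up annual rankings standing in for popular_words() output
annual_rankings = {
    2011: pd.DataFrame({'word': ['startup', 'startups', 'google'],
                        'score': [9000, 4000, 8000]}),
    2017: pd.DataFrame({'word': ['startup', 'google'],
                        'score': [3000, 12000]}),
}

# Sum the scores of "startup" and "startups" per year;
# a word missing from the top 1,000 simply contributes zero
trend = {}
for year, df in annual_rankings.items():
    mask = df['word'].isin(['startup', 'startups'])
    trend[year] = int(df.loc[mask, 'score'].sum())

print(trend)  # → {2011: 13000, 2017: 3000}
```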


Cloud Datalab is Google’s Jupyter fork, which is provided as a part of Google Cloud.

Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base.

While that is a fun product to play with, I had a few hiccups.


Cloud Datalab includes a set of well-known Python libraries. One of them is ggplot, a Python alternative to R’s ggplot2.

However, that’s not the only alternative you can use. I think plotnine is much better, and I’m not the only one.

Luckily, you can install plotnine on Datalab easily by running

! pip install plotnine

from Jupyter.


Cloud Datalab doesn’t include Bokeh. Long story short, the Notebook integration in Bokeh’s latest version doesn’t work on Datalab.

According to this GitHub issue,

As of 0.12.9 the minimum supportable notebook version is 5.0. There is no technical path that will allow Bokeh to support JupyterLab, classic Notebook 5+ and Classic Notebook 4.x and earlier at the same time, with identical code in each. Supporting JupyterLab is imperative for the project, so earlier classic notebook versions below 5.0 cannot be supported. You can downgrade Bokeh, or upgrade your notebook (or use JupyterLab).

And apparently Datalab is based on Notebook 4.2.3 (I checked Jupyter.version from my browser’s JavaScript console).

You can downgrade Bokeh, or install the datalab package on your latest, local Jupyter notebook. I hope Google updates Datalab’s Jupyter soon.
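If you go the downgrade route, pinning Bokeh below 0.12.9 (the version the issue above names as the cutoff) should work; the exact pin is my assumption, not something the issue spells out:

```shell
# Pin Bokeh below 0.12.9, which per the issue above is the first
# release to drop classic Notebook 4.x support
pip install 'bokeh<0.12.9'
```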