Kazu's Log

Kazuyoshi Kato is a Japanese software developer in Seattle, Washington. He is interested in programming languages, data visualization and prototyping.

How was 2017 for you? Have you learned Rust? Have you invested in Bitcoin? Was it the year of Linux desktop (again)?

In this post, I’d like to see how was the last year for fellow hackers, through Hacker News’ public dataset on Google BigQuery.

Google BigQuery has interesting public datasets and Hacker News is one of them. Deedy Das already did a lot of interesting analysis in 2015, that covered 9 years from 2006. Most of the SQL queries below are taken from his post.

Technically people can upvote/downvote last year’s stories on Hacker News. There is no “archive” mode like Reddit as far as I know. If you run the queries again, the numbers would be a bit different.

Top Stories

Let’s start from the simplest one. What was the most upvoted stories in 2017?

I mainly used Google Cloud Datalab to get the numbers. %%bq query lets Datalab (which is basically Jupyter Notebook) interrepts a cell as a SQL on BigQuery. If you use BigQuery’s console, you may need #standardSQL instead.

%%bq query
SELECT title, url, score, id, timestamp FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND extract(year FROM timestamp) = 2017
ORDER BY score DESC LIMIT 30

Here is the result. Apparently, it was the year of Silent Breakers, even for Hacker News.

TitleScoreTimestamp
1Reflecting on one very, very strange year at Uber41072017-02-19 21:16:33
2F.C.C. Repeals Net Neutrality Rules33842017-12-14 18:13:35
3Cloudflare Reverse Proxies Are Dumping Uninitialized Memory32382017-02-23 23:05:08
4Announcing the first SHA-1 collision30302017-02-23 13:01:08
5macOS High Sierra: Anyone can login as “root” with empty password30012017-11-28 19:41:10
6W3C abandons consensus, standardizes DRM, EFF resigns27352017-09-18 19:45:08
7CIA malware and hacking tools26992017-03-07 13:10:56
8Ask HN: Is S3 down?25892017-02-28 17:45:58
9Relicensing React, Jest, Flow, and Immutable.js22802017-09-22 21:51:56
10Seven earth-sized planets discovered circling a star 39 light years from Earth22562017-02-22 18:02:28
11How Firefox Got Fast Again21802017-11-13 14:18:26
12I am an Uber survivor21682017-02-27 20:02:00
13Uber Founder Travis Kalanick Resigns as C.E.O.21192017-06-21 05:24:13
14Get started making music21062017-05-09 13:06:39
15Firefox is on a slippery slope20212017-12-16 16:16:40
16Accidentally Stopping a Global Cyber Attack19812017-05-13 12:13:44
17Google Maps' Moat19332017-12-19 23:51:17
18Launch HN: 70MillionJobs (YC S17) – Job board for people with criminal records18792017-08-02 15:46:39
19Atlassian acquires Trello for $425M18572017-01-09 14:01:49
20GitHub lets staff own IP developed for personal projects using company resources18412017-03-21 12:05:44
21SpaceX to Send Privately Crewed Dragon Spacecraft Beyond the Moon Next Year17472017-02-27 21:27:32
22Tim Berners-Lee wins Turing Award17352017-04-04 11:29:30
23Uber Paid Hackers to Delete Stolen Data on 57M People17202017-11-21 22:02:05
24Google Fires Employee Behind Controversial Diversity Memo16972017-08-08 01:31:15
25Firefox 57.0 Released16922017-11-14 13:52:34
26Amazon to Acquire Whole Foods for $13.7B16872017-06-16 13:03:09
27It’s time to give Firefox another chance16742017-09-29 19:27:58
28Net Neutrality Day of Action: Help Preserve the Open Internet16642017-07-12 17:51:25
29Sublime Text 3.016532017-09-13 13:13:32
30Machine Learning 101 slidedeck: 2 years of headbanging, so you don't have to16512017-12-14 00:54:14

Commonly Shared/Upvoted Domains

As Deddy did, let’s see the commonly shared and upvoted domains.

%%bq query
SELECT
  domains_2017.domain,
  COUNT(1) AS count,
  SUM(score) AS score
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    `bigquery-public-data.hacker_news.full`
  WHERE extract(year FROM timestamp) = 2017) domains_2017
GROUP BY
  domains_2017.domain
ORDER BY
  count DESC
LIMIT
  30

The result is not that different from what he got, while there were some new comers like Hacker Noon, Quartz and arXiv.org.

Also note that the top of the actual result was “no domain”, that means “Show HN”, “Ask HN”, … those which don’t have external links.

DomainCountScore
1medium.com17631138515
2github.com15074223278
3youtube.com907639402
4nytimes.com6004144996
5techcrunch.com486793204
6bloomberg.com3970105908
7arstechnica.com303955587
8theguardian.com299351159
9theverge.com255232952
10youtu.be24642527
11hackernoon.com232315092
12goo.gl22642267
13bbc.com211643956
14wired.com200222822
15en.wikipedia.org192314552
16twitter.com186132666
17wsj.com185826364
18washingtonpost.com172547134
19theatlantic.com147726339
20qz.com137623305
21arxiv.org129922885
22reuters.com125129259
23bbc.co.uk124725250
24linkedin.com12154939
25businessinsider.com120612800
26reddit.com118415607
27economist.com103219168
28forbes.com10319085
29cnbc.com100610749

I might be a visual thinker, or simply bad at understanding numbers. Let’s have a scatter plot to see the relationship between the two variables.

While GitHub was still the king of Hacker News in term of the number of upvotes, Medium was the most shared domain in 2017.

The unreadable chuck on the above graph had below. There were a lot of commercial news sites from tech-focused ones (e.g. Verge, Wired) to newspapers (e.g. Wall Street Journal, Washington Post). Also you can see URL shorteners in the frequenty shared but rarely upvoted area.

The area also had Tech in Asia, iAfrikan and Security Affairs, which were constantly shared by few users. The opposite side of the domains are corporate blogs (blog.google, blog.,sdn.microsoft.com) which were not shared frequently, but earned relatively a lot of upvotes.

Commonly Used/Upvoted Words

Let’s see words in the titles.

%%bq query
SELECT word, SUM(score) AS score, SUM(1) AS count FROM
  (SELECT word, score FROM
    (SELECT SPLIT(LOWER(title), ' ') AS words, score FROM
      `bigquery-public-data.hacker_news.full` WHERE EXTRACT(year FROM timestamp) = 2017) AS word_list_and_score
   CROSS JOIN UNNEST(word_list_and_score.words) AS word) AS words_2017
GROUP BY word
ORDER BY score DESC LIMIT 100

The result has a lot of stopwords such as “the”, “to”, “a”, … Instead of manually filtering all of them, I used nltk to filter the words.

import google.datalab.bigquery as bq
import pandas as pd
import nltk

nltk.download('stopwords')
stopwords_set = set(nltk.corpus.stopwords.words('english'))

popular_words = bq.Query('''
SELECT word, SUM(score) AS score, SUM(1) AS count FROM
  (SELECT word, score FROM
    (SELECT SPLIT(LOWER(title), ' ') AS words, score FROM
      `bigquery-public-data.hacker_news.full` WHERE EXTRACT(year FROM timestamp) = 2017) AS word_list_and_score
   CROSS JOIN UNNEST(word_list_and_score.words) AS word) AS words_2017
GROUP BY word
ORDER BY score DESC LIMIT 100
''')
df = popular_words.execute(output_options=bq.QueryOutput.dataframe()).result()
df['is_stopword'] = df['word'].apply(lambda x: x in stopwords_set)

Here is the result. As I previously mentioned, “Show HN”, “Ask HN” were very big in Hacker News. “hn:“, “ask” “show” would be coming from these posts. Having “new” is legitimate since this is Hacker News.

Compared to Deddy’s analysis, the lack of “startup” is sad to mention. Is the golden age of the startup already over?.

WordScoreCount
1hn:37214925460
233122131388
3ask19790914216
4show18211812264
5new14016812149
6google1225446786
7data873197522
8learning760604803
9web722504758
10using721026149
11[pdf]688173939
12code637774174
13open634033674
14programming576022704
15uber549092295
16first534804279
17software509814106
18facebook506613054
19python503852686
20(2016)500801093
21us497123254
22go483282570
23linux453722096
24app453266031
25make452333978
26apple451962981
27use451343698
28years450371906
29people449342432
30machine447002892

Let’s make a scatter plot again. Here is the big picture.

In below, the contrast between “learning” vs. “ai” is interesting. There were a lot of stories about “ai”, but they didn’t get much upvotes compared to “learning”, which would be a part of “machine learning” or “deep learning”.

You can find four programming languages here; JavaScript, Go and Python. There were a lot of JavaScript stories, but they didn’t get much upvotes, compared to Go and Python. JavaScript Fatigue? Go would have Amazon Go and Go as a borad game though.

What’s next?

There are a lot of questions I can ask;

  • I was sometimes comparing Deddy’s 9 years analysis vs. 2017, but it would be interesting to see Year-over-Year to see some trends, like, when did we lose “startup”?
  • I solely checked stories, but how about comments?
  • It was a bit annoying to use ggplot()1 multiple times with different parameters to explore the results. Can I use something different to explore the results interactively?
  • Would it be able to build a web app to explore the results?

I probably will not do all of them, but would try a few. Stay tuned!


  1. By the way, I was using plotnine to have ggplot in Python. [return]