Looking back 2017, through Hacker News

How was 2017 for you? Have you learned Rust? Have you invested in Bitcoin? Was it the year of Linux desktop (again)?

In this post, I’d like to see how was the last year for fellow hackers, through Hacker News’ public dataset on Google BigQuery.

Google BigQuery has interesting public datasets and Hacker News is one of them. Deedy Das already did a lot of interesting analysis in 2015, that covered 9 years from 2006. Most of the SQL queries below are taken from his post.

Technically people can upvote/downvote last year’s stories on Hacker News. There is no “archive” mode like Reddit as far as I know. If you run the queries again, the numbers would be a bit different.

Top Stories

Let’s start from the simplest one. What was the most upvoted stories in 2017?

I mainly used Google Cloud Datalab to get the numbers. %%bq query lets Datalab (which is basically Jupyter Notebook) interrepts a cell as a SQL on BigQuery. If you use BigQuery’s console, you may need #standardSQL instead.

%%bq query
SELECT title, url, score, id, timestamp FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND extract(year FROM timestamp) = 2017
ORDER BY score DESC LIMIT 30

Here is the result. Apparently, it was the year of Silent Breakers, even for Hacker News.

	Title	Score	Timestamp
1	Reflecting on one very, very strange year at Uber	4107	2017-02-19 21:16:33
2	F.C.C. Repeals Net Neutrality Rules	3384	2017-12-14 18:13:35
3	Cloudflare Reverse Proxies Are Dumping Uninitialized Memory	3238	2017-02-23 23:05:08
4	Announcing the first SHA-1 collision	3030	2017-02-23 13:01:08
5	macOS High Sierra: Anyone can login as “root” with empty password	3001	2017-11-28 19:41:10
6	W3C abandons consensus, standardizes DRM, EFF resigns	2735	2017-09-18 19:45:08
7	CIA malware and hacking tools	2699	2017-03-07 13:10:56
8	Ask HN: Is S3 down?	2589	2017-02-28 17:45:58
9	Relicensing React, Jest, Flow, and Immutable.js	2280	2017-09-22 21:51:56
10	Seven earth-sized planets discovered circling a star 39 light years from Earth	2256	2017-02-22 18:02:28
11	How Firefox Got Fast Again	2180	2017-11-13 14:18:26
12	I am an Uber survivor	2168	2017-02-27 20:02:00
13	Uber Founder Travis Kalanick Resigns as C.E.O.	2119	2017-06-21 05:24:13
14	Get started making music	2106	2017-05-09 13:06:39
15	Firefox is on a slippery slope	2021	2017-12-16 16:16:40
16	Accidentally Stopping a Global Cyber Attack	1981	2017-05-13 12:13:44
17	Google Maps' Moat	1933	2017-12-19 23:51:17
18	Launch HN: 70MillionJobs (YC S17) – Job board for people with criminal records	1879	2017-08-02 15:46:39
19	Atlassian acquires Trello for $425M	1857	2017-01-09 14:01:49
20	GitHub lets staff own IP developed for personal projects using company resources	1841	2017-03-21 12:05:44
21	SpaceX to Send Privately Crewed Dragon Spacecraft Beyond the Moon Next Year	1747	2017-02-27 21:27:32
22	Tim Berners-Lee wins Turing Award	1735	2017-04-04 11:29:30
23	Uber Paid Hackers to Delete Stolen Data on 57M People	1720	2017-11-21 22:02:05
24	Google Fires Employee Behind Controversial Diversity Memo	1697	2017-08-08 01:31:15
25	Firefox 57.0 Released	1692	2017-11-14 13:52:34
26	Amazon to Acquire Whole Foods for $13.7B	1687	2017-06-16 13:03:09
27	It’s time to give Firefox another chance	1674	2017-09-29 19:27:58
28	Net Neutrality Day of Action: Help Preserve the Open Internet	1664	2017-07-12 17:51:25
29	Sublime Text 3.0	1653	2017-09-13 13:13:32
30	Machine Learning 101 slidedeck: 2 years of headbanging, so you don't have to	1651	2017-12-14 00:54:14

Commonly Shared/Upvoted Domains

As Deddy did, let’s see the commonly shared and upvoted domains.

%%bq query
SELECT
  domains_2017.domain,
  COUNT(1) AS count,
  SUM(score) AS score
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    `bigquery-public-data.hacker_news.full`
  WHERE extract(year FROM timestamp) = 2017) domains_2017
GROUP BY
  domains_2017.domain
ORDER BY
  count DESC
LIMIT
  30

The result is not that different from what he got, while there were some new comers like Hacker Noon, Quartz and arXiv.org.

Also note that the top of the actual result was “no domain”, that means “Show HN”, “Ask HN”, … those which don’t have external links.

	Domain	Count	Score
1	medium.com	17631	138515
2	github.com	15074	223278
3	youtube.com	9076	39402
4	nytimes.com	6004	144996
5	techcrunch.com	4867	93204
6	bloomberg.com	3970	105908
7	arstechnica.com	3039	55587
8	theguardian.com	2993	51159
9	theverge.com	2552	32952
10	youtu.be	2464	2527
11	hackernoon.com	2323	15092
12	goo.gl	2264	2267
13	bbc.com	2116	43956
14	wired.com	2002	22822
15	en.wikipedia.org	1923	14552
16	twitter.com	1861	32666
17	wsj.com	1858	26364
18	washingtonpost.com	1725	47134
19	theatlantic.com	1477	26339
20	qz.com	1376	23305
21	arxiv.org	1299	22885
22	reuters.com	1251	29259
23	bbc.co.uk	1247	25250
24	linkedin.com	1215	4939
25	businessinsider.com	1206	12800
26	reddit.com	1184	15607
27	economist.com	1032	19168
28	forbes.com	1031	9085
29	cnbc.com	1006	10749

I might be a visual thinker, or simply bad at understanding numbers. Let’s have a scatter plot to see the relationship between the two variables.

While GitHub was still the king of Hacker News in term of the number of upvotes, Medium was the most shared domain in 2017.

The unreadable chuck on the above graph had below. There were a lot of commercial news sites from tech-focused ones (e.g. Verge, Wired) to newspapers (e.g. Wall Street Journal, Washington Post). Also you can see URL shorteners in the frequenty shared but rarely upvoted area.

The area also had Tech in Asia, iAfrikan and Security Affairs, which were constantly shared by few users. The opposite side of the domains are corporate blogs (blog.google, blog.,sdn.microsoft.com) which were not shared frequently, but earned relatively a lot of upvotes.

Commonly Used/Upvoted Words

Let’s see words in the titles.

%%bq query
SELECT word, SUM(score) AS score, SUM(1) AS count FROM
  (SELECT word, score FROM
    (SELECT SPLIT(LOWER(title), ' ') AS words, score FROM
      `bigquery-public-data.hacker_news.full` WHERE EXTRACT(year FROM timestamp) = 2017) AS word_list_and_score
   CROSS JOIN UNNEST(word_list_and_score.words) AS word) AS words_2017
GROUP BY word
ORDER BY score DESC LIMIT 100

The result has a lot of stopwords such as “the”, “to”, “a”, … Instead of manually filtering all of them, I used nltk to filter the words.

import google.datalab.bigquery as bq
import pandas as pd
import nltk

nltk.download('stopwords')
stopwords_set = set(nltk.corpus.stopwords.words('english'))

popular_words = bq.Query('''
SELECT word, SUM(score) AS score, SUM(1) AS count FROM
  (SELECT word, score FROM
    (SELECT SPLIT(LOWER(title), ' ') AS words, score FROM
      `bigquery-public-data.hacker_news.full` WHERE EXTRACT(year FROM timestamp) = 2017) AS word_list_and_score
   CROSS JOIN UNNEST(word_list_and_score.words) AS word) AS words_2017
GROUP BY word
ORDER BY score DESC LIMIT 100
''')
df = popular_words.execute(output_options=bq.QueryOutput.dataframe()).result()
df['is_stopword'] = df['word'].apply(lambda x: x in stopwords_set)

Here is the result. As I previously mentioned, “Show HN”, “Ask HN” were very big in Hacker News. “hn:“, “ask” “show” would be coming from these posts. Having “new” is legitimate since this is Hacker News.

Compared to Deddy’s analysis, the lack of “startup” is sad to mention. Is the golden age of the startup already over?.

	Word	Score	Count
1	hn:	372149	25460
2	–	331221	31388
3	ask	197909	14216
4	show	182118	12264
5	new	140168	12149
6	google	122544	6786
7	data	87319	7522
8	learning	76060	4803
9	web	72250	4758
10	using	72102	6149
11	[pdf]	68817	3939
12	code	63777	4174
13	open	63403	3674
14	programming	57602	2704
15	uber	54909	2295
16	first	53480	4279
17	software	50981	4106
18	facebook	50661	3054
19	python	50385	2686
20	(2016)	50080	1093
21	us	49712	3254
22	go	48328	2570
23	linux	45372	2096
24	app	45326	6031
25	make	45233	3978
26	apple	45196	2981
27	use	45134	3698
28	years	45037	1906
29	people	44934	2432
30	machine	44700	2892

Let’s make a scatter plot again. Here is the big picture.

In below, the contrast between “learning” vs. “ai” is interesting. There were a lot of stories about “ai”, but they didn’t get much upvotes compared to “learning”, which would be a part of “machine learning” or “deep learning”.

You can find four programming languages here; JavaScript, Go and Python. There were a lot of JavaScript stories, but they didn’t get much upvotes, compared to Go and Python. JavaScript Fatigue? Go would have Amazon Go and Go as a borad game though.

What’s next?

There are a lot of questions I can ask;

I was sometimes comparing Deddy’s 9 years analysis vs. 2017, but it would be interesting to see Year-over-Year to see some trends, like, when did we lose “startup”?
I solely checked stories, but how about comments?
It was a bit annoying to use ggplot()¹ multiple times with different parameters to explore the results. Can I use something different to explore the results interactively?
Would it be able to build a web app to explore the results?

I probably will not do all of them, but would try a few. Stay tuned!

By the way, I was using plotnine to have ggplot in Python. ↩︎

Source: posts/2018/hacker-news-2017.md