Data Wrangling GitHub Stats with Viz




Data Wrangling

I recently updated Viz with data from GitHub repos created in 2016. Below are some details of how the raw data is wrangled into visualizations.

Data Flow

Mining data directly from GitHub, Viz is powered by the GitHub API and leverages the following:

In the future, Google BigQuery along with GitHub Archive could also supplement the GitHub API.

Imports

import re

import pandas as pd

Prepare Repo Data

Load the repos data and drop duplicates:

repos = pd.read_csv("data/2016/repos-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', repos.shape)
repos = repos.drop_duplicates(subset='full_name', keep='last')
print('Shape after  dropping duplicates', repos.shape)
repos.head()
Shape before dropping duplicates (8043, 5)
Shape after  dropping duplicates (8040, 5)
   full_name                           stars  forks  description                                        language
0  yarnpkg/yarn                        21060    786  📦🐈 Fast, reliable, and secure dependency manag...  JavaScript
1  facebookincubator/create-react-app  17555   1821  Create React apps with no build configuration.     JavaScript
2  zeit/hyper                          13618    907  A terminal built on web technologies               JavaScript
3  ParsePlatform/parse-server          12167   3319  Parse-compatible API server module for Node/Ex...  JavaScript
4  juliangarnier/anime                 10245    539  Javascript Animation Engine                        JavaScript
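
As a minimal sketch of what the de-duplication step does, the toy frame below (made-up values, not taken from the 2016 dump) contains one repeated full_name. It shows how to count the duplicates up front and which row keep='last' retains:

```python
import pandas as pd

# Toy data with hypothetical values: 'a/x' appears twice with different star counts.
toy = pd.DataFrame({
    'full_name': ['a/x', 'b/y', 'a/x'],
    'stars': [100, 200, 150],
})

# Count rows whose full_name already appeared in an earlier row.
n_dupes = int(toy.duplicated(subset='full_name').sum())

# keep='last' retains the final occurrence of each full_name,
# so the 'a/x' row with 150 stars survives and the one with 100 is dropped.
deduped = toy.drop_duplicates(subset='full_name', keep='last')

print('Duplicate rows:', n_dupes)
print(deduped)
```

Counting duplicates first makes the difference between the "before" and "after" shapes easy to account for.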

Separate out the user and repo from full_name into new columns:

def extract_user(line):
    return line.split('/')[0]

def extract_repo(line):
    return line.split('/')[1]

repos['user'] = repos['full_name'].apply(extract_user)
repos['repo'] = repos['full_name'].apply(extract_repo)
print(repos.shape)
repos.head()
(8040, 7)
   full_name                           stars  forks  description                                        language    user               repo
0  yarnpkg/yarn                        21060    786  📦🐈 Fast, reliable, and secure dependency manag...  JavaScript  yarnpkg            yarn
1  facebookincubator/create-react-app  17555   1821  Create React apps with no build configuration.     JavaScript  facebookincubator  create-react-app
2  zeit/hyper                          13618    907  A terminal built on web technologies               JavaScript  zeit               hyper
3  ParsePlatform/parse-server          12167   3319  Parse-compatible API server module for Node/Ex...  JavaScript  ParsePlatform      parse-server
4  juliangarnier/anime                 10245    539  Javascript Animation Engine                        JavaScript  juliangarnier      anime
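
The same split can also be done without helper functions. The sketch below is an alternative rather than what the notebook uses: it relies on pandas' vectorized str.split with expand=True, and the toy frame is only there to keep the example self-contained.

```python
import pandas as pd

# Two sample full_name values in the same owner/repo form as the dump.
toy = pd.DataFrame({'full_name': ['yarnpkg/yarn', 'zeit/hyper']})

# Split once on the first '/', returning one column per part.
parts = toy['full_name'].str.split('/', n=1, expand=True)
toy['user'] = parts[0]
toy['repo'] = parts[1]

print(toy)
```

Limiting the split to n=1 keeps everything after the first slash together in the repo column.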

Prepare User Data

Load the users data and drop duplicates:

users = pd.read_csv("data/2016/user-geocodes-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', users.shape)
users = users.drop_duplicates(subset='id', keep='last')
print('Shape after  dropping duplicates', users.shape)
users.head()
Shape before dropping duplicates (5991, 8)
Shape after  dropping duplicates (5991, 8)
   id           name                       type          location                    lat         long  city          country
0  symentis     symentis GmbH              Organization  Unterhaching, Munich  48.068918    11.621253  Unterhaching  Germany
1  voghDev      Olmo Gallegos              User          Granada               37.177336    -3.598557  Granada       Spain
2  wxyyxc1992   王下邀月熊(Chevalier)           User          NanJing               32.060255   118.796877  Nanjing       China
3  vermont42    Josh Adams                 User          Berkeley, California  37.871593  -122.272747  Berkeley      United States
4  mjosaarinen  Markku-Juhani O. Saarinen  User          NaN                         NaN          NaN  NaN           NaN

Rename column id to user:

users.rename(columns={'id': 'user'}, inplace=True)
users.head()
   user         name                       type          location                    lat         long  city          country
0  symentis     symentis GmbH              Organization  Unterhaching, Munich  48.068918    11.621253  Unterhaching  Germany
1  voghDev      Olmo Gallegos              User          Granada               37.177336    -3.598557  Granada       Spain
2  wxyyxc1992   王下邀月熊(Chevalier)           User          NanJing               32.060255   118.796877  Nanjing       China
3  vermont42    Josh Adams                 User          Berkeley, California  37.871593  -122.272747  Berkeley      United States
4  mjosaarinen  Markku-Juhani O. Saarinen  User          NaN                         NaN          NaN  NaN           NaN

Merge Repo and User Data

Left join repos and users:

repos_users = pd.merge(repos, users, on='user', how='left')
print('Shape repos:', repos.shape)
print('Shape users:', users.shape)
print('Shape repos_users:', repos_users.shape)
repos_users.head()
Shape repos: (8040, 7)
Shape users: (5991, 8)
Shape repos_users: (8040, 14)
full_name stars forks description language user repo name type location lat long city country
0 yarnpkg/yarn 21060 786 📦🐈 Fast, reliable, and secure dependency manag... JavaScript yarnpkg yarn Yarn Organization NaN NaN NaN NaN NaN
1 facebookincubator/create-react-app 17555 1821 Create React apps with no build configuration. JavaScript facebookincubator create-react-app Facebook Incubator Organization Menlo Park, California 37.452960 -122.181725 Menlo Park United States
2 zeit/hyper 13618 907 A terminal built on web technologies JavaScript zeit hyper ZEIT Organization NaN NaN NaN NaN NaN
3 ParsePlatform/parse-server 12167 3319 Parse-compatible API server module for Node/Ex... JavaScript ParsePlatform parse-server Parse Organization Menlo Park, CA 37.452960 -122.181725 Menlo Park United States
4 juliangarnier/anime 10245 539 Javascript Animation Engine JavaScript juliangarnier anime Julian Garnier User Paris 48.856614 2.352222 Paris France
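
A NaN location in the merged frame can mean two different things: the user has no row in the users data at all, or the row exists but was never geocoded (yarnpkg above has a name but no location). The sketch below, on hypothetical toy data, uses two optional pd.merge parameters to tell these apart and to guard the row count. The notebook itself does not use them.

```python
import pandas as pd

# Toy frames with made-up values, using the same join key as above.
repos = pd.DataFrame({
    'full_name': ['a/x', 'a/z', 'b/y'],
    'user': ['a', 'a', 'b'],
})
users = pd.DataFrame({'user': ['a'], 'country': ['France']})

merged = pd.merge(
    repos, users, on='user', how='left',
    validate='many_to_one',  # raises MergeError if a user appears twice in users
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Repos whose owner has no row in users at all.
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print('Repos without a user row:', len(unmatched))
```

With users unique per key, a left join keeps exactly one output row per repo, which is consistent with the shapes printed above (8040 rows before and after).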

Tidy Up Repo and User Data

Re-order the columns:

# reindex_axis was removed in pandas 1.0; reindex(columns=...) is the replacement.
repos_users = repos_users.reindex(columns=['full_name',
                                           'repo',
                                           'description',
                                           'stars',
                                           'forks',
                                           'language',
                                           'user',
                                           'name',
                                           'type',
                                           'location',
                                           'lat',
                                           'long',
                                           'city',
                                           'country'])
print(repos_users.shape)
repos_users.head()
(8040, 14)
full_name repo description stars forks language user name type location lat long city country
0 yarnpkg/yarn yarn 📦🐈 Fast, reliable, and secure dependency manag... 21060 786 JavaScript yarnpkg Yarn Organization NaN NaN NaN NaN NaN
1 facebookincubator/create-react-app create-react-app Create React apps with no build configuration. 17555 1821 JavaScript facebookincubator Facebook Incubator Organization Menlo Park, California 37.452960 -122.181725 Menlo Park United States
2 zeit/hyper hyper A terminal built on web technologies 13618 907 JavaScript zeit ZEIT Organization NaN NaN NaN NaN NaN
3 ParsePlatform/parse-server parse-server Parse-compatible API server module for Node/Ex... 12167 3319 JavaScript ParsePlatform Parse Organization Menlo Park, CA 37.452960 -122.181725 Menlo Park United States
4 juliangarnier/anime anime Javascript Animation Engine 10245 539 JavaScript juliangarnier Julian Garnier User Paris 48.856614 2.352222 Paris France

Add Overall Ranks

Rank each repo by its number of stars, so the most-starred repo gets rank 1:

repos_users['rank'] = repos_users['stars'].rank(ascending=False)
print(repos_users.shape)
repos_users.head()
(8040, 15)
full_name repo description stars forks language user name type location lat long city country rank
0 yarnpkg/yarn yarn 📦🐈 Fast, reliable, and secure dependency manag... 21060 786 JavaScript yarnpkg Yarn Organization NaN NaN NaN NaN NaN 2
1 facebookincubator/create-react-app create-react-app Create React apps with no build configuration. 17555 1821 JavaScript facebookincubator Facebook Incubator Organization Menlo Park, California 37.452960 -122.181725 Menlo Park United States 4
2 zeit/hyper hyper A terminal built on web technologies 13618 907 JavaScript zeit ZEIT Organization NaN NaN NaN NaN NaN 8
3 ParsePlatform/parse-server parse-server Parse-compatible API server module for Node/Ex... 12167 3319 JavaScript ParsePlatform Parse Organization Menlo Park, CA 37.452960 -122.181725 Menlo Park United States 9
4 juliangarnier/anime anime Javascript Animation Engine 10245 539 JavaScript juliangarnier Julian Garnier User Paris 48.856614 2.352222 Paris France 16
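
Series.rank defaults to method='average', which gives tied values the mean of the positions they span. That is the likely source of fractional ranks such as the 89.5 shown for donnemartin/gitsome in the verification output further down. The sketch below uses made-up star counts to compare the default with method='min', which would give tied repos the same whole-number rank:

```python
import pandas as pd

# Hypothetical star counts with one tie (two repos at 300 stars).
stars = pd.Series([500, 300, 300, 100])

# Default: tied values share the average of positions 2 and 3.
rank_average = stars.rank(ascending=False)

# Alternative: tied values share the lowest position, like competition ranking.
rank_min = stars.rank(ascending=False, method='min').astype(int)

print(rank_average.tolist())
print(rank_min.tolist())
```

Which method is preferable depends on how the ranks are displayed. The notebook keeps the default.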

Verify Results: Users

Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100 user:donnemartin

Note: The results might differ slightly, because the search query reflects GitHub's data at the time it is executed. The data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.

repos_users[repos_users['user'] == 'donnemartin']
full_name repo description stars forks language user name type location lat long city country rank
3692 donnemartin/gitsome gitsome A supercharged Git/GitHub command line interfa... 4482 158 Python donnemartin Donne Martin User Washington, D.C. 38.907192 -77.036871 Washington United States 89.5
3890 donnemartin/viz viz GitHub's most popular repos, interactively vis... 359 27 Python donnemartin Donne Martin User Washington, D.C. 38.907192 -77.036871 Washington United States 2521.0
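
The search queries quoted in these verification sections can also be sent to the GitHub search API instead of the web UI. The sketch below only builds the request URL and makes no network call. The helper name build_search_url and its parameters are hypothetical, and this is not necessarily how Viz fetches its data.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# GitHub's REST endpoint for repository search.
SEARCH_URL = 'https://api.github.com/search/repositories'


def build_search_url(year, min_stars, user=None, language=None):
    """Build a repository search URL for repos created in `year` (hypothetical helper)."""
    qualifiers = [
        f'created:{year}-01-01..{year}-12-31',
        f'stars:>={min_stars}',
    ]
    if user:
        qualifiers.append(f'user:{user}')
    if language:
        qualifiers.append(f'language:{language}')
    # Qualifiers are space-separated inside the single 'q' parameter.
    params = {'q': ' '.join(qualifiers), 'sort': 'stars', 'order': 'desc'}
    return SEARCH_URL + '?' + urlencode(params)


url = build_search_url(2016, 100, user='donnemartin')
# Decode the query again to confirm it matches the search string quoted above.
query_back = parse_qs(urlsplit(url).query)['q'][0]

print(url)
print(query_back)
```

Passing language='python' instead of user='donnemartin' would give the query used for the Python repos check.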

Verify Results: Python Repos

Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100 language:python

Note: The results might differ slightly, because the search query reflects GitHub's data at the time it is executed. The data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.

print(repos_users[repos_users['language'] == 'Python'].shape)
repos_users[repos_users['language'] == 'Python'].head()
(866, 15)
full_name repo description stars forks language user name type location lat long city country rank
3681 tensorflow/models models Models built with TensorFlow 10336 2707 Python tensorflow NaN Organization NaN NaN NaN NaN NaN 15
3682 songrotek/Deep-Learning-Papers-Reading-Roadmap Deep-Learning-Papers-Reading-Roadmap Deep Learning papers reading roadmap for anyon... 8707 1123 Python songrotek Flood Sung User China 35.861660 104.195397 NaN China 23
3683 Rochester-NRT/RocAlphaGo RocAlphaGo An independent, student-led replication of Dee... 7597 2111 Python Rochester-NRT NaN Organization NaN NaN NaN NaN NaN 35
3684 alexjc/neural-doodle neural-doodle Turn your two-bit doodles into fine artworks w... 7226 514 Python alexjc Alex J. Champandard User Vienna, Austria 48.208174 16.373819 Vienna Austria 38
3685 p-e-w/maybe maybe :open_file_folder: :rabbit2: :tophat: See wha... 6078 162 Python p-e-w Philipp Emanuel Weidmann User Anywhere the Internet is NaN NaN NaN NaN 52

Verify Results: Overall Repos

Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100

Note: The results might differ slightly, because the search query reflects GitHub's data at the time it is executed. The data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.

print(repos_users.shape)
repos_users.head()
(8040, 15)
full_name repo description stars forks language user name type location lat long city country rank
0 yarnpkg/yarn yarn 📦🐈 Fast, reliable, and secure dependency manag... 21060 786 JavaScript yarnpkg Yarn Organization NaN NaN NaN NaN NaN 2
1 facebookincubator/create-react-app create-react-app Create React apps with no build configuration. 17555 1821 JavaScript facebookincubator Facebook Incubator Organization Menlo Park, California 37.452960 -122.181725 Menlo Park United States 4
2 zeit/hyper hyper A terminal built on web technologies 13618 907 JavaScript zeit ZEIT Organization NaN NaN NaN NaN NaN 8
3 ParsePlatform/parse-server parse-server Parse-compatible API server module for Node/Ex... 12167 3319 JavaScript ParsePlatform Parse Organization Menlo Park, CA 37.452960 -122.181725 Menlo Park United States 9
4 juliangarnier/anime anime Javascript Animation Engine 10245 539 JavaScript juliangarnier Julian Garnier User Paris 48.856614 2.352222 Paris France 16

Output Results

Write out the results to CSV:

users.to_csv('data/2016/users.csv', index=False)
repos_users.to_csv('data/2016/repos-users-geocodes.csv', index=False)
repos_users.to_csv('data/2016/repos-users.csv', index=False)
repos_rank = repos_users.reindex(columns=['full_name', 'rank'])
repos_rank.to_csv('data/2016/repos-ranks.csv', index=False)
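
Passing index=False keeps the DataFrame's row index out of the file, so reading the CSV back yields the same columns. A minimal round-trip sketch, using an in-memory buffer and made-up values in place of the real files:

```python
import io

import pandas as pd

# Toy stand-in for repos_rank, with hypothetical values.
toy = pd.DataFrame({'full_name': ['a/x', 'b/y'], 'rank': [1.0, 2.5]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back; no extra index column appears.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored.columns.tolist())
```

Without index=False, the file would gain a leading unnamed column holding the old row index.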

Visualize in Tableau: