Data Wrangling
I recently updated Viz with data from GitHub repos created in 2016. Below are some details of how the raw data is wrangled into visualizations.
Data Flow
Mining data directly from GitHub, Viz is powered by the GitHub API and leverages the following:
- github3.py to access the GitHub API through Python.
- pandas in the following IPython Notebook for data wrangling.
- Google Maps API through geocoder for location data.
- Tableau Public for visualizations.
In the future, Google BigQuery along with GitHub Archive could also supplement the GitHub API.
Imports
import re
import pandas as pd
Prepare Repo Data
Load the repos data and drop duplicates:
repos = pd.read_csv("data/2016/repos-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', repos.shape)
repos = repos.drop_duplicates(subset='full_name', keep='last')
print('Shape after dropping duplicates', repos.shape)
repos.head()
Shape before dropping duplicates (8043, 5)
Shape after dropping duplicates (8040, 5)
| | full_name | stars | forks | description | language |
---|---|---|---|---|---|
0 | yarnpkg/yarn | 21060 | 786 | 📦🐈 Fast, reliable, and secure dependency manag... | JavaScript |
1 | facebookincubator/create-react-app | 17555 | 1821 | Create React apps with no build configuration. | JavaScript |
2 | zeit/hyper | 13618 | 907 | A terminal built on web technologies | JavaScript |
3 | ParsePlatform/parse-server | 12167 | 3319 | Parse-compatible API server module for Node/Ex... | JavaScript |
4 | juliangarnier/anime | 10245 | 539 | Javascript Animation Engine | JavaScript |
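The effect of `keep='last'` is easy to miss on a large dump, so here is a minimal sketch on a toy frame. The repo names and star counts are made up for illustration and are not taken from the 2016 data.

```python
import pandas as pd

# Hypothetical toy data: 'a/x' appears twice with different star counts.
toy = pd.DataFrame({
    'full_name': ['a/x', 'b/y', 'a/x'],
    'stars': [100, 200, 150],
})

# keep='last' retains the final occurrence of each duplicated full_name,
# so the later (150-star) row for 'a/x' survives.
deduped = toy.drop_duplicates(subset='full_name', keep='last')
print(deduped)
```

If the dump is appended to over time, keeping the last occurrence presumably favors the most recently mined row for a repo.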
Separate out the user and repo from full_name into new columns:
def extract_user(line):
    # The owner is the part before the slash in "owner/repo".
    return line.split('/')[0]

def extract_repo(line):
    # The repo name is the part after the slash.
    return line.split('/')[1]

repos['user'] = repos['full_name'].apply(extract_user)
repos['repo'] = repos['full_name'].apply(extract_repo)
print(repos.shape)
repos.head()
(8040, 7)
| | full_name | stars | forks | description | language | user | repo |
---|---|---|---|---|---|---|---|
0 | yarnpkg/yarn | 21060 | 786 | 📦🐈 Fast, reliable, and secure dependency manag... | JavaScript | yarnpkg | yarn |
1 | facebookincubator/create-react-app | 17555 | 1821 | Create React apps with no build configuration. | JavaScript | facebookincubator | create-react-app |
2 | zeit/hyper | 13618 | 907 | A terminal built on web technologies | JavaScript | zeit | hyper |
3 | ParsePlatform/parse-server | 12167 | 3319 | Parse-compatible API server module for Node/Ex... | JavaScript | ParsePlatform | parse-server |
4 | juliangarnier/anime | 10245 | 539 | Javascript Animation Engine | JavaScript | juliangarnier | anime |
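The same user/repo split can also be written as one vectorized call. This is an alternative sketch, not the notebook's method; the two-row frame below is a stand-in for `repos`.

```python
import pandas as pd

# Stand-in for the repos frame, using two full_name values from the table above.
toy = pd.DataFrame({'full_name': ['yarnpkg/yarn', 'zeit/hyper']})

# n=1 splits on the first '/' only; expand=True returns one column per piece,
# which is assigned to the two new columns in order.
toy[['user', 'repo']] = toy['full_name'].str.split('/', n=1, expand=True)
print(toy)
```

Limiting the split with `n=1` keeps any later slash inside the repo part instead of silently discarding it.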
Prepare User Data
Load the users data and drop duplicates:
users = pd.read_csv("data/2016/user-geocodes-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', users.shape)
users = users.drop_duplicates(subset='id', keep='last')
print('Shape after dropping duplicates', users.shape)
users.head()
Shape before dropping duplicates (5991, 8)
Shape after dropping duplicates (5991, 8)
| | id | name | type | location | lat | long | city | country |
---|---|---|---|---|---|---|---|---|
0 | symentis | symentis GmbH | Organization | Unterhaching, Munich | 48.068918 | 11.621253 | Unterhaching | Germany |
1 | voghDev | Olmo Gallegos | User | Granada | 37.177336 | -3.598557 | Granada | Spain |
2 | wxyyxc1992 | 王下邀月熊(Chevalier) | User | NanJing | 32.060255 | 118.796877 | Nanjing | China |
3 | vermont42 | Josh Adams | User | Berkeley, California | 37.871593 | -122.272747 | Berkeley | United States |
4 | mjosaarinen | Markku-Juhani O. Saarinen | User | NaN | NaN | NaN | NaN | NaN |
Rename column id to user:
users.rename(columns={'id': 'user'}, inplace=True)
users.head()
| | user | name | type | location | lat | long | city | country |
---|---|---|---|---|---|---|---|---|
0 | symentis | symentis GmbH | Organization | Unterhaching, Munich | 48.068918 | 11.621253 | Unterhaching | Germany |
1 | voghDev | Olmo Gallegos | User | Granada | 37.177336 | -3.598557 | Granada | Spain |
2 | wxyyxc1992 | 王下邀月熊(Chevalier) | User | NanJing | 32.060255 | 118.796877 | Nanjing | China |
3 | vermont42 | Josh Adams | User | Berkeley, California | 37.871593 | -122.272747 | Berkeley | United States |
4 | mjosaarinen | Markku-Juhani O. Saarinen | User | NaN | NaN | NaN | NaN | NaN |
Merge Repo and User Data
Left join repos and users:
repos_users = pd.merge(repos, users, on='user', how='left')
print('Shape repos:', repos.shape)
print('Shape users:', users.shape)
print('Shape repos_users:', repos_users.shape)
repos_users.head()
Shape repos: (8040, 7)
Shape users: (5991, 8)
Shape repos_users: (8040, 14)
| | full_name | stars | forks | description | language | user | repo | name | type | location | lat | long | city | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yarnpkg/yarn | 21060 | 786 | 📦🐈 Fast, reliable, and secure dependency manag... | JavaScript | yarnpkg | yarn | Yarn | Organization | NaN | NaN | NaN | NaN | NaN |
1 | facebookincubator/create-react-app | 17555 | 1821 | Create React apps with no build configuration. | JavaScript | facebookincubator | create-react-app | Facebook Incubator | Organization | Menlo Park, California | 37.452960 | -122.181725 | Menlo Park | United States |
2 | zeit/hyper | 13618 | 907 | A terminal built on web technologies | JavaScript | zeit | hyper | ZEIT | Organization | NaN | NaN | NaN | NaN | NaN |
3 | ParsePlatform/parse-server | 12167 | 3319 | Parse-compatible API server module for Node/Ex... | JavaScript | ParsePlatform | parse-server | Parse | Organization | Menlo Park, CA | 37.452960 | -122.181725 | Menlo Park | United States |
4 | juliangarnier/anime | 10245 | 539 | Javascript Animation Engine | JavaScript | juliangarnier | anime | Julian Garnier | User | Paris | 48.856614 | 2.352222 | Paris | France |
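A left join keeps every repo row even when no matching user exists, which is why the merged shape still has 8040 rows. The sketch below shows one way to check that behavior on toy frames (all values hypothetical). The `validate` and `indicator` arguments are optional extras that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical toy frames: user 'b' has a repo but no entry in the users data.
repos_toy = pd.DataFrame({'user': ['a', 'a', 'b'], 'repo': ['x', 'y', 'z']})
users_toy = pd.DataFrame({'user': ['a'], 'country': ['France']})

merged = pd.merge(
    repos_toy, users_toy, on='user', how='left',
    validate='many_to_one',  # raises if a user appears twice on the right side
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Repos whose owner was not found in the users data keep NaN in the user columns.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), unmatched['user'].tolist())
```

Because `users` was de-duplicated on its key beforehand, a many-to-one check like this would be expected to pass on the real data.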
Tidy Up Repo and User Data
Re-order the columns:
# reindex_axis was removed in pandas 1.0; reindex(columns=...) is the equivalent call.
repos_users = repos_users.reindex(columns=['full_name',
                                           'repo',
                                           'description',
                                           'stars',
                                           'forks',
                                           'language',
                                           'user',
                                           'name',
                                           'type',
                                           'location',
                                           'lat',
                                           'long',
                                           'city',
                                           'country'])
print(repos_users.shape)
repos_users.head()
(8040, 14)
| | full_name | repo | description | stars | forks | language | user | name | type | location | lat | long | city | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yarnpkg/yarn | yarn | 📦🐈 Fast, reliable, and secure dependency manag... | 21060 | 786 | JavaScript | yarnpkg | Yarn | Organization | NaN | NaN | NaN | NaN | NaN |
1 | facebookincubator/create-react-app | create-react-app | Create React apps with no build configuration. | 17555 | 1821 | JavaScript | facebookincubator | Facebook Incubator | Organization | Menlo Park, California | 37.452960 | -122.181725 | Menlo Park | United States |
2 | zeit/hyper | hyper | A terminal built on web technologies | 13618 | 907 | JavaScript | zeit | ZEIT | Organization | NaN | NaN | NaN | NaN | NaN |
3 | ParsePlatform/parse-server | parse-server | Parse-compatible API server module for Node/Ex... | 12167 | 3319 | JavaScript | ParsePlatform | Parse | Organization | Menlo Park, CA | 37.452960 | -122.181725 | Menlo Park | United States |
4 | juliangarnier/anime | anime | Javascript Animation Engine | 10245 | 539 | JavaScript | juliangarnier | Julian Garnier | User | Paris | 48.856614 | 2.352222 | Paris | France |
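One caveat when re-ordering columns: `DataFrame.reindex(columns=...)` does not complain about a name that is missing from the frame. The sketch below uses a toy frame in which the `forks` column is deliberately absent, to show the difference from plain column selection.

```python
import pandas as pd

# Toy frame without a 'forks' column (hypothetical values).
toy = pd.DataFrame({'stars': [10], 'repo': ['x']})

# reindex silently creates an all-NaN column for the unknown name.
reordered = toy.reindex(columns=['repo', 'stars', 'forks'])
print(reordered.columns.tolist())

# Plain column selection raises instead, which surfaces typos early.
try:
    toy[['repo', 'stars', 'forks']]
    selection_failed = False
except KeyError:
    selection_failed = True
print(selection_failed)
```

Checking that the shape is unchanged after the re-order, as the notebook does, would not catch a misspelled name when it is added on top of the existing columns, so a quick look at the `head()` output is still worthwhile.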
Add Overall Ranks
Rank each repo by its number of stars (rank 1 has the most stars; tied repos share the average of their ranks):
repos_users['rank'] = repos_users['stars'].rank(ascending=False)
print(repos_users.shape)
repos_users.head()
(8040, 15)
| | full_name | repo | description | stars | forks | language | user | name | type | location | lat | long | city | country | rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yarnpkg/yarn | yarn | 📦🐈 Fast, reliable, and secure dependency manag... | 21060 | 786 | JavaScript | yarnpkg | Yarn | Organization | NaN | NaN | NaN | NaN | NaN | 2 |
1 | facebookincubator/create-react-app | create-react-app | Create React apps with no build configuration. | 17555 | 1821 | JavaScript | facebookincubator | Facebook Incubator | Organization | Menlo Park, California | 37.452960 | -122.181725 | Menlo Park | United States | 4 |
2 | zeit/hyper | hyper | A terminal built on web technologies | 13618 | 907 | JavaScript | zeit | ZEIT | Organization | NaN | NaN | NaN | NaN | NaN | 8 |
3 | ParsePlatform/parse-server | parse-server | Parse-compatible API server module for Node/Ex... | 12167 | 3319 | JavaScript | ParsePlatform | Parse | Organization | Menlo Park, CA | 37.452960 | -122.181725 | Menlo Park | United States | 9 |
4 | juliangarnier/anime | anime | Javascript Animation Engine | 10245 | 539 | JavaScript | juliangarnier | Julian Garnier | User | Paris | 48.856614 | 2.352222 | Paris | France | 16 |
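`Series.rank` defaults to `method='average'`, which is why fractional ranks such as 89.5 can appear when repos tie on stars. A small sketch with made-up star counts compares the default with `method='min'`. The latter is shown only as an alternative and is not what the notebook uses.

```python
import pandas as pd

# Made-up star counts with one tie.
stars = pd.Series([500, 300, 300, 100])

# Default method='average': the tied values share the mean of ranks 2 and 3.
average_rank = stars.rank(ascending=False)

# method='min': the tied values both take the lowest rank in the group.
min_rank = stars.rank(ascending=False, method='min')

print(average_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())      # [1.0, 2.0, 2.0, 4.0]
```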
Verify Results: Users
Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100 user:donnemartin
Note: The results might differ slightly, because the search query reflects GitHub's state at the time it is executed, while the data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.
repos_users[repos_users['user'] == 'donnemartin']
| | full_name | repo | description | stars | forks | language | user | name | type | location | lat | long | city | country | rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3692 | donnemartin/gitsome | gitsome | A supercharged Git/GitHub command line interfa... | 4482 | 158 | Python | donnemartin | Donne Martin | User | Washington, D.C. | 38.907192 | -77.036871 | Washington | United States | 89.5 |
3890 | donnemartin/viz | viz | GitHub's most popular repos, interactively vis... | 359 | 27 | Python | donnemartin | Donne Martin | User | Washington, D.C. | 38.907192 | -77.036871 | Washington | United States | 2521.0 |
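The search queries quoted in these verification sections can also be sent to the GitHub REST API's repository search endpoint. The sketch below only builds the URL and makes no network request. `build_search_url` is a hypothetical helper, and the `sort`/`order` values are an assumption about how one might want results ordered.

```python
from urllib.parse import parse_qs, urlencode, urlparse

SEARCH_ENDPOINT = 'https://api.github.com/search/repositories'

def build_search_url(year, min_stars, qualifier=''):
    """Return a search URL for repos created in `year` with at least `min_stars`."""
    query = f'created:{year}-01-01..{year}-12-31 stars:>={min_stars}'
    if qualifier:
        # Optional extra qualifier such as 'user:donnemartin' or 'language:python'.
        query = f'{query} {qualifier}'
    params = {'q': query, 'sort': 'stars', 'order': 'desc'}
    return f'{SEARCH_ENDPOINT}?{urlencode(params)}'

url = build_search_url(2016, 100, 'user:donnemartin')

# Decode the URL again to confirm the query survived percent-encoding.
decoded_query = parse_qs(urlparse(url).query)['q'][0]
print(decoded_query)
```

Passing `'language:python'` or an empty qualifier would give the queries used in the other two verification sections.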
Verify Results: Python Repos
Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100 language:python
Note: The results might differ slightly, because the search query reflects GitHub's state at the time it is executed, while the data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.
print(repos_users[repos_users['language'] == 'Python'].shape)
repos_users[repos_users['language'] == 'Python'].head()
(866, 15)
| | full_name | repo | description | stars | forks | language | user | name | type | location | lat | long | city | country | rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3681 | tensorflow/models | models | Models built with TensorFlow | 10336 | 2707 | Python | tensorflow | NaN | Organization | NaN | NaN | NaN | NaN | NaN | 15 |
3682 | songrotek/Deep-Learning-Papers-Reading-Roadmap | Deep-Learning-Papers-Reading-Roadmap | Deep Learning papers reading roadmap for anyon... | 8707 | 1123 | Python | songrotek | Flood Sung | User | China | 35.861660 | 104.195397 | NaN | China | 23 |
3683 | Rochester-NRT/RocAlphaGo | RocAlphaGo | An independent, student-led replication of Dee... | 7597 | 2111 | Python | Rochester-NRT | NaN | Organization | NaN | NaN | NaN | NaN | NaN | 35 |
3684 | alexjc/neural-doodle | neural-doodle | Turn your two-bit doodles into fine artworks w... | 7226 | 514 | Python | alexjc | Alex J. Champandard | User | Vienna, Austria | 48.208174 | 16.373819 | Vienna | Austria | 38 |
3685 | p-e-w/maybe | maybe | :open_file_folder: :rabbit2: :tophat: See wha... | 6078 | 162 | Python | p-e-w | Philipp Emanuel Weidmann | User | Anywhere the Internet is | NaN | NaN | NaN | NaN | 52 |
Verify Results: Overall Repos
Equivalent GitHub search query: created:2016-01-01..2016-12-31 stars:>=100
Note: The results might differ slightly, because the search query reflects GitHub's state at the time it is executed, while the data in this notebook was mined on January 1, 2017 to 'freeze' the results for the year 2016. The later you run the search after January 1, 2017, the larger the discrepancy.
print(repos_users.shape)
repos_users.head()
(8040, 15)
| | full_name | repo | description | stars | forks | language | user | name | type | location | lat | long | city | country | rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yarnpkg/yarn | yarn | 📦🐈 Fast, reliable, and secure dependency manag... | 21060 | 786 | JavaScript | yarnpkg | Yarn | Organization | NaN | NaN | NaN | NaN | NaN | 2 |
1 | facebookincubator/create-react-app | create-react-app | Create React apps with no build configuration. | 17555 | 1821 | JavaScript | facebookincubator | Facebook Incubator | Organization | Menlo Park, California | 37.452960 | -122.181725 | Menlo Park | United States | 4 |
2 | zeit/hyper | hyper | A terminal built on web technologies | 13618 | 907 | JavaScript | zeit | ZEIT | Organization | NaN | NaN | NaN | NaN | NaN | 8 |
3 | ParsePlatform/parse-server | parse-server | Parse-compatible API server module for Node/Ex... | 12167 | 3319 | JavaScript | ParsePlatform | Parse | Organization | Menlo Park, CA | 37.452960 | -122.181725 | Menlo Park | United States | 9 |
4 | juliangarnier/anime | anime | Javascript Animation Engine | 10245 | 539 | JavaScript | juliangarnier | Julian Garnier | User | Paris | 48.856614 | 2.352222 | Paris | France | 16 |
Output Results
Write out the results to csv:
users.to_csv('data/2016/users.csv', index=False)
repos_users.to_csv('data/2016/repos-users-geocodes.csv', index=False)
repos_users.to_csv('data/2016/repos-users.csv', index=False)
repos_rank = repos_users.reindex(columns=['full_name', 'rank'])
repos_rank.to_csv('data/2016/repos-ranks.csv', index=False)
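Writing with `index=False` keeps the DataFrame's row index out of the file, so the CSV contains only the named columns. A minimal round-trip sketch follows, using an in-memory buffer instead of a path under `data/2016/` and toy values in place of the real ranks.

```python
import io

import pandas as pd

# Toy stand-in for repos_rank (hypothetical values).
toy = pd.DataFrame({'full_name': ['a/x', 'b/y'], 'rank': [1.0, 2.5]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back and check that no extra index column appeared.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Without `index=False`, reading the file back would produce an additional unnamed column holding the old row labels.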
Visualize in Tableau: