Mathematicians’ relations and communities with each other

Photo by Omar Flores on Unsplash

Introduction

Graphs have received a lot of attention in recent years, from graph machine learning methods, including Graph Neural Networks, to graph databases. Even on Medium, people are posting more graph articles than before. One thing I believe most data scientists underestimate is the power of graphs in visualization and storytelling.

Many see graphs as a complex and expensive way to model data science problems. However, they can show how entities relate to each other in a way that no other chart or visualization can.

In this blog post, I will use the Mathematicians of Wikipedia dataset to investigate the relations between mathematicians and their doctoral advisors, using NetworkX for both analysis and visualization.

What is NetworkX?

NetworkX is one of the most popular Python frameworks for working with graphs. I think of it as the scikit-learn of the graph world! When you start learning about graphs and graph frameworks, you will most likely come across NetworkX before any other graph library.

NetworkX provides many graph implementations, algorithms, and methods of analysis. Even though it is too slow for many real-world applications, it offers functionality that is well worth learning.
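
To give a feel for the API, here is a minimal sketch on a made-up graph. The node names are hypothetical and are not taken from the dataset; each edge points from a student to an advisor, which is the convention used later in this post.

```python
import networkx as nx

# Hypothetical toy data: each edge points from a student to their advisor.
edges = [
    ("student_a", "advisor_x"),
    ("student_b", "advisor_x"),
    ("advisor_x", "advisor_y"),
]

toy_graph = nx.DiGraph()
toy_graph.add_edges_from(edges)

# Number of incoming edges per node, i.e. the number of students per advisor.
students_per_node = dict(toy_graph.in_degree())

# Chain of advisors leading from student_a up to advisor_y.
lineage = nx.shortest_path(toy_graph, "student_a", "advisor_y")
```

Nodes are created automatically when edges are added, so a graph can be built directly from a list of pairs.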

Understanding the data

I found the dataset on Kaggle. The table below lists all the features available in this dataset, which describes mathematicians. Note that the examples do not come from the same row.

The highlighted column names (mathematicians and doctoral advisor) are the columns I am interested in, because together they show how mathematicians are connected. My first step is to load the dataset and clean it.

import pandas as pd 
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('../data/data_cleaned.csv')

The first problem with the data is that many of its columns, including doctoral advisor, hold a list of names embedded in a string. As I am only interested in the doctoral advisor column, I will only fix this one, but you can add other column names to list_type_columns to fix them as well.

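list_type_columns = [
    'doctoral advisor'
]
# Treat missing values as empty lists so the string operations below work.
df[list_type_columns] = df[list_type_columns].fillna('[]')
for column in list_type_columns:
    # Strip the quotes and brackets, then split the string into a real list.
    df[column] = df[column].str.replace("'", '', regex=False)
    df[column] = df[column].str.replace("[", '', regex=False)
    df[column] = df[column].str.replace("]", '', regex=False)
    df[column] = df[column].str.split(',')
    # One row per (mathematician, advisor) pair.
    df = df.explode(column)
    # Trim the spaces left by the split and mark empty names as missing.
    df[column] = df[column].str.strip().replace('', np.nan)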
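
As an alternative sketch, the same conversion can be done with ast.literal_eval from the standard library. This assumes every non-missing value is a valid Python list literal such as "['X', 'Y']", which may not hold for every row of the real dataset. The toy_df rows and the parse_list helper are hypothetical and only serve the illustration.

```python
import ast

import pandas as pd

# Hypothetical rows that mimic the string-embedded lists described above.
toy_df = pd.DataFrame({
    "mathematicians": ["A", "B", "C"],
    "doctoral advisor": ["['X', 'Y']", "['X']", None],
})


def parse_list(value):
    """Turn a string such as "['X', 'Y']" into a list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


toy_df["doctoral advisor"] = toy_df["doctoral advisor"].apply(parse_list)

# One row per (mathematician, advisor) pair. Empty lists explode to NaN,
# so mathematicians without a known advisor are dropped here.
pairs = toy_df.explode("doctoral advisor").dropna(subset=["doctoral advisor"])
pairs = pairs.reset_index(drop=True)
```

Parsing the literal keeps names that contain commas or apostrophes intact, whereas stripping characters and splitting on commas would break them apart.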

Now that I have handled the issues in my dataset, replacing null values with empty lists and converting the string-embedded lists into real lists, I can save the result to a CSV file so that I won’t need to process the dataset every time.

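# Only require the two columns that are saved; gaps in other columns do not matter here.
df.dropna(subset=['mathematicians', 'doctoral advisor'], inplace=True)
df[['mathematicians', 'doctoral advisor']].to_csv('../data/adv.csv', index=False)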

I only saved the columns I needed, namely mathematicians and doctoral advisor.

PageRank Algorithm

PageRank computes the rank of each node based on its incoming edges: both how many there are and how highly ranked the nodes they come from are. Google initially developed PageRank to calculate the importance of web pages.

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
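
To make the idea concrete, here is a minimal sketch of PageRank as a plain power iteration. It is a simplified illustration and not the implementation NetworkX uses; the function name pagerank_power_iteration, the damping value 0.85, the fixed iteration count, and the toy graph are all assumptions made for this example.

```python
import networkx as nx
import numpy as np


def pagerank_power_iteration(graph, alpha=0.85, iterations=100):
    """Approximate PageRank scores with plain power iteration."""
    nodes = list(graph.nodes())
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    ranks = np.full(n, 1.0 / n)

    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_ranks = np.full(n, (1.0 - alpha) / n)
        for node in nodes:
            out_degree = graph.out_degree(node)
            share = alpha * ranks[index[node]]
            if out_degree == 0:
                # A node without outgoing links spreads its rank evenly over all nodes.
                new_ranks += share / n
            else:
                # Otherwise its rank is split equally among the nodes it links to.
                for target in graph.successors(node):
                    new_ranks[index[target]] += share / out_degree
        ranks = new_ranks

    return dict(zip(nodes, ranks))


# Hypothetical toy web: three pages link to "hub", which links back to "a".
toy_web = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")])
scores = pagerank_power_iteration(toy_web)
```

In this toy graph "hub" ends up with the highest score because it collects links from every other page, and "a" outranks "b" and "c" because its single incoming link comes from the highly ranked "hub".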

Using PageRank in NetworkX is really easy:

G = nx.DiGraph(nx.path_graph(4)) 
pr = nx.pagerank(G, alpha=0.9)
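
The result is a dictionary that maps each node to its score. The short sketch below repeats the same call and then orders the nodes by score; the ranking variable is only an illustration.

```python
import networkx as nx

# nx.path_graph(4) is the undirected chain 0 - 1 - 2 - 3; wrapping it in
# DiGraph turns every undirected edge into two directed edges.
G = nx.DiGraph(nx.path_graph(4))
pr = nx.pagerank(G, alpha=0.9)

# Nodes ordered from the highest to the lowest score.
ranking = sorted(pr, key=pr.get, reverse=True)
```

The two inner nodes have more neighbours than the two end nodes, so they come first in the ranking.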

Graph Visualization has more insights

Now that we have loaded our dataset, cleaned it, and fixed its issues, let’s look at the mathematicians who advised the most students.

# value_counts already sorts from most to fewest students.
top_advisors = df['doctoral advisor'].value_counts().head(15)
fig, ax = plt.subplots()
ax.barh(top_advisors.index, top_advisors.values)
ax.set_xlabel('Number of Students')
# Mark the 13-student threshold used for filtering below.
ax.axvline(x=13, c='red')
plt.show()

output:

Even though the chart above shows the advisors with the most students, it does not show how they are connected, and it leaves many questions unanswered:

  1. Are these top advisors connected by a shared student?
  2. Are these top advisors also advisors of each other?
  3. Do students also advise other students?

To address the questions above, a graph representation can help us. First, let’s drop all the advisors with fewer than 13 students.

minimum_count = 13
advisor_counts = df['doctoral advisor'].value_counts()
frequent_advisors = advisor_counts[advisor_counts >= minimum_count].index
df = df[df['doctoral advisor'].isin(frequent_advisors)]
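
To see what this filter keeps, here is a small sketch on made-up pairs, written with groupby and transform as one possible alternative. The toy_pairs data and the toy_minimum threshold of 2 are hypothetical stand-ins for the real dataset and minimum_count.

```python
import pandas as pd

# Hypothetical (mathematician, advisor) pairs.
toy_pairs = pd.DataFrame({
    "mathematicians": ["A", "B", "C", "D"],
    "doctoral advisor": ["X", "X", "X", "Y"],
})

toy_minimum = 2  # stand-in for minimum_count

# Number of students per advisor, aligned row by row with the DataFrame.
students = toy_pairs.groupby("doctoral advisor")["mathematicians"].transform("count")
filtered = toy_pairs[students >= toy_minimum]
```

Only the rows of advisor X survive, since Y has a single student. Note that the filter acts on the advisor column alone, so every student of a frequent advisor stays in the data and will appear as a node in the graph.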

We create our graph in `NetworkX`, with each edge pointing from a mathematician to their doctoral advisor, and calculate the PageRank scores.

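# Each edge points from a mathematician to their doctoral advisor.
edges = df[['mathematicians', 'doctoral advisor']].values
graph = nx.DiGraph()
graph.add_nodes_from(np.unique(edges.flatten()))
graph.add_edges_from(edges)
pr = nx.pagerank(graph)
# Split the {name: rank} dictionary into two aligned sequences.
names, ranks = zip(*pr.items())
pr_df = pd.DataFrame(data={'mathematicians': names, 'rank': ranks})
pr_df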

Now that we have both the graph and the ranks, we can draw it, sizing each node by its rank.

fig = plt.figure(1, figsize=(30, 20), dpi=100)
pos = nx.spring_layout(graph, k=1.1*1/np.sqrt(len(graph.nodes())), iterations=20)
nx.draw(graph, node_size=pr_df['rank'].values*10000, with_labels=True, pos=pos, edge_color='gray')
plt.show()
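
Besides reading the answers off the drawing, the three questions raised earlier can also be checked programmatically. The following is a minimal sketch on a made-up graph; the node names are hypothetical, and the edges point from student to advisor as above.

```python
import networkx as nx

# Hypothetical edges, each pointing from a student to a doctoral advisor.
toy_graph = nx.DiGraph([
    ("S1", "A"), ("S1", "B"),  # S1 has two advisors
    ("S2", "A"),
    ("A", "B"),                # advisor A was a student of advisor B
])

# Advisors are the nodes with at least one incoming edge.
advisors = {node for node, degree in toy_graph.in_degree() if degree > 0}

# 1. Pairs of advisors that share at least one student.
shared = {
    (a, b)
    for a in advisors for b in advisors
    if a < b and set(toy_graph.predecessors(a)) & set(toy_graph.predecessors(b))
}

# 2. Advisors who were advised by another advisor.
advisor_links = [(u, v) for u, v in toy_graph.edges() if u in advisors and v in advisors]

# 3. Students who also advise: nodes with both outgoing and incoming edges.
student_advisors = {node for node in advisors if toy_graph.out_degree(node) > 0}
```

On the real graph the same queries would use graph instead of toy_graph.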

Originally published at https://kiani.info.

Machine Learning Engineer, kiani.info

Kiarash Kiani