Large Scale Graph Mining with Spark: What I learned from mapping >15 million websites

1:30pm - 2:00pm on Saturday, October 6 in Madison

Win Suen

Audience Level: Intermediate
Watch: https://youtu.be/LQAaQD2n3u0

Overview

Use the Internet? Fond of graphs? Then this talk may be for you! Learn how to build and explore graphs with Spark GraphFrames. I will be sharing tips, tricks, and gotchas I learned for conducting network analysis with Spark, and diving into the exciting problems you can represent with graphs.

Description

As the web grows ever larger and more content-rich, graph analysis may be one of the most powerful tools for unlocking insights within the mythical big data. That’s totally not fluff, because WIRED wrote about it (https://www.wired.com/insights/2014/03/graph-theory-key-understanding-big-data/).

This talk relates to ongoing research into large-scale graph mining aimed at finding insights into how different websites interact with each other (sometimes in surprising ways!). Spark GraphFrames was integral to exploring the enormous Common Crawl dataset, and the data size really pushed the tool to its limits. Along the way, I learned a great deal about optimizations for representing and computing on graphs.

We’ll talk about:

And much more! A GitHub repo with all the code will also be shared.