Large Scale Graph Mining with Spark: What I learned from mapping >15 million websites
1:30pm - 2:00pm on Saturday, October 6 in Madison
Use the Internet? Fond of graphs? Then this talk may be for you! Learn how to build and explore graphs with Spark GraphFrames. I will be sharing tips, tricks, and gotchas I learned for conducting network analysis with Spark, and diving into the exciting problems you can represent with graphs.
As the web grows ever larger and more content-rich, graph analysis may be one of the most powerful tools for unlocking insights within the mythical big data. That’s totally not fluff, because WIRED wrote about it (https://www.wired.com/insights/2014/03/graph-theory-key-understanding-big-data/).
This talk draws on ongoing research into large-scale graph mining, which aims to find insights into how different websites interact with each other (sometimes in surprising ways!). Spark GraphFrames was integral to exploring the enormous Common Crawl dataset, and the size of the data really pushed the tool to its limits. Along the way, I learned a great deal about optimizing how graphs are represented and computed.
We’ll talk about:
- Why graphs are so fascinating and the types of problems they can help solve.
- How Spark GraphFrames works under the hood.
- How to find clusters of interest in your graph.
- Tips that may help you in your journey (hint: you’re only as good as your data structure).
And much more! A GitHub repo with all the code will also be shared.
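As a small taste of the cluster-finding topic listed above, here is a minimal sketch in plain Python (not Spark, and not the code from the talk). It groups a tiny, made-up set of domains into connected components using a union-find structure. The `src`/`dst` naming mirrors the column names GraphFrames expects in its edge table; the domain names and the helpers `find` and `connected_components` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical link data: (src, dst) means a page on src links to dst.
edges = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("d.example", "e.example"),
]
vertices = sorted({v for edge in edges for v in edge})


def find(parent, v):
    """Return the representative of v's set, shortening the path as we go."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def connected_components(vertices, edges):
    """Group vertices into clusters, ignoring edge direction."""
    parent = {v: v for v in vertices}
    for src, dst in edges:
        root_src, root_dst = find(parent, src), find(parent, dst)
        if root_src != root_dst:
            # Merge the two sets by pointing one root at the other.
            parent[root_dst] = root_src
    clusters = defaultdict(list)
    for v in vertices:
        clusters[find(parent, v)].append(v)
    return sorted(sorted(members) for members in clusters.values())


clusters = connected_components(vertices, edges)
for members in clusters:
    print(members)
```

On a dataset the size of Common Crawl, the same question would go to a distributed engine instead; GraphFrames exposes a `connectedComponents()` method for that purpose.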