Keep it secret, keep it safe! Preserving anonymity by subverting stylometry
10:15am - 10:45am on Friday, October 5 in PennTop North
Robin Camille Davis
- Audience Level: Beginner
- Slides: https://www.robincamille.com/presentations-data/pygotham2018_davis.pdf
- Watch: https://youtu.be/JVVebmhCkZI
Overview
Can you ever really write anonymously? Forensic software can identify the author of an anonymous document just by noting unconscious style markers, like how often someone uses the word “if.” In this talk, I’ll show you how this is done — and how you could outsmart deanonymization software.
Description
In this talk, I will introduce you to adversarial stylometry and demonstrate several techniques with a web tool I built that uses Flask, the Natural Language Toolkit (NLTK), and Scikit-learn.
What’s stylometry? If you wish to remain anonymous, you can use any number of privacy technologies, but you could still be identified simply by the words you use. Using machine learning, stylometry can identify authors of anonymous documents by analyzing the frequency of function words (“of” and “was,” for example) and comparing results to known writing samples. Your writing style is therefore uniquely quantifiable and can serve reliably as a biometric. Writers who wish to remain anonymous — like whistleblowers, activists, and cryptocurrency inventors — should consider using “adversarial” stylometric techniques to outsmart authorship attribution software. In this presentation, I will explain how this is possible and demonstrate a few ways to preserve your anonymity, including using a synonym replacer programmed in Python.
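The attribution idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the web tool from the talk: the short function-word list, the sample texts, the author labels, and the choice of Manhattan distance are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption; real tools use far longer lists).
FUNCTION_WORDS = ["the", "of", "and", "was", "if", "to", "in", "but"]


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(anonymous_text, known_samples):
    """Return the known author whose profile is closest to the anonymous text."""
    target = profile(anonymous_text)
    return min(
        known_samples,
        key=lambda author: distance(target, profile(known_samples[author])),
    )


# Hypothetical writing samples with known authors.
known = {
    "author_a": "If the plan was sound, the rest of the work was easy. "
                "If the plan was weak, the rest of it was not.",
    "author_b": "We went to town and bought bread and cheese and wine, "
                "and walked to the river to eat in the sun.",
}
anonymous = ("If the weather was fine, the walk was pleasant. "
             "If the rain was heavy, the day was lost.")

print(attribute(anonymous, known))
```

The anonymous text shares no topic with either sample, yet it lands nearest to `author_a` because both lean on "if," "the," and "was." An adversarial writer would try to shift those habits until the closest profile belongs to someone else.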
As a relatively new programmer, I took advantage of several Python libraries to help me build this tool. I will touch on calculating word frequency with NLTK and using Scikit-learn to classify documents. This talk is geared toward people who want to use Python to analyze, transform, and generate written language.
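As a rough idea of how those two libraries can fit together, the sketch below counts function-word frequencies with NLTK's `FreqDist` and passes them to a Scikit-learn classifier. It is not the code behind the talk's tool: the function-word list, the sample sentences, the author labels, and the choice of a 1-nearest-neighbor classifier are assumptions picked to keep the example small.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize
from sklearn.neighbors import KNeighborsClassifier

# Illustrative function-word list (an assumption for this sketch).
FUNCTION_WORDS = ["the", "of", "and", "was", "if", "to"]


def features(text):
    """Relative frequency of each function word, computed with NLTK."""
    # wordpunct_tokenize is regex-based, so no NLTK data download is needed.
    tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]
    freq_dist = FreqDist(tokens)
    return [freq_dist.freq(word) for word in FUNCTION_WORDS]


# Hypothetical labeled writing samples.
samples = [
    ("author_a", "If the plan was sound, the rest of the work was easy."),
    ("author_a", "The house was old, and the roof of the barn was worse."),
    ("author_b", "We went to town and bought bread and cheese and wine."),
    ("author_b", "She likes to read and to write and to walk to work."),
]

X = [features(text) for _, text in samples]
y = [author for author, _ in samples]

# A 1-nearest-neighbor classifier: the simplest possible choice for a demo.
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X, y)

unknown = "If the weather was fine, the walk was pleasant."
predicted = classifier.predict([features(unknown)])[0]
print(predicted)
```

With only four training sentences this proves nothing about real accuracy; a practical classifier would need much longer samples and many more features.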