Battling System Administrators to Expose Chinese Censorship - How I Got Into Web Scraping
It's a long, but fun, story
The Project
I didn’t think that my senior year in college would end in a nightly battle with system administrators of a massive Chinese social media site, but sometimes, life comes at you fast.
Is this some hacker novella that I’m writing instead of a newsletter this week? Not quite. It’s real life, but it isn’t the Brad Pitt thriller you want it to be.
Let’s rewind.
The year is 2018 and I’m finishing up college. My computer science degree required a senior project: one or two semesters of work on a substantial project, followed by an end-of-the-year presentation and a paper.
Now, most people chose to write some dinky little website for a local business, but I decided I’d be a little more ambitious with mine.
Alongside computer science, I had been studying Mandarin Chinese as part of an intensive language program that gave me the opportunity to spend three months in Shanghai. By the end of the program, I was pretty proficient in spoken and written Chinese, so I really wanted to integrate that into my project.
I was also going through a bit of a libertarian phase, and had noticed the insane amount of censorship in the Chinese information space. So, you’ve got a pseudo-libertarian college student with a caffeine habit, some coding skills and a proficiency in Mandarin…
There’s some cool potential there.
Over the years, I had used a site called Zhihu for homework and to practice the language. Essentially, Zhihu is Quora on steroids. There are boards for everything from card games to politics, cooking to knitting, military matters to sports and everything in between.
Because of the diversity of topics on the site, I knew there was a high probability of something being censored. I had noticed posts here and there would be deleted, sometimes even mine, and I knew there were some boards, specifically politics and foreign affairs, that were probably more censored than others.
The Hypothesis
I started with the hypothesis that political subjects were the most likely to be censored.
China takes a pretty hard line on even remotely dissenting thought, throwing it into the abyss of censored language. Protests, strikes, corruption and just about anything else that might run afoul of the Party were very rarely seen on the Chinese internet.
The Plan
So, to test my hypothesis, I noted several general boards (organized by topic, much like Reddit’s subreddits) that were likely to be censored: politics, military, current events, foreign affairs. I also noted several boards associated with more niche issues, including boards for discussions surrounding Japan, Taiwan and Hong Kong. Then, I noted several boards I figured were less likely to be censored: sports, weather, soccer, music, etc.
The plan was to scrape as many posts as possible across these topics, recording the link to each post along with a hash of its content, and to do the same for the comments under those posts. A separate routine would then go back through the scraped posts and comments and check whether they had been changed or deleted. Deleted posts I assumed were censored (obviously an analytically flawed assumption, but it’ll do for an undergraduate research project), and changed posts were marked as “possibly censored,” since someone may have just gone back to fix a typo.
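For the technically curious, here’s a minimal sketch of that change-detection idea in Python. The fetch_post helper and the record’s field names are placeholders for illustration, not my original code:

```python
import hashlib

def content_hash(text: str) -> str:
    """Return a stable fingerprint of a post's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_post(record: dict, fetch_post) -> str:
    """Re-fetch a previously scraped post and classify its status.

    record: {"url": ..., "hash": ...} saved on the first scrape.
    fetch_post: callable returning the post's text, or None if it's gone.
    """
    current = fetch_post(record["url"])
    if current is None:
        return "deleted"            # assumed censored (admittedly a rough proxy)
    if content_hash(current) != record["hash"]:
        return "possibly censored"  # could also just be an edit or a typo fix
    return "unchanged"
```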
The Battle
I started getting results pretty much immediately. I soon noted clear patterns that supported my hypothesis: politically sensitive posts were being censored on the boards that I figured they would be.
What I didn’t expect, though, was the battle I would wage with the system administrators for weeks to come after starting up the scraper.
Before we get into that, I want to note that, as I teach in the ethics section of my book on web scraper development, I was rate limiting my requests and streamlining my scrapers’ collection flow so as not to overburden the server with too much traffic. I was going as slow as could reasonably be expected with my collection. I was also routing my requests through public proxies, TOR and a proxy layer I had set up through the DigitalOcean API, in the hope of avoiding the ire of the Zhihu sysadmins.
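To give a feel for what rate-limited, proxy-routed collection looks like in practice, here’s a minimal sketch using the requests library. The proxy addresses and delay values are illustrative, not the ones I actually used:

```python
import time
import requests

# Illustrative proxy endpoints only (203.0.113.0/24 is a reserved example range).
PROXIES = [
    {"http": "http://203.0.113.10:8080", "https": "http://203.0.113.10:8080"},
    # ...more public or self-hosted proxy endpoints
]

def polite_get(url: str, proxy_index: int = 0, delay: float = 5.0) -> requests.Response:
    """Fetch a URL through a proxy, then sleep so the server isn't hammered."""
    resp = requests.get(url, proxies=PROXIES[proxy_index % len(PROXIES)], timeout=30)
    time.sleep(delay)  # fixed delay for now; randomizing it comes later in the story
    return resp
```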
Little did I know, things were going to be quite a bit more difficult.
Their countermeasures started from the very beginning, but at first they were fairly ham-fisted: IP bans came first, and my list of publicly available proxies dwindled. I added functionality to use third-party proxies and VPNs and implemented my own custom proxy layer, which worked for a bit.
Then, more advanced defenses started to kick in. All of a sudden, the layout of the site itself changed. IDs in HTML tags that I was using to parse data from the page started to change, subtly enough not to break the site but significant enough to break my scraper’s parsing while I was out at the bar with my friends. I returned in a less than spry state after a long night out to find my scraper ground to a halt, the parser choking on bad data.
It took a while to circumvent that one. I basically had to generalize my parsing logic such that they would have to majorly restructure the site’s data in order to break the parser again.
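To illustrate the kind of generalization I mean, a parser can key on structural features that are expensive for admins to shuffle, like text length and nesting, rather than on exact IDs or class names. The heuristics below are made up for the example, not Zhihu’s actual markup:

```python
from bs4 import BeautifulSoup

def extract_posts(html: str) -> list[str]:
    """Pull post-like blocks of text without relying on exact IDs or classes."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for tag in soup.find_all(["div", "article"]):
        text = tag.get_text(strip=True)
        # Heuristic: a "post" is a reasonably long block of text that has
        # no nested candidate blocks of its own. Thresholds are invented.
        if len(text) > 100 and not tag.find(["div", "article"]):
            posts.append(text)
    return posts
```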
Next came the slowdowns. At first I thought it was a bad algorithm on my end: since the program checked on posts over time, each run of the backend checking interface had more work to do as more posts piled into the MongoDB database. Thinking this would fix it, I wrote a simple routine to phase posts out after a certain amount of time, removing them from the database to keep the back end from gumming up. The requests, however, still slowed to a crawl.
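The phase-out routine amounts to a periodic delete of anything older than a cutoff. A rough pymongo sketch, with placeholder database, collection and field names:

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["scraper"]["posts"]  # placeholder database and collection names

def phase_out(days: int = 30) -> int:
    """Drop posts first seen more than `days` ago so the re-check loop stays small."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    result = posts.delete_many({"first_seen": {"$lt": cutoff}})
    return result.deleted_count
```

(MongoDB’s TTL indexes can expire documents server-side the same way, which in hindsight would have been the cleaner choice.)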
I don’t know how the system administrators on the other end of the world were hunting me, but they definitely were. I tried several different tactics, and one seemed to work fairly well. Instead of waiting a static amount of time between requests, say 5 seconds, I randomized the wait period within a certain range. So, between requests 1 and 2 I might wait 3 seconds, then between 2 and 3, 8 seconds, then 1 second, then 5 seconds, and so on. While this meant some requests were slower, it defeated whatever slowdown the system administrators had implemented, and the scraping process hummed along much faster overall.
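The trick itself is only a couple of lines; here’s a sketch using the same 1-to-8-second range as the example above:

```python
import random
import time

def jittered_sleep(low: float = 1.0, high: float = 8.0) -> None:
    """Sleep a random duration between `low` and `high` seconds."""
    time.sleep(random.uniform(low, high))

# Between each scrape, call jittered_sleep() instead of time.sleep(5),
# so the request timing no longer looks machine-generated.
```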
I was still dealing with IP blocks and subtle site changes, but I fell into a repeatable groove with the system administrators across the pond. I found ways to automate my responses to their defenses: scraping public lists of proxies and rotating out the proxies I was using if they threw a certain number of errors or the scraping speed fell below a certain threshold. I reinitialized my TOR routes automatically every X scraping runs and randomized which subtopics I scraped and in what order. Essentially, I was trying to appear as close to a normal user, or at least several different users, as possible.
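The rotation logic boiled down to something like the sketch below. The error threshold is illustrative, and refreshing the pool from public proxy lists and reinitializing TOR circuits would hang off this same skeleton:

```python
import random

class ProxyPool:
    """Rotate proxies out of use once they throw too many errors."""

    def __init__(self, proxies: list[str], max_errors: int = 5):
        self.proxies = list(proxies)
        self.errors = {p: 0 for p in self.proxies}
        self.max_errors = max_errors

    def get(self) -> str:
        """Pick a random live proxy, to look like several different users."""
        return random.choice(self.proxies)

    def report_error(self, proxy: str) -> None:
        """Record a failure; retire the proxy once it fails too often."""
        self.errors[proxy] += 1
        if self.errors[proxy] >= self.max_errors and len(self.proxies) > 1:
            self.proxies.remove(proxy)
```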
And, frankly, I won.
The defenses kept up and I still received slowdowns and IP blocks, but they stopped changing the HTML of the site. The scraper continued to run just as I needed it to.
The Findings
My hypothesis was more or less correct at face value: much of the censored conversation on Chinese social media centered on fairly predictable political content. Democracy, protest, freedom of speech and political turmoil were all fairly heavily censored.
However, I was missing a piece of the bigger picture. The most censored topics weren’t the obviously political ones like democracy, freedom of speech or protest. They were social issues.
Around the time I was running this project, Chinese universities were undergoing a sort of microcosm of the #MeToo movement. High-profile sexual abuse scandals were coming to the fore at some of China’s more prestigious universities, including at least one case that drove the victim to suicide.
Topics surrounding China’s #MeToo were very heavily censored. I saw very few posts on the topic that stayed on the platform for more than an hour, and those were just the posts that were allowed to be published at all. Manual testing showed that certain keywords would prevent a post from being published in the first place, suggesting that the censorship I was detecting was only the tip of the iceberg.
China’s #MeToo movement and feminism more broadly were censored with particular intensity on Chinese social media. Culturally and politically, this makes sense: social movements like these are inherently liberatory in nature and thus are considered political by default. One cannot separate the social nature of, for example, Black Lives Matter from its political nature, even if the movement has less political or non-political facets as well. In a country that sees liberatory and progressive movements as an extreme threat, it makes quite a lot of sense that they would draw so much ire from the Chinese censorship apparatus.
Due to the gravity of the movement and the tragedy of its stifling by state and private censorship, it’s difficult to say I was “lucky” to have run the study when I did, but I can guarantee there were other social movements and topics I missed. Given the volume of data involved in studying censorship in China, and the fact that I was working solo, I can’t say what other dark secrets Chinese censors were hiding at the time, or are hiding even now.
The Lessons Learned
This was my first foray into web scraping at scale. Since then, I have launched a fairly successful course teaching web scraping with Python and I am currently working on a book on the topic as well, which is available for purchase in its work-in-progress form on LeanPub.
The lessons I learned from this project went far deeper than algorithms and codebases. I learned the power a single developer can have in using code for good. It set me on a lifelong path of liberatory research and development and pushed me into the career field I’m in today, information security. It strengthened my views on the importance of shining a light on the dark places in tech, on freedom of speech on the web and on our role in combating censorship and oppression, both online and in “meatspace.”
This is a fairly new newsletter, and as such it’s difficult to point to a central theme of the issues thus far, especially given my general disorganization. If I had to pick one, though, it would be liberatory technology. I believe it is our collective duty to use technology for good, and to use it to combat the things making the world worse. This project was the first of many steps I’ve taken toward that effort, and I plan on using my platform and expertise to push the world in the right direction.
I hope you’ll join me on this journey, even if just as a passive observer, by subscribing to this newsletter. I hope even more you’ll become a more active participant in this loose movement.