In my attempts to white-knuckle through the political root canal that was the American presidential election, I always went to as many different news sites as I could. If a story broke, I’d go to Fox News, CBS, Al Jazeera, Politico – even fake news sites like The Onion or CNN. Like so many flecks of glitter on a kindergarten rug, the media’s fake sparkle is all over the web. In traversing this path to something resembling true insight, or at least muted hyperbole, I noticed a curious trend in my Google recommendations and targeted ads. Somehow, in its majestic calculations, the now King and future God that is Google had erroneously concluded that I was a card-carrying, flag-waving Tea Party fanatic who despised Paul Ryan. How completely wrong could Google be? If its algorithm were a card player, it would be the equivalent of triumphantly yelling “Uno!” in the middle of a blackjack game.
I began to wonder about the implications of a “search history”. Each search is like a breadcrumb marking the path you took to attain knowledge and wisdom. It reveals what you love and what you hate, your triumphs and your insecurities. One should at least consider that every search term entered goes down in some permanent record of one kind or another. If I couldn’t really undo my search history, I could at least dump in a bunch of fake searches. Could I make a smoke screen? A white noise generator to render its data useless? If they’re going to follow my breadcrumbs, I might as well scatter them across the forest. Would randomly generated searches affect my Google recommendations, or is the big eye of Sergey Brin simply too smart?
My plan of attack was to write a script in Python that could sign in to my Google account, enter my password, and enter search terms into the search bar. It would select a link from the search results and visit that page. Finally, it would log the search term and selected link in a text file for reference after the fact. The script would put itself to sleep for a random amount of time and then repeat. It would do this nonstop until The Singularity! Could this be done without being identified as a bot?
After a lot of thought and even more googling (the irony, right!?), I put my tools together as follows:
- Jupyter Notebook: I had no serious experience with Chromedriver, and I knew I’d have to tackle the tedious task of manipulating the DOM. I could almost hear the army of errors charging down from atop the mighty learning curve above me. Being able to execute code line by line and see the result as quickly as possible was critical.
- Chromedriver: If you don’t know what it is, it’s basically the go-between for you and the Chrome browser you wish to automate. Think of yourself as the tough-guy gangster, and Chromedriver as the little guy who drives your car. You tell it exactly what to do and it does it. If you tell it to drop the gun and keep the cannoli, it does just that. We need this because we want to run searches night and day, nonstop, until we see some change in our Google recommendations. Chromedriver is what will enter the search terms and then drive the browser to a selected link. We are essentially telling our little guy to drive the car all over town and get people to think we’re in the back seat taking care of business.
- Tkinter: I knew I’d have to enter my Gmail username and password, and I didn’t want to hard-code them knowing I was going to share my code. It also became apparent that I’d need two different modes to run in: one to test and demo, and one that actually mimicked realistic time between searches and signing in. So I opted for a GUI that prompts the user to enter that info and decide which mode to run. Why Tkinter? Simply because of TheNewBoston. He’s got the very best, most time-efficient tutorials, and one of them was on Tkinter.
- Reddit: OK… so what fake searches are we going to feed into my search history? My first thought was to find the most-searched terms for that day, or hour, or whatever, and feed those terms back into Google. It seemed pretty sinister to pollute my search history with the most popular searches. However, there was no easy way to automate that without getting into extensive web scraping. Then I thought about using a dictionary and putting in arbitrary words, but I knew I needed the search terms to mimic human searches. Enter our forever-curious friends over at r/explainlikeimfive. Essentially, this is a place where people want very dumbed-down answers to their questions. Reddit has an API, and there’s PRAW, a Python Reddit API Wrapper.
- If at this point you are confused, I’d humbly suggest both Google and r/explainlikeimfive as highly useful resources.
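My script uses PRAW for the Reddit side of things, but the same data is reachable with no extra dependencies through Reddit’s public JSON feed. Here’s a rough, dependency-free sketch of that idea (the User-Agent string and function names are made up for illustration):

```python
import json
import urllib.request

def newest_titles(listing_json, limit=10):
    """Pull submission titles out of a Reddit 'new' listing (raw JSON text)."""
    data = json.loads(listing_json)
    return [child["data"]["title"] for child in data["data"]["children"][:limit]]

if __name__ == "__main__":
    # Reddit's public feed for a subreddit's newest posts
    url = "https://www.reddit.com/r/explainlikeimfive/new.json?limit=10"
    req = urllib.request.Request(url, headers={"User-Agent": "search-scrambler-demo"})
    with urllib.request.urlopen(req) as resp:
        for title in newest_titles(resp.read()):
            print(title)
```

PRAW wraps this same listing endpoint behind a friendlier interface (plus authentication and rate limiting), which is why it’s the better choice for a script that runs nonstop.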
So here we are with our critical ingredients. We have Google – tracking our every search. We have Chromedriver – our friend who Google can’t tell is alive or dead. And finally we have r/explainlikeimfive – an eternal source of searchable phrases to use as if our friend Chromedriver were an actual person. Before we know it, we’ll have a Weekend at Bernie’s adventure all to ourselves.
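To make the Chromedriver piece concrete, here’s a minimal sketch of driving a search through Selenium. This assumes the selenium package and a chromedriver binary are installed; the function name is illustrative, not my actual code (the import fallback just lets the search logic be read without selenium present):

```python
# Minimal Chromedriver sketch -- assumes selenium + a chromedriver binary.
try:
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
except ImportError:  # fallback so the pure search logic works without selenium
    webdriver = None
    class Keys:
        RETURN = "\n"

def run_search(driver, term):
    """Type a term into Google's search box and submit it."""
    box = driver.find_element("name", "q")  # Google's search input is named "q"
    box.clear()
    box.send_keys(term)
    box.send_keys(Keys.RETURN)

if __name__ == "__main__":
    driver = webdriver.Chrome()              # our little guy behind the wheel
    driver.get("https://www.google.com")
    run_search(driver, "how do magnets work")
    driver.quit()
```

From here, clicking the sign-in button and filling the password field is the same pattern: find the element, then send it clicks or keys.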
Here’s the pseudo code:
- Prompt user for account and password
- Let user select demo or human mode
- Register with the Reddit API (hardcoded in the script, but should be prompted)
- Start chromedriver and send it to google.com
- Find sign in button and click it
- Enter username and find “next” button and click it
- Find password field and enter in password
- Enter While True loop
- With the Reddit API, get the 10 newest submissions to r/explainlikeimfive
- Select one of the submissions.
- Chop off the first 5 letters of the submission title (the “ELI5:” prefix)
- Enter a for loop that puts each letter of the submission into the search bar, to mimic human typing
- After the search is executed, collect usable results into a list (there are tons of unusable results!)
- Randomly select url from list of usable results
- Enter that url and go to that webpage.
- Log all results
- Go to sleep for a random amount of time
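The middle of that loop – trimming the title, typing it out one character at a time, and picking a result – can be sketched like this. The `send_key` callback is a stand-in for Chromedriver’s `send_keys` on the search box, and the timing values are made up for illustration:

```python
import random
import time

def strip_prefix(title):
    """Drop the first five characters: ELI5 titles start with "ELI5:"."""
    return title[5:].lstrip()

def human_type(send_key, text, fast=False):
    """Feed a search term one character at a time to mimic human typing.
    send_key stands in for Chromedriver's send_keys on the search box."""
    for ch in text:
        send_key(ch)
        if not fast:                         # "human mode" pauses between keys
            time.sleep(random.uniform(0.05, 0.3))

def pick_result(usable_links):
    """Randomly select one URL from the list of usable search results."""
    return random.choice(usable_links)

# usage: collect keystrokes into a list instead of a real search box
typed = []
human_type(typed.append, strip_prefix("ELI5: Why is the sky blue?"), fast=True)
print("".join(typed))  # -> Why is the sky blue?
```

The random sleep at the end of each pass works the same way: `time.sleep(random.uniform(low, high))`, with the bounds depending on which mode is running.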
Below is a video of it running. Please note this is running in “demo mode”; the human mode would actually sign in to Google, conduct its searches much more slowly, and log each search and randomly selected link to a file rather than print it on screen.
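That demo/human switch comes from the Tkinter prompt mentioned earlier. A stripped-down version of that prompt might look like the sketch below – the window title, labels, and helper names are all made up for illustration (the import fallback just lets the non-GUI logic run where Tkinter isn’t available):

```python
try:
    import tkinter as tk
except ImportError:  # the pure settings logic still works without a GUI
    tk = None

def package_settings(username, password, demo_mode):
    """Bundle the GUI's answers into the form the main script expects."""
    return {"user": username, "pw": password,
            "mode": "demo" if demo_mode else "human"}

def prompt_user():
    """Tiny Tkinter form: username, password, and a demo/human checkbox."""
    root = tk.Tk()
    root.title("Search Scrambler")       # made-up title for this sketch
    user = tk.Entry(root)
    pw = tk.Entry(root, show="*")        # mask the password field
    demo = tk.BooleanVar(value=True)
    for widget in (tk.Label(root, text="Gmail username"), user,
                   tk.Label(root, text="Password"), pw,
                   tk.Checkbutton(root, text="Demo mode", variable=demo)):
        widget.pack()
    result = {}
    def submit():
        result.update(package_settings(user.get(), pw.get(), demo.get()))
        root.destroy()
    tk.Button(root, text="Start", command=submit).pack()
    root.mainloop()
    return result
```

Keeping the credentials in a runtime prompt like this is what makes the code safe to share: nothing sensitive ever lands in the script itself.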
Making it was an amazing amount of fun. One thing that really impressed me was how quickly Google indexes Reddit. I found it’s only a matter of minutes from the time a post is submitted to Reddit until Google has it indexed as a search result – thus the need to grab new submissions. You could even alter this to measure which search engine indexes new content the fastest.
You can view the code and an example of the text file with the logged results on my github here.
In a future post, I’ll discuss the results and see what effect it had on my recommendations and ads. Or I may accidentally piss off Google and be the first person they ban.