by Andrew Oram
American Reporter Correspondent
January 18, 2010
TRACKING YOU THROUGH THE WILD INTERNET
CAMBRIDGE, Mass. -- Editor's Note: After the Introduction, this is the third of seven parts of an exclusive, 9,000-word series on Identity & The Internet by American Reporter Webmaster Andy Oram.
Thy self thou gav'st, thy own worth then not knowing
Voracious data foraging leads advertisers along two paths. One of their aims is to differentiate you from other people. If vendors know what condiments you put in your lunch or what material you like your boots made from, they can pinpoint their ads and promotions more precisely at you. That's why they love it when you volunteer that information on your blog or social network, just as the college development staff we examined earlier do.
The companies' second aim is to insert you into a group of people for which they can design a unified marketing campaign. That is, in addition to differentiation, they want demographics.
The first aim, differentiation, is fairly easy to understand. Imagine you are browsing Websites about colic. An observer (and I'll discuss in a moment how observations take place) can file away the reasonable deduction that there is a baby in your life, and can load your browser window with ads for diapers and formula. This is called behavioral advertising.
Since behavioral advertising is normally a pretty smooth operator, you may find it fun to try a little experiment that could lift the curtain on it a bit. Hand your computer over for a few hours to a friend or family member who differs from you a great deal in interests, age, gender, or other traits. (Choose somebody you trust, of course.) Let him or her browse the Web and carry on his or her normal business. When you return and resume your own regular activities, check the ads in your browser windows, which will probably take on a slant you never saw before. Of course, the marketers reading this article will be annoyed that I asked you to pollute their data this way.
Experiences like this might make you conscious of every online twitch and scratch, just as you may feel in real life in the presence of a security guard whose suspicion you've aroused, or when on stage, or when just being a normal teenager. Online, paranoia is level-headedness. Someone indeed is collecting everything they can about you: the amount of time you spend on one page before moving on to the next, the links you click on, the search terms you enter. But it's all being collected by a computer, and no human eyes are ever likely to gaze upon it.
Your identity in the computerized eyes of the advertiser is a strange pastiche of events from your past. As mentioned at the beginning of the article, Google's Dashboard lets you see what Google knows about you, and even remove items - an impressive concession for a company that has mastered better than any other how to collect information on casual Web users and build a business on it. Of course, you have to establish an identity with them before you can check what they know about your identity. This is not the last irony we'll encounter when exploring identity.
But advertisers do more than direct targeting, and I actually find the other path their tracking takes - demographic analysis - more problematic. Let's return to the colicky baby example. Advertisers add you to their collection of known (or assumed) baby caretakers and tag your record with related information to help them understand the general category of "baby care." Anything they know about your age, income, and other traits helps them understand modern parenting.
As I wrote over a decade ago, this kind of data mining typecasts us and encourages us to head down well-worn paths. Unlike differentiation, demographics affect you whether or not you play the game. Even if you don't go online, the activities of other people like you determine how companies judge your needs.
The latest stage in the evolution of demographic data mining is sentiment analysis, which trawls through social networking messages to measure the pulse of the public on some issue chosen by the researcher. A crude application of sentiment analysis is to search for "love" or "hate" followed by a product trademark, but the natural language processing can become amazingly subtle. Once the data is parsed, companies can track, for instance, the immediate reaction to a product release, and then how that reaction changed after a review or ad was widely disseminated. Results affect not only advertising but product development.
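To make the crude approach concrete, here is a minimal sketch in Python. The product name "Snowglide", the sample messages, and the function name crude_sentiment are all invented for illustration, and real sentiment analysis relies on far subtler language processing than a pattern match.

```python
import re
from collections import Counter


def crude_sentiment(messages, product):
    """Count how often 'love' or 'hate' is directly followed by the product name.

    This is only the crude keyword search described in the text,
    not real natural language processing.
    """
    pattern = re.compile(
        r"\b(love|hate)\s+" + re.escape(product) + r"\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for message in messages:
        for match in pattern.finditer(message):
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical status messages about a made-up product.
sample_messages = [
    "I love Snowglide!",
    "hate snowglide so much",
    "I love skiing",  # no product name, so it is not counted
    "Love Snowglide, hate Snowglide bindings",
]
tally = crude_sentiment(sample_messages, "Snowglide")
print(tally["love"], tally["hate"])
```

Running the same count on messages posted before and after a review appears would give a rough version of the reaction tracking mentioned above.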
Once again, my reaction to sentiment analysis mixes respect for its technical sophistication with worries about what it does to our independence. If you add your voice to the Twittersphere, it may be used by people you'll never know to draw far-reaching conclusions. On the other hand, if you refuse to participate, your opinion will be lost.
Google's Dashboard tells you only what they preserve on you personally, not the aggregated statistics they calculate that presumably include anonymous browsing. But you can peek at those as well, and carry on some rough sentiment analysis of your own, through Google Trends.
Considering all this demographic analysis (behavioral, sentiment, and other) catapults me into a bit of a 21st-Century-style existential crisis. If a marketer is able to combine facts about my age, income, place of birth, and purchases to accurately predict that I'll want a particular song or piece of clothing, how can I flaunt my identity as an autonomous individual?
Perhaps we should resolve to face the brave new world stoically and help the companies pursue their goals. Social networking sites are developing APIs and standards that allow you to copy information easily between them. For instance, there are sites that let you simultaneously post the same message instantly to both Twitter and Facebook. I think we should all step up and use these services. After all, if your off-the-cuff Tweet about your skis from the lounge of a ski resort goes into planning a multimillion-dollar campaign, wouldn't it be irresponsible to send the advertiser mixed messages?
My call to action sounds silly, of course, because the data gathering and analysis will obviously not be swayed by a single Tweet. In fact, sophisticated forms of data mining depend on the recent surge of new members joining the forums where the information is collected. The volume of status messages has to be so high that idiosyncrasies get ironed out. And companies must also trust that the margin of error caused by malicious competitors or other actors will be negligible.
We saw in an earlier section that your online presence is signaled by a slim swath of information. At the low end, marketers know only your approximate location through your IP address. At the other extreme they can feast on the data provided by someone who not only logs into a site - creating a persistent identity - but fills out a form with demographic information (which the vendor hopes is truthful).
Most browsing takes place in an identity zone lying between the IP address and the filled-out profile. We saw this zone in my earlier example from the coffee shop. The visitor does not identify himself, but lets the browser accept a cookie by default from each site.
Each cookie - so long as you don't take action to remove one, as I did in my experiment - is returned to the server that left it on your browser. If you use a different browser, the server doesn't know you're the same person, and if a family member uses your browser to visit the same server, it doesn't know you're different people.
Because the browser returns the cookie only to servers from the same domain - say, yahoo.com - that sent the cookie, your identity is automatically segmented. Whatever yahoo.com knows about you, oreilly.com and google.com do not. Servers can also subdivide domains, so that mail.yahoo.com can use the cookie to keep track of your preferred mail settings while weather.yahoo.com serves meteorological information appropriate for your location.
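The segmentation can be sketched in a few lines of Python. This is a deliberately simplified model, not how any particular browser is implemented: the class TinyCookieJar, its methods, and the cookie names and values are assumptions made up for this example, and real browsers apply further rules (paths, expiry, security flags) that are left out.

```python
def domain_matches(host, cookie_domain):
    """Return True if a cookie set for cookie_domain may be sent to host.

    Simplified rule: the host must equal the cookie's domain
    or be a subdomain of it.
    """
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    return host == cookie_domain or host.endswith("." + cookie_domain)


class TinyCookieJar:
    """A toy stand-in for the browser's cookie store (hypothetical)."""

    def __init__(self):
        self._cookies = []  # list of (domain, name, value) tuples

    def store(self, domain, name, value):
        self._cookies.append((domain, name, value))

    def cookies_for(self, host):
        """Collect only the cookies whose domain matches the host."""
        return {
            name: value
            for domain, name, value in self._cookies
            if domain_matches(host, domain)
        }


jar = TinyCookieJar()
jar.store("yahoo.com", "visitor_id", "abc123")          # shared across yahoo.com
jar.store("mail.yahoo.com", "mail_layout", "compact")   # only for the mail subdomain

print(jar.cookies_for("mail.yahoo.com"))
print(jar.cookies_for("weather.yahoo.com"))
print(jar.cookies_for("oreilly.com"))
```

In this model the mail server sees both cookies, the weather server sees only the shared one, and oreilly.com sees nothing at all.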
This wall between cookies would seem to protect your browsing and purchasing habits from being dumped into a large vat and served up to advertisers. But for every technical measure protecting privacy, there is another technical trick that clever companies can use to breach privacy. In the case of cookies, the trick exploits the ability of a Web page to display content from multiple domains simultaneously. Such flexibility in serving domains is normally used (aside from tweaks to improve performance) to embed images from one domain in a Web page sent by another, and in particular to embed advertising images.
Now, if advertisers all contract with a single ad agency, such as DoubleClick (the biggest of the online ad companies), all the ads from different vendors are served under the doubleclick.com domain and can retrieve the same cookie. You don't have to click on an ad for the cookie to be returned. Furthermore, each ad knows the page on which it was displayed.
Therefore, if you visit Web pages about colic, skis, and Internet privacy at various times, and if DoubleClick shows an ad on each page, it can tell that the same person viewed those disparate topics and use that information to choose ads for future pages you visit. In the United States, unlike other countries, no laws prohibit DoubleClick from sharing that information with anyone it wants. Furthermore, each advertiser knows whether you click on their ad and what activity you carry on subsequently at their site, including any purchases you make and any personal information you fill out in a form.
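The bookkeeping behind this trick is simple enough to sketch in Python. Nothing here reflects DoubleClick's actual systems: the AdNetwork class, the "visitor-N" cookie format, and the example page addresses are all assumptions invented to show how one cookie, returned with every ad request, ties unrelated pages together.

```python
import itertools


class AdNetwork:
    """A toy ad server that serves ads on many unrelated sites (hypothetical)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._history = {}  # cookie value -> list of pages where an ad was shown

    def serve_ad(self, cookie, page):
        """Handle one ad request.

        cookie is the value the browser sent back (None on a first visit);
        page is the page embedding the ad. Returns the cookie value the
        browser should keep for next time.
        """
        if cookie is None:
            cookie = f"visitor-{next(self._next_id)}"
        self._history.setdefault(cookie, []).append(page)
        return cookie

    def profile(self, cookie):
        """Every page on which this cookie has been seen."""
        return list(self._history.get(cookie, []))


network = AdNetwork()

# One browser visits three unrelated sites that all embed the network's ads.
cookie = network.serve_ad(None, "parenting.example/colic")
cookie = network.serve_ad(cookie, "sports.example/skis")
cookie = network.serve_ad(cookie, "news.example/internet-privacy")

print(network.profile(cookie))
```

No ad was ever clicked in this sketch; merely displaying the ads was enough to build the list.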
Put it all together, and you are probably far from anonymous on the Internet. In addition, a more recent form of persistent data, controlled by the popular Flash environment through a technology called local shared objects, makes promiscuous sharing easy and removing the information much harder.
The purchase of DoubleClick in 2007 by Google, which already had more information on individuals than anybody else, spurred a great protest from the privacy community, and the FTC took a hard look before approving the merger. A similar controversy may surround Google's recently announced purchase of AdMob, which provides a service similar to DoubleClick for advertisers on mobile phones.
So far I've just covered everyday corporate treatment of Web browsing and e-commerce. The frontiers of data mining extend far into the rich veins of user content.
Deep packet inspection allows your Internet provider to snoop on your traffic. Normally, the ISP is supposed to look only at the IP address on each packet, but some ISPs check inside the packet's content for various reasons that could redound to your benefit (if it squelches a computer virus) or detriment (if it truncates a file-sharing session). I haven't heard of any ISPs using this kind of inspection for marketing, but many predictions have been aired that we'll cross that frontier.
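The difference between ordinary routing and deep packet inspection can be illustrated with a short Python sketch. The Packet class, the byte strings in SIGNATURES, and the action names are placeholders invented for this example; real inspection equipment is far more elaborate, and no actual virus or file-sharing signature is shown.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    payload: bytes


# Made-up markers standing in for whatever patterns an ISP might look for.
SIGNATURES = {
    b"EXAMPLE-VIRUS-MARKER": "drop",
    b"EXAMPLE-FILESHARE-HANDSHAKE": "throttle",
}


def route_normally(packet):
    """Ordinary handling: only the destination address is examined."""
    return f"forward to {packet.dst_ip}"


def inspect_deeply(packet):
    """Deep packet inspection: the content itself is scanned before routing."""
    for signature, action in SIGNATURES.items():
        if signature in packet.payload:
            return action
    return route_normally(packet)


harmless = Packet("192.0.2.10", "198.51.100.7", b"GET /index.html")
sharing = Packet("192.0.2.10", "198.51.100.7", b"EXAMPLE-FILESHARE-HANDSHAKE...")

print(route_normally(sharing))   # the address alone reveals nothing unusual
print(inspect_deeply(harmless))
print(inspect_deeply(sharing))
```

The same scan that blocks the virus marker also singles out the file-sharing session, which is why the technique can cut both ways.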
Governments have been snooping at the hubs that route Internet traffic for years. China simply blocks references to domains, IP addresses, or topics it finds dangerous, and monitors individuals for other suspected behavior. The Bush Administration and American telephone companies got into hot water for collecting large gobs of traffic without a court order. But for years before that, the Echelon project was filtering all international traffic that entered or left the United States and several of its allies.
One alternative to being tossed on the waves of marketing is to join the experiments in Vendor Relationship Management (VRM), which I covered in a recent blog post. Although not really implemented anywhere yet, this movement holds out the promise that we can put out bids for what we want and get back proposals for products and services. Later, perhaps, we can charge advertisers for providing detailed personal information about ourselves. Maybe VRM will make us devote more conscious thinking to how we present ourselves online - and how many selves we want to present. These are the subjects of the next section.
Next: You Are Who You Say You Are - If You Say So