Odd Things on The Internet

Do you ever wonder how it is that you are browsing https://www.oilsandsmagazine.com, and a few seconds later, as you browse Amazon for McGregor socks, Amazon shows you an ad for a Caterpillar 797 Heavy Hauler?

Who doesn’t want a Cat 797? But this is a little creepy.

Sometimes the industry attempts to bamboozle you by explaining that such events are just bizarre little coincidences, enabled by the magic of machine learning (ML) or artificial intelligence. The claim is that they are not tracking you; they are tracking a group of people who exhibit similar actions and interests, and ML can predict your next action or interest from those of the whole group. In the simple case, if the group is searching for morning sickness and sudden infant death syndrome, then the ML will predict that you will soon be searching for diapers, car seats and baby clothes.
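To make the group-based claim concrete, here is a minimal sketch of that kind of prediction. It is not any vendor's actual model; the function name predict_next and the sample search histories are made up for illustration. It simply counts what people who share one of your searches also searched for.

```python
from collections import Counter


def predict_next(my_searches, group_histories, top_n=3):
    """Suggest terms that people with overlapping searches also looked for."""
    counts = Counter()
    for history in group_histories:
        if my_searches & history:                  # this person resembles me
            counts.update(history - my_searches)   # count only what is new to me
    return [term for term, _ in counts.most_common(top_n)]


# Invented search histories, one set per (anonymous) group member.
group = [
    {"morning sickness", "diapers", "car seats"},
    {"morning sickness", "diapers"},
    {"oil sands", "haul trucks"},
]

print(predict_next({"morning sickness"}, group, top_n=1))
```

Note that nothing here needs to know who you are; it only needs your recent searches and the group's.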

While ML can and does work this way, and is used to make surprisingly good suggestions for video games, movies, music, and in the case of Amazon, things you want but didn’t know you wanted, this isn’t how the 797 scenario works.

They are tracking you. But who are they?

Let’s say we want to set up an ad serving business. We’d need to build an ingestion pipeline, so ads could be submitted for sale. The UX / APIs would likely capture the desired demographic, and the prices advertisers are willing to pay for better and better demographic targeting. In the most sophisticated systems, these things would be auctioned in real time to the highest bidder.
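As a rough sketch of the auction idea, assuming a simple highest-price-wins rule (real exchanges are more elaborate), the code below picks the best-paying ad whose requested demographic matches what we know about the viewer. The Bid class, the advertiser names and the prices are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bid:
    advertiser: str
    price_per_impression: float                  # what the advertiser offers to pay
    wanted: dict = field(default_factory=dict)   # desired demographic traits


def matches(bid: Bid, viewer: dict) -> bool:
    """A bid is eligible only if every trait it asks for is known and matches."""
    return all(viewer.get(k) == v for k, v in bid.wanted.items())


def run_auction(bids: list, viewer: dict) -> Optional[Bid]:
    """Return the highest-priced eligible bid, or None if nothing matches."""
    eligible = [b for b in bids if matches(b, viewer)]
    return max(eligible, key=lambda b: b.price_per_impression, default=None)


bids = [
    Bid("sock_shop", 0.002),                             # untargeted, cheap
    Bid("heavy_hauler", 0.050, {"interest": "mining"}),  # targeted, pricey
    Bid("baby_store", 0.030, {"interest": "parenting"}),
]

print(run_auction(bids, {}).advertiser)                      # viewer we know nothing about
print(run_auction(bids, {"interest": "mining"}).advertiser)  # viewer we know something about
```

The unknown viewer only qualifies for the cheap untargeted ad, which is the whole economic argument for learning more about viewers.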

Once we have an ad inventory for sale, we’d have to find places and people willing to serve the ads. The easiest way to accomplish this is to pay people for serving ads in web pages and apps (usually mobile apps, since neither Windows nor macOS relies much on ads for monetization). We’d have to build a billing system that registers views (“impressions”) and clicks.

Then we’d need to build a visual ad serving component, one for each of the browsers and important mobile app platforms. All this thing has to do is reliably display an ad, and record whether it was in fact displayed, whether the customer clicked on it, or how far they watched in the case of video, in order to bill the ad supplier the correct amount.
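The billing side of that component might look something like the sketch below. The event names, the price list and the BillingLedger class are invented for illustration; the point is only that each impression, click or second of video watched becomes a line item for the ad supplier.

```python
from collections import defaultdict

# Hypothetical price list: the names and numbers are made up for illustration.
PRICE = {"impression": 0.002, "click": 0.25}
PRICE_PER_SECOND_WATCHED = 0.001


class BillingLedger:
    def __init__(self):
        self.owed = defaultdict(float)   # advertiser -> amount owed

    def record(self, advertiser, event, seconds_watched=0.0):
        """Record one telemetry event reported by the ad component."""
        if event == "video":
            self.owed[advertiser] += seconds_watched * PRICE_PER_SECOND_WATCHED
        else:
            self.owed[advertiser] += PRICE[event]   # KeyError on unknown events


ledger = BillingLedger()
ledger.record("heavy_hauler", "impression")
ledger.record("heavy_hauler", "click")
ledger.record("heavy_hauler", "video", seconds_watched=12)
print(round(ledger.owed["heavy_hauler"], 3))
```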

At this point we’d be in business, but not likely profitable. Our ads wouldn’t fetch a big price, because we are not telling the ad supplier much about the target customer, nor are the ads targeted at people likely to be interested.

What we need next is the ability to learn about individual customers.

That’s where trackers come in. A tracker is a bit of code that lives in the browser or a mobile app, continuously watches what you do, and uploads it to the tracker / ad network for storage. There are only two hard problems in building a tracker network. The first is reliably identifying each unique customer, and the second is incenting web page owners to install your tracker. The latter problem is easy: just share the revenue with them.

Identity is a little bit harder. Unless you are visiting a site where a login is required, the web server only gets an IP address and a bit of information about your operating environment (browser + version, operating system + version, CPU, etc.). Seemingly small bananas, privacy-wise.

Without going into the details here, most trackers can derive enough information about you and your computer to uniquely fingerprint you. This means that they can assign a stable identifier to you and your computer, such that subsequent browsing / app usage on the same computer can be correctly attributed to you. They assign you an identifier; let’s call it a somewhat_annonomous_id. Mine might be 4656879900786 (on one computer).
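A toy version of the fingerprinting idea, under heavy assumptions: real trackers combine many more signals and deal with attributes that drift over time, and the attribute names below are just examples. The sketch hashes whatever the environment reveals into a number, so the same computer yields the same somewhat_annonomous_id on every visit without anything being stored on it.

```python
import hashlib


def somewhat_annonomous_id(attributes: dict) -> int:
    """Derive a stable numeric id from whatever the environment reveals."""
    # Sort keys so the same attributes always serialize the same way.
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return int(digest[:12], 16)   # keep 48 bits; plenty for a sketch


# Invented attribute values, for illustration only.
my_laptop = {
    "user_agent": "ExampleBrowser/1.0",
    "os": "ExampleOS 11",
    "screen": "2560x1440",
    "timezone": "America/Edmonton",
}

print(somewhat_annonomous_id(my_laptop))
# Any code that sees the same attributes computes the same id.
print(somewhat_annonomous_id(dict(reversed(list(my_laptop.items())))))
```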

Now let’s say we’ve got our tracker and our ad serving component running in lots of pages. As I travel around the internet, the tracker follows me, uploads lots of personal information, and stores it under my somewhat_annonomous_id, 4656879900786. When I browse to a page, or app, that has our ad serving component installed, the ad component generates the exact same somewhat_annonomous_id, 4656879900786, that the trackers have been generating. It asks the service to find the ad with the best match / highest impression price. The ad is served, and telemetry monitors click-through, video watch time, etc. to derive the final price.
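Here is a small sketch of how the two halves meet. The in-memory profiles dictionary stands in for the tracker network's storage, and the function names and ad list are hypothetical. What matters is that the tracker and the ad component never talk to each other directly; they just arrive at the same key.

```python
from collections import defaultdict

profiles = defaultdict(set)   # somewhat_annonomous_id -> interests seen so far


def tracker_upload(visitor_id, page_topic):
    """Called by the tracker on every page that carries it."""
    profiles[visitor_id].add(page_topic)


def pick_ad(visitor_id, ads):
    """ads holds (name, target_topic_or_None, price); return the best-paying match."""
    interests = profiles.get(visitor_id, set())
    eligible = [(price, name) for name, topic, price in ads
                if topic is None or topic in interests]
    return max(eligible, default=(0.0, None))[1]


me = 4656879900786
tracker_upload(me, "oil sands")          # a page with the tracker installed

ads = [("socks", None, 0.002), ("cat_797", "oil sands", 0.05)]
print(pick_ad(me, ads))                  # a different page, same id
print(pick_ad(1234567890123, ads))       # a visitor with no profile yet
```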

In theory, you see the creepy Caterpillar ad and your privacy is protected. The seller of the ad, the ad serving system, and the page owner who served the ad all don’t know who 4656879900786 is.

Or do they?

The next problem to be solved is to create a stable identifier that can span devices. The easiest, but not the only, way to do this is to do a cloud JOIN across multiple telemetry streams, correlating information likely to identify a unique person across devices.

Imagine that you have search records for rental cars, hotels and flights to the Balearic island of Ibiza, with times. And you have, via a somewhat_annonomous_id, credit card history (yes, banks sell this). And maybe some mapping searches, which of course include location information. It’s not too hard to look across all of these sources, all with different somewhat_annonomous_ids, and conclude that this is the same person. They can then be assigned a master_annonomous_id, which can be associated with all the somewhat_annonomous_ids. This process can be done in batch jobs, and tuned over time as new information arrives, but once the association is established, it can be used in real time.
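The batch JOIN could be sketched like this, assuming each telemetry stream has already been boiled down to coarse (place, day) signals. The ids, places, dates and the "share at least two signals" rule are all invented for illustration; a real system would use far fuzzier matching. Ids that share enough signals are merged into one group, and the group's representative plays the role of the master_annonomous_id.

```python
from collections import defaultdict
from itertools import combinations

# (somewhat_annonomous_id, place, day); every value here is made up.
events = [
    (111, "Ibiza", "06-01"), (111, "Palma", "06-03"),   # laptop searches
    (222, "Ibiza", "06-01"), (222, "Palma", "06-03"),   # credit card history
    (333, "Ibiza", "06-01"), (333, "Palma", "06-03"),   # phone mapping searches
    (444, "Ibiza", "06-01"),                            # someone else entirely
]


def link_ids(events, min_shared=2):
    """Map each somewhat_annonomous_id to a master id shared by its likely owner."""
    keys_by_id = defaultdict(set)
    for sid, place, day in events:
        keys_by_id[sid].add((place, day))

    parent = {sid: sid for sid in keys_by_id}   # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    for a, b in combinations(sorted(keys_by_id), 2):
        if len(keys_by_id[a] & keys_by_id[b]) >= min_shared:
            parent[find(b)] = find(a)           # merge the two groups

    return {sid: find(sid) for sid in keys_by_id}


print(link_ids(events))
```

One shared signal (everyone who looked at Ibiza that day) is not enough to merge two ids; agreement on several is what gives the person away.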

The final step in this process is to JOIN the master_annonomous_id with an actual human by name. The easiest way to do this is to incent customers to provide their identity by offering something of value in return. Stored passwords, form-fill, credit cards, checkout by …, login with facebook … are valuable things, often requiring a login. Email, Twitter, Instagram, cloud storage all require a login. In theory, this is all customer-friendly and private.

But it gets a bit janky.

Once a password is stored in the browser, it is available for use by code issued from the same source that caches it in the first place. That code can run without your consent, and without actually even generating user experience. There are lots of valid reasons for this, but it also opens a door for trusted code to behave as a tracker with full credentials.

None of the above assumes anything particularly nefarious (other than the invisible cloud JOINs), and all of it has valid business and customer reasons to exist.

The facebook graph API was believed to upload information on pages where it was enabled, but not even used. I have no way to prove this was happening, but it is technically possible.

Google has direct access to all of your searches. They have a little more visibility into your behavior via ad “referrals”. Facebook has all of the content you knowingly uploaded. What is exceptionally valuable to both of these companies (and Amazon), is knowledge of everything else you do, across the web and within applications. Something they don’t get without intentional effort.

Long ago, Google released a free analytics suite called Google Analytics. It’s very good. Of course, to work, it has to upload detailed click stream data. The value to Google is visibility into the rest of people’s browsing patterns post search, at least at the somewhat_annonomous_id level. Google really wants you to log in and store your credentials. So badly that they will offer substantial amounts of free services in exchange for this.

Looking at Ghostery, I see that msnbc.com loads 15 trackers; Disney, 10; National Review, 25; Slate, 30.

There seem to be more trackers than ad networks, so perhaps there is a business simply selling profile information to ad companies.

Fortunately, there’s government to rescue us from nefarious business. About 3 years ago the EU passed a law that required explicit user consent to use browser cookies.

That resulted in billions of clicks on cookie consent banners.

And it accomplished nothing. Trackers have not used cookies to fingerprint for years now. Clearing your cookies improves privacy only against the least sophisticated players. The security model of cookies makes the Caterpillar scenario difficult or impossible.

The EU’s GDPR law has a bit more teeth against facebook, Google and Amazon, but still, not much. I spent 6 minutes on facebook trying to find out how to delete all of my information, as is required by GDPR. I didn’t find it. Perhaps they have argued that it is all necessary for the experience, and is exempt. Deleting it would paralyze my facebook experience. Losing my order history on Amazon would be inconvenient. And all the tracker and ad network information is basically invisible, anyway. How would one even begin to erase all the tracker data?

Hypothetically, if governments passed a law effectively blocking trackers, all of the advantage would shift to facebook, Google, and to a lesser extent Amazon. They are big enough, and have enough reach, that they are their own tracker network.


Everything in this post is knowable from published ML/AI research and other public sources.