Brewster Kahle has been at this a long time.
Consider the photo above evidence. (And yes, children, computer monitors were once the size of a mini-fridge.) It was taken by internet legend (and open records hero) Carl Malamud in December 1991, when he was reporting out what would become Exploring the Internet: A Technical Travelogue, which aimed to put some faces to the burbling sense that something exciting was happening with connected computers.
These were still early days online — only four months after Tim Berners-Lee mentioned his “WorldWideWeb” project for the first time in the newsgroup alt.hypertext. The first version of Netscape was still three years away. And there was Kahle, just 31, but already with a stuffed resume: researcher at MIT’s AI Lab, lead engineer at supercomputer maker Thinking Machines, lead developer of WAIS (Wide Area Information Server), something like an alpha version of what the web would become.
“After delving into the arcana of message-passing protocols for massively parallel processors,” Malamud wrote, “Brewster turned his attention to the much more difficult problem of finding and using information on networks.”
Brewster ushered me into his office, where he sat down on a beat-up old easy chair and balanced a keyboard on his lap. The screen and rollerball mouse were conveniently nearby, making this a highly comfortable work or play station. There was no need to start up his WAIS client since it was already up and running. Deployed for only a few months on the Internet, WAIS was quickly becoming a part of people’s routines, and had certainly been integrated into Brewster’s daily work.

Brewster typed in a query: “Is there any information about Biology?” The query was sent, in its entirety, to the server of servers that Brewster maintained, quake.think.com. Servers of servers were no different than document servers; they simply kept a list of other servers and a description of the information they maintained.
We got back a list of servers throughout the world that had information on biology, such as a database of 981 metabolic intermediate compounds maintained in the Netherlands. At this point, we refined our query and sent it out to many servers simply by pointing to them on the screen. Servers returned lists of document descriptions; pointing to those documents retrieved the full text.
Brewster’s goal was to enable anybody with a computer, even a lowly PC, to become a publisher. The first PC-based WAIS server had recently gone online, running in somebody’s basement, and Brewster was quite excited by the prospect.

Brewster’s interest in publishing was personal as well as professional. His fiancée ran a printing museum and in the basement was an old printing press.1
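If it helps to see the mechanics of that demo, here is a minimal sketch in Python of the two-level lookup Malamud describes. It is illustrative only: the real WAIS spoke a protocol based on Z39.50 over the network, and every host name and record below is invented.

```python
# A toy model of the lookup in Malamud's demo -- not the real WAIS
# protocol. Hosts, descriptions, and documents are all made up.

# Document servers: each maps a document title to its full text.
DOC_SERVERS = {
    "bio-compounds.nl": {
        "Metabolic intermediate compounds": "Records for 981 compounds ...",
    },
    "seq-data.example": {
        "Nucleotide sequence summaries": "Summaries of sequence entries ...",
    },
}

# The "server of servers" (quake.think.com in the demo) keeps only a
# list of servers and a description of what each one holds.
DIRECTORY = {
    "bio-compounds.nl": "biology: metabolic intermediate compounds",
    "seq-data.example": "biology: nucleotide and protein sequences",
}

def find_servers(query):
    """Step 1: the full query goes to the server of servers, which
    returns the servers whose descriptions match any query word."""
    words = query.lower().split()
    return [host for host, desc in DIRECTORY.items()
            if any(w in desc.lower() for w in words)]

def find_documents(host, query):
    """Step 2: a refined query goes to each chosen server, which
    returns document descriptions (titles, in this toy version)."""
    words = query.lower().split()
    return [title for title in DOC_SERVERS[host]
            if any(w in title.lower() for w in words)]

def retrieve(host, title):
    """Step 3: 'pointing to' a document retrieves its full text."""
    return DOC_SERVERS[host][title]

# "Is there any information about Biology?"
for host in find_servers("is there any information about biology"):
    for title in find_documents(host, "metabolic compounds"):
        print(host, "->", title, "->", retrieve(host, title))
```

In the real system each of those steps was a network round trip, which is why the directory design mattered: even a lowly PC could join, because it only had to answer queries about its own documents.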
That’s how someone who started out in AI and microchip design ended up being the internet’s librarian.
In 1996, Kahle founded the Internet Archive, which stands alongside Wikipedia as one of the great not-for-profit knowledge-enhancing creations of modern digital technology. You may know it best for the Wayback Machine, its now quarter-century-old tool for deriving some sort of permanent record from the inherently transient medium of the web. (It’s collected 668 billion web pages so far.) But its ambitions extend far beyond that: it has built a free-to-all library of 38 million books and documents, 14 million audio recordings, 7 million videos, and more. (Malamud’s book is, of course, among them.)

That work has not been without controversy, but it’s an enormous public service — not least to journalists, who rely on it for reporting every day. (Not to mention the Wayback Machine is often the only place to find the first two decades of web-based journalism, most of which has been wiped away from its original URLs.)
A little while back, the Internet Archive celebrated its 25th birthday, and I used that as an excuse to chat with Kahle about how his vision for it had changed along with the internet it tries to preserve in amber — and about why there is still so much human knowledge locked away on microfilm. Here are some bits of our conversation, lightly edited to make me sound more coherent on Zoom calls.
I would have thought that libraries would have just digitized all their books, and that they would have followed the same course as with the digitization of the card catalog. People went and copied their physical card catalogs into software that was running on their machines.
But what really happened was, you know, not as much. We had the Million Books Project.2 We were digitizing away. But then Google Books came along and said, “We’ll take it all.” And that was a complete surprise. And then some people said, “We’ll get the books scanned, but we’ll only share it among ourselves.” That was HathiTrust.3 That I found not that encouraging, in terms of public-spiritedness and the opportunity of the internet to make it available to anybody, anywhere. You know, let’s break open the walls of academia!
There was this guy, Binkley — I really loved Binkley.4 I really wanted to learn more about him. In the 1930s, he was a thinker and a promoter of microfilm — but microfilm as a mechanism of distributing knowledge, specifically to rural populations, to break the city elite. He thought that this was a way of democratizing knowledge.
It turned out that instead, you know, they microfilmed things and mostly kept it just for themselves.
Eugene Power started University Microfilms, and Binkley had this dream of microfilm playing a different role. And basically, Eugene Power won — Binkley died. And we ended up with it being a corporation, which then got bought and bought and bought and bought again. And then they think that, if you want to move something to the next medium, you need to go back and get a new license. That transaction cost is so high, right? You don’t do it very often. So things get left behind because of this idea of licensing.
Being a resource for journalists has been a major goal of ours. We’ve got an internal Slack channel that uses Google Alerts to find uses of the Wayback Machine in news stories, and they come in all the time. I actually find that a useful stream of news to read, because it indicates that the journalist has done some work.
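A pipeline like the one Kahle describes is simple to wire up. Here is a sketch, assuming Google Alerts publishing to an RSS feed and a Slack incoming webhook; the feed and webhook URLs are placeholders, not the Archive’s actual setup, which isn’t public.

```python
# Hypothetical relay: poll a Google Alerts RSS feed and post items that
# mention the Wayback Machine to a Slack channel. Both URLs below are
# placeholders for illustration only.
import feedparser  # pip install feedparser
import requests    # pip install requests

ALERT_FEED = "https://www.google.com/alerts/feeds/<alert-id>"      # placeholder
SLACK_WEBHOOK = "https://hooks.slack.com/services/<webhook-path>"  # placeholder

def relay_alerts():
    feed = feedparser.parse(ALERT_FEED)
    for entry in feed.entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        if "wayback machine" in text:
            requests.post(SLACK_WEBHOOK, json={
                "text": f"Wayback Machine cited: {entry.title} {entry.link}"
            })

if __name__ == "__main__":
    relay_alerts()
```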
We’d been archiving television, and then we wanted to make it available. And so we tried to make it so you could search on what people said and then make clips. And it didn’t happen as much as I thought it would.
So those are tools that we’ve helped make that are useful to journalists. Then there’s trying to archive news. And we’ve really done a lot of work to try to make sure that we capture news from around the world. What’s becoming really tricky now is paywalls and robot traps. Newspapers are working very hard to make sure that people don’t crawl them, and they’re employing more and more sophisticated tools to do it.
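The baseline of that negotiation is still robots.txt; the paywalls and traps Kahle mentions sit on top of it. A minimal check with Python’s standard library, where the newspaper domain and user-agent string are hypothetical (the Archive’s own crawling is built on heavier tools, like its Heritrix crawler):

```python
# Minimal robots.txt check -- the first gate any archival crawler passes.
# The domain and user-agent here are invented examples.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://newspaper.example/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://newspaper.example/2021/10/some-story.html"
if rp.can_fetch("example-archive-bot", url):
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)
```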
You can imagine people coming up with scenarios: “Oh my God, you know, is one copy on the internet going to make it so that we don’t have a business? Oh, wait a minute — that doesn’t happen.” Sort of a theoretical la-la-land of some people’s imaginations. We have a long history and it hasn’t happened, right?
It certainly seems to me, in the 14 years I’ve been at Harvard, that there’s been a very significant push within the university in the direction of open access and pushing back against academic publishers. It feels like, in this incredibly privileged institution at least, there’s been some movement in a positive direction. I’m curious if there are areas in this whole question of access where you’re seeing positive movement.
War No. 1 was about the plumbing. The ARPANET, evolving into the internet, versus AT&T. We did really well on that. Part of the reason was that AT&T was broken up in 1984, and so it was temporarily enfeebled. It’s now back, and it’s called AT&T again, which is just chutzpah. We now have very few choices for Tier 1 or last-mile solutions. So that was War No. 1.
War No. 2 was about open protocols versus closed. AOL versus the World Wide Web, right? And that was about Stallman, you know, and Tim Berners-Lee, and having open protocols — open, free, and open source software. That’s huge, and hugely influential towards not having a Microsoft-dominated, AOL-dominated world. Just draw the through line forward from the IBM days, you know — without free and open source software, and protocols that were open, life would have been very different.
That’s still doing okay. But the attacks on free and open source software by companies like Facebook and Google have been quite successful. They used a loophole: In the old days, if you used open-source software, you had to go and share whatever you added to it. So, you know, share and share-alike, as Larry Lessig put it. But it only applied to software that you distributed — that, you know, other people could use. But when you started getting cloud services, where you ran all of the software on your own servers, right, you never actually distributed it.
War No. 3 is the content level. That’s always what the Internet Archive has been designed for. And so we’ve had, you know, open educational resources, we’ve had Creative Commons. But I don’t know how successful it’s been in opening up access to academic work. Have the journals shifted over — the key journals in your area, are they open access, or are they still closed?
At the same time that’s happening, we have social media companies moving in the direction of intentionally impermanent media — you know, a Snapchat Story or Instagram Story that’s designed to disappear after 24 hours. Or Clubhouse, you know — audio conversations meant to be experienced in real time, not time-shifted as a podcast. I’m just wondering how you’ve been thinking about those issues as someone who runs this giant archive designed to keep everything forever.
But something being accessed 10 times or 100 times on the Internet Archive isn’t the same as the mass distribution of being on YouTube. People like to get binary about it, very black and white, but I think you can try to have some level of gray understanding. I say it’s important that it be preserved.
In 1996, things were moving along pretty fast. There’s Google starting up, which would definitely outstrip AltaVista and Inktomi. You had Wikipedia formed in 2001. Why, in 2021, are you still talking about digitizing books? I mean, come on, guys.
Haven’t you applied the AI technologies that we already had? You know, I was at the AI Lab at MIT to help make sense out of what’s going on out there, to go and help give people context. In the words of those days, “context is king.” And where are we on that? Well, that’s rhetorical — we’re almost nowhere on that. And it’s causing huge problems, with people being confused about what it is they’re seeing. Everything looks like a scientific paper. And so you can go and pick one out and find a scientific paper to say whatever it is you want, and then that gets promoted on Fox News.
I think of it as: The internet is an absolutely amazing, astonishing, wonderful thing for people who are sort of lean-forward information consumers. Dedicated infovores, people who enjoy consuming information, who seek it out with purpose and love having access to everything. But if you’re more of a lean-back information consumer, the information you used to get was often of middling quality, but it was still socially responsible in some broad sense. Your local daily newspaper wasn’t going to be pushing QAnon.