Archive for June 2008
Wired magazine has a new article that’s worth a read.
And a laugh or two. Yes, there’s some interesting stuff in here, but typical of Wired, it’s hyped.
Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”
This is highly contingent upon your definition of “succeed”.
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
Actually, they don’t. Numbers never speak for themselves.
Any social scientist will tell you that large datasets are fantastic. You can really find some interesting correlations in large populations when you have large enough datasets.
But there are big problems with large datasets, as well. For one thing, they’re never as big as you imagine them to be. There’s always data that exists outside the dataset that might actually be relevant to what you’re trying to figure out.
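There is a second problem, too: mine enough variables and strong correlations turn up by chance alone. Here is a minimal sketch of that, assuming nothing but made-up Gaussian noise; the function names, column counts, and seed are all my own illustrative choices, not anything from the Wired article. With 200 unrelated columns there are 19,900 pairs to compare, so some pair will usually look strongly related even though, by construction, nothing is related to anything.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_variables=200, n_samples=20, seed=2008):
    """Build unrelated random columns and return the most correlated pair.

    All parameters are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)
    # Every column is independent Gaussian noise, so no column has
    # anything to do with any other.
    columns = [[rng.gauss(0, 1) for _ in range(n_samples)]
               for _ in range(n_variables)]
    # Exhaustively compare every pair and keep the one with the largest |r|.
    i, j = max(
        itertools.combinations(range(n_variables), 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )
    return i, j, pearson(columns[i], columns[j])


if __name__ == "__main__":
    i, j, r = strongest_chance_correlation()
    print(f"columns {i} and {j}: r = {r:+.2f} (pure noise)")
```

The winning pair would pass for a "finding" if you didn't know how it was produced, which is the point: without some model of why two things should be connected, you can't tell this kind of blip from a real one.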
The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
With you so far, there, Mr. Anderson.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…
Here’s my bone of contention with this article. While Chris has some interesting points in the rest of the article, and a couple of great observations, this conclusion doesn’t precisely follow… and he does a terrible job of explaining exactly how “this approach to science” has actually been superseded, and by what.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Argh. No, correlation is not “enough”. Well, perhaps it is in one sense, more on that follows…
Chris Anderson goes on to give an example which illustrates directly what I’m talking about here:
Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It’s just data.
… and this represents… what?
If you know something exists because you mine a data set, that does indeed tell you something interesting. Something exists, and you didn’t know about it before.
But as pointed out in this very paragraph… it doesn’t tell you anything about it, other than it exists.
Certainly, this can change how you build a model. Instead of “hypothesize, model, test”, you now have “mine, discover, [something]”… but you have to have the “[something]” step, which he doesn’t talk about in this article at all. Otherwise, all you’ve done is discover the existence of a previously unknown thing. The significance of the thing is still unknown.
[Edited to Add]: Ed Felten’s commentary on the article.
I’ve had this argument before with other IT professionals and it’s nice to have some empirical evidence to back up my contention… that when it comes to organizational IT…
Patching vulnerabilities immediately is not really useful, and is not a substitute for good security.
Both are required reading, but the conclusion:
Collectively, our “Verizon Business 2008 Data Breach Investigations Report”, along with our earlier studies, suggests that getting the right mix of countermeasures in an enterprise is far from simple. Rather than “do more,” all three studies seem to suggest that we should “work smarter.” The Sasser study shows that in some cases working harder seems to not only consume significant resources, but is also sometimes counterproductive. Unfortunately, precious few of us have the data or risk models available to show us exactly how to focus our limited time and resources.
A control like patching, which has very simple and predictable behavior when used on individual computers, (i.e., home computers) seems to have more complex control effectiveness behavior when used in a community of computers (as in our enterprises).
Communities behave differently than individuals.
This reminds me of the differences between individual medicine and community health. After all, you can effectively treat an individual with cholera with a mixture of salt and sugar water, but putting salt and sugar in the drinking water does nothing to reduce cholera in the community.
Every time you deploy a patch, you’re changing software. Usually, the patch works as intended… however, sometimes it introduces a new security vulnerability (this is what happened with the Debian SSL patch), sometimes it has some unintended consequence for service availability (you just broke something your enterprise relies upon… whoops!), and sometimes the patch doesn’t matter in the slightest.
Developing change management controls over your patch management is a necessary step in managing your systems and services. Patching your systems is indeed something that you need to do, but having decent security controls in place is going to be a better use of your time.
On clients, system-wide automatic updates are fine. On servers… it’s something else altogether.
I’ve noticed that Megan has added a regular feature to her blog – a sidebar section called “Mcgeeisms”, with a quote from one of the Travis books. I’ve noticed as well that the quote has changed with about the regularity I would expect if Megan were undertaking a literary task I myself undertook about 8 years ago… reading all of the Travis McGee books in a block.
She’s on Dress Her in Indigo, which is halfway through the series chronologically. I don’t know if she’s actually reading them in order, or just in the order in which they sit on her bookshelf (Megan is not quite as overly organized as I have been accused of being when it comes to things like books, movies, CDs… etc.).
Meg, I hope you’re throwing some other things in there, and not just reading Travis. If you are, by the time you get to A Tan and Sandy Silence (or 13 books in, if you’re just reading them in no particular order) you’re going to find yourself overwhelmed by a melancholy period that lasts for about 3 months.
Throw some Carl Hiaasen in there, to cut it.
One of my rare family-unfriendly posts.
The only appropriate way for me to deal with the death of my favorite comedic giant is with comedy. For a more serious look at the career of George Carlin, see Matthew Caverhill.
George had a great skit comparing himself to Richard Pryor… I found the direct quote on Wikipedia:
“An update on the comedian health sweepstakes. I currently lead Richard Pryor in heart attacks 2 to 1. But Richard still leads me 1 to nothing in burning yourself up. See, it happened like this. First Richard had a heart attack. Then I had a heart attack. Then Richard burned himself up… and I said, ‘Fuck that. I’m having another heart attack!’”
Richard Pryor died three years ago, from a heart attack. George joined him yesterday. Looks like he takes the game, 3-2.
Rest well, you cantankerous old fart.
As promised in the last WW post, here’s the voting record on H.R. 6304, which passed overwhelmingly… 293-129, with 13 abstentions. Via GovTrack.
In related news, Senator and Democratic Presidential Nominee Barack Obama is jumping on the “it’s perfectly okay for the government to have this wild violation of the Fourth Amendment enshrined into law” bandwagon. Hope you weren’t looking for my money, sir.
Here are the few who voted “Nay”, so you know who the sane people are. Notice the near-total absence of Independents and GOP naysayers (exactly one Republican made the list). Not that the bill didn’t also pass with huge Democratic support, but it always strikes me as odd how the Republican party, with its “Big Government Bad, Government does everything badly” mantra, feels perfectly fine with handing over the authority to monitor basically every electronic communication without a warrant or legal due process to the same government it doesn’t trust to divvy up taxes, design decent educational programs, provide health care…
HI-1 Abercrombie, Neil [D]
ME-1 Allen, Thomas [D]
NJ-1 Andrews, Robert [D]
WI-2 Baldwin, Tammy [D]
CA-31 Becerra, Xavier [D]
OR-3 Blumenauer, Earl [D]
PA-1 Brady, Robert [D]
IA-1 Braley, Bruce [D]
CA-23 Capps, Lois [D]
MA-8 Capuano, Michael [D]
MO-3 Carnahan, Russ [D]
IN-7 Carson, André [D]
NY-11 Clarke, Yvette [D]
MO-1 Clay, William [D]
TN-9 Cohen, Steve [D]
MI-14 Conyers, John [D]
IL-12 Costello, Jerry [D]
CT-2 Courtney, Joe [D]
MD-7 Cummings, Elijah [D]
IL-7 Davis, Danny [D]
CA-53 Davis, Susan [D]
OR-4 DeFazio, Peter [D]
CO-1 DeGette, Diana [D]
MA-10 Delahunt, William [D]
CT-3 DeLauro, Rosa [D]
MI-15 Dingell, John [D]
TX-25 Doggett, Lloyd [D]
PA-14 Doyle, Michael [D]
MN-5 Ellison, Keith [D]
CA-14 Eshoo, Anna [D]
CA-17 Farr, Sam [D]
PA-2 Fattah, Chaka [D]
CA-51 Filner, Bob [D]
IL-14 Foster, Bill [D]
MA-4 Frank, Barney [D]
TX-20 Gonzalez, Charles [D]
AZ-7 Grijalva, Raul [D]
NY-19 Hall, John [D]
IL-17 Hare, Phil [D]
IN-9 Hill, Baron [D]
NY-22 Hinchey, Maurice [D]
HI-2 Hirono, Mazie [D]
NH-2 Hodes, Paul [D]
NJ-12 Holt, Rush [D]
CA-15 Honda, Michael [D]
OR-5 Hooley, Darlene [D]
WA-1 Inslee, Jay [D]
NY-2 Israel, Steve [D]
IL-2 Jackson, Jesse [D]
TX-18 Jackson-Lee, Sheila [D]
LA-2 Jefferson, William [D]
TX-30 Johnson, Eddie [D]
GA-4 Johnson, Henry [D]
IL-15 Johnson, Timothy [R]
OH-11 Jones, Stephanie [D]
WI-8 Kagen, Steve [D]
OH-9 Kaptur, Marcy [D]
RI-1 Kennedy, Patrick [D]
MI-13 Kilpatrick, Carolyn [D]
OH-10 Kucinich, Dennis [D]
WA-2 Larsen, Rick [D]
CT-1 Larson, John [D]
CA-9 Lee, Barbara [D]
MI-12 Levin, Sander [D]
GA-5 Lewis, John [D]
IA-2 Loebsack, David [D]
CA-16 Lofgren, Zoe [D]
MA-9 Lynch, Stephen [D]
NY-14 Maloney, Carolyn [D]
MA-7 Markey, Edward [D]
CA-5 Matsui, Doris [D]
MN-4 McCollum, Betty [D]
WA-7 McDermott, James [D]
MA-3 McGovern, James [D]
NY-21 McNulty, Michael [D]
FL-17 Meek, Kendrick [D]
ME-2 Michaud, Michael [D]
CA-7 Miller, George [D]
NC-13 Miller, R. [D]
WV-1 Mollohan, Alan [D]
WI-4 Moore, Gwen [D]
VA-8 Moran, James [D]
CT-5 Murphy, Christopher [D]
NY-8 Nadler, Jerrold [D]
CA-38 Napolitano, Grace [D]
MA-2 Neal, Richard [D]
MN-8 Oberstar, James [D]
WI-7 Obey, David [D]
MA-1 Olver, John [D]
NJ-6 Pallone, Frank [D]
NJ-8 Pascrell, William [D]
AZ-4 Pastor, Edward [D]
NJ-10 Payne, Donald [D]
NC-4 Price, David [D]
NY-15 Rangel, Charles [D]
NJ-9 Rothman, Steven [D]
CA-34 Roybal-Allard, Lucille [D]
OH-17 Ryan, Timothy [D]
CA-39 Sanchez, Linda [D]
CA-47 Sanchez, Loretta [D]
MD-3 Sarbanes, John [D]
IL-9 Schakowsky, Janice [D]
PA-13 Schwartz, Allyson [D]
VA-3 Scott, Robert [D]
NY-16 Serrano, José [D]
NH-1 Shea-Porter, Carol [D]
NY-28 Slaughter, Louise [D]
CA-32 Solis, Hilda [D]
CA-12 Speier, Jackie [D]
OH-13 Sutton, Betty [D]
CA-1 Thompson, C. [D]
MA-6 Tierney, John [D]
NY-10 Towns, Edolphus [D]
MA-5 Tsongas, Niki [D]
NM-3 Udall, Tom [D]
MD-8 Van Hollen, Christopher [D]
NY-12 Velazquez, Nydia [D]
MN-1 Walz, Timothy [D]
FL-20 Wasserman Schultz, Debbie [D]
CA-35 Waters, Maxine [D]
CA-33 Watson, Diane [D]
NC-12 Watt, Melvin [D]
CA-30 Waxman, Henry [D]
NY-9 Weiner, Anthony [D]
VT-0 Welch, Peter [D]
FL-19 Wexler, Robert [D]
CA-6 Woolsey, Lynn [D]
OR-1 Wu, David [D]
I haven’t seen Requiem for a Dream, but of course I’ve seen Ferris.
Found on Best Week Ever TV, via Digg, a mashup that shows how someone with talent can take a soundtrack and video and make the whole be quite different from the sum of the parts. The creator of this ought to be snapped up immediately by anybody who builds trailers for a living…
I’ve been posting a bit about this story in previous blog entries, but (conspicuously absent from the front pages of the Los Angeles Times and the New York Times) apparently the House and Senate have agreed on a compromise bill that effectively gives the Administration… everything they were demanding. Some “compromise”.
Text of the proposed legislation is available here. The ACLU weighs in with condemnation here. The EFF’s position is described here.
When this bill passes, as it undoubtedly will, I’ll post the list of blackguards and miscreants who voted to pass the bill here. Of course it will be challenged in court. I’m not sanguine about SCOTUS’s potential position here… about the only outlying hope I have is that the courts don’t like being told to mind their own business, thank you very much, so with luck in a year or two or three some challenge to this will make it to The Show and it will be overturned. Undoubtedly to the screams of “activist judges”.
I am bitterly, bitterly disappointed in Congress. This abdication of the balance of powers to the Executive branch simply makes no sense to me whatsoever. I do not understand how an Administration with an approval rating slightly better than sexually transmitted diseases can continue to score political victories against the opposing party. I’m offended that the Democratic candidate for President, who has a long-standing public position of opposing this legislation, has not come out vociferously against this “compromise” bill.
There is simply no justifiable reason for this massive intrusion into the digital world. The government has demonstrated again and again that it is not capable of building reasonably secure systems that have this level of access. Egregious misuse is inevitable, which is why we had a FISA court in the first place.
It’s very nearly official. It has been no secret that Big Brother had the capability before, but you soon will be able to state with certainty after this is passed… Big Brother is Watching You.
[edited to add]
Pelosi’s justification, via the Washington Post story:
Leading Democrats acknowledged that the surveillance legislation is not their preferred approach, but they said their refusal in February to pass the version supported by the Bush administration paved the way for victories on other legislation, such as the war funding bill.
“When they saw that we were unified in sending that bill rather than falling for their scare tactics, I think it sent them a message,” said House Speaker Nancy Pelosi (D-Calif.). “So our leverage was increased because of our Democratic unity in both cases.”
Sorry, I don’t buy it. I understand that “compromise” can mean “give a little here to get a little there” outside the scope of a single piece of legislation. But you can’t publicly quote that as your justification while simultaneously misrepresenting this legislation, and claiming that you’re making a good trade-off.
But, Pelosi argued, the bill also firmly rejects President Bush’s argument that a war-time chief executive has the “inherent authority” on some surveillance activity necessary to fight terrorists. It restores the legal notion that the FISA law is the exclusive rule on surveillance.
From Wired, a rebuttal to that point:
Under the proposal, the intelligence community will be able to issue broad orders to U.S. ISPs, phone companies and online communications services like Hotmail and Skype to turn over all communications that are reasonably believed to involve a non-American who is outside the country. The spy agencies will not have to name their targets or get prior court approval for the surveillance.
[edited to add]
A well-written summary by Kevin Drum of why this whole bill, which has been buried under the whole “amnesty” argument, is a really, really bad idea. From that piece:
Is this useful? Maybe. But we’re not listening in on al-Qaeda’s phone calls to America. We’re tapping the phones of anyone who fits a hazy and seldom accurate profile that NSA finds vaguely suspicious, a profile that inevitably includes plenty of calls in which one end is a U.S. citizen. But the new FISA bill doesn’t require NSA to get a warrant for any of these individuals or groups, it only requires a FISA judge to approve the broad contours of the profiling software. This raises lots of obvious concerns:
- The algorithms that determine NSA’s profiles are almost certainly extremely complex and technical — far beyond the capability of any lawyer to understand. So who gets to decide which algorithms are legitimate and which ones go too far? NSA’s computer programmers?
- What happens to the information that’s collected on the tens of thousands of people who turn out to be innocent bystanders? Is it kept around forever?
- Is this program limited solely to international terrorism? Are you sure? If it works, why not use it to fight drug smuggling, sex slave trafficking, and software piracy?
- Since this program was meant to be completely secret, what mechanism prevents eventual abuse? Because programs like this, even if they’re started with the best intentions, always get abused eventually.
The oversight on this stuff is inherently weak. After all, no court can seriously evaluate algorithms like this and neither can Congress. They don’t have the technical chops. Do the algorithms use ethnic background as one of their parameters? Membership in suspect organizations? Associations with foreigners? Residence in specific neighborhoods? Nobody knows, and no layman can know, because these things most likely emerge from other parameters rather than being used as direct inputs to the algorithm.