People Don't Write Programs

Recently I participated in some online discussions about the relative merits of various software verification tools and techniques. This post is not about that. While different tools and approaches currently differ in the scale of programs they can verify, I think it is safe to say that, with current tools, it’s easy to come up with examples that are relatively easy to prove correct deductively yet virtually impossible to model-check, examples that are trivial to model-check yet virtually impossible to prove deductively, and examples that are beyond the reach of either. There are examples where reasoning is most easily done following the syntactic composition of a program, and examples where a different, non-syntactic composition makes things easier.

What I do want to discuss is an implicit assumption at the core of all formal methods. This assumption is needed because we know of strong theoretical results showing that software verification in general (i.e., in the worst case) is infeasible. While programs written in some high-level language could potentially have made things easier even in the worst case (as the number of programs expressible as humanly-manageable code is vastly smaller than the total number of hypothetical programs), we know that, sadly, linguistic abstractions do not improve things in general (see my post, Why Writing Correct Software Is Hard, for a discussion of important results). And yet we persist in trying to come up with various tools that could mechanically or semi-mechanically verify our software, including proof assistants, model checkers, sound static analysis based on abstract interpretation, unsound static analysis, and concolic testing. Some of these tools aim high, at bug-free programs (model checkers, formal proofs), some aim a bit lower (concolic testing), and some mostly try to reduce or eliminate certain common and harmful classes of bugs (static analysis, ownership types). The approaches vary both in their depth (whether they aim to find any arbitrary kind of bug or just certain classes) and their breadth (whether they aim to eliminate bugs completely or merely to reduce their incidence), and they can differ widely in cost. The only reason we try so hard despite the grim theoretical limits is that we believe that many or most programs are far from the theoretical worst case. But why do we believe that?

One reason is that we simply have no choice. Maybe most programs are far from the worst case and maybe they’re not, but until we know for sure, we must work under the assumption that we can help, hopefully even significantly. This is a reasonable stance, and one that I completely identify with. But such a perspective is essentially humble: we’re not certain of success, but we try to do our best. Because some positions expressed in the various discussions are not so humble, I think there is another, more self-assured belief that underlies the core assumption of software verification, and it is based on a story: People write programs, and those programs often enough work more-or-less as expected. If a person writes a certain program, this means that they have some mental model of why it should work. This mental model must be simple enough for the programmer to hold it in a limited human mind. Either this mental model is true, in which case it should be, in principle, not too hard to translate it to a formal proof or verify it with a model-checker, or it isn’t, in which case the model-checker would find a counterexample and the programmer would despair of the proof. Put simply, the belief is that if a person is smart enough to write a more-or-less working program, then the program can be verified by some mechanical means.

While I certainly cannot disprove this belief, I am less certain of its validity, and therefore less certain that it’s a sure guarantee of our eventual ability to mechanically and fully verify real programs. The reason is that the very premise of this story — that people write programs — is wrong.

It is common to see this implicit belief in exasperated statements made by programmers. They say how ridiculous it is that we’ve been “reduced” to treating software systems empirically, as if they were mysterious natural phenomena rather than man-made objects. But the truth is that many if not most man-made systems behave in the same way. People make the economy, yet economists are far less successful than natural scientists in modeling the system they study. Politics is made by people, but we are equally hopeless at precisely modeling it.

People don’t write programs like in the story. I may be exaggerating — but not too much — in saying that people write programs only in the same sense that people make the economy, and that while writing them they have a mental model about as accurate and as predictive as the one they have of the economy when they participate in it. This mental model is not sufficient for mechanization, even with human guidance. My vague understanding of economics means I’m fairly certain that if I buy a car I won’t topple the economy of Sweden; proving it is another matter. Yet my inability to precisely model, and therefore mechanically and accurately reason about, the economy does not preclude the economy from behaving more-or-less reasonably much of the time.

A typical business software system has about five million lines of code, and it is not uncommon to find software five or ten times that size. The program is written and maintained by a team of scores or even hundreds of programmers at any one time, and many more over its entire lifetime; none of them has written “the program.” In a good team, many programmers would have a vague understanding of some significant portion of the system (say, 30-80%) and a thorough understanding of only a very small portion of it (5-10%). A programmer doesn’t write a program. She writes a feature — or ten. That feature interacts with the other ten thousand features in the program in ways that are hard for the programmer to predict with precision.

We can even try to understand the kind of reasoning that allows a programmer to correctly implement a feature, in order to demonstrate how imprecise and non-mathematical it is — yet how effective. It can be as simple as guessing that large, established systems tend to be stable. Programs are not stable in the sense that in most paradigms any single line could crash the system, but they grow stable as a development process: most feature additions or changes tend to be local and not require a full rewrite, because if they did, the system wouldn’t have survived until now. Therefore, the effects of my feature are likely to be local, and if they’re not, they’ll be minor, and if not that, they will be quickly discovered in testing. The mental model of the people writing features of the program is not analytic at all but empirical — not too different from that of the program’s users. This is not surprising, because the people writing features may be no more familiar with most of the program than its users are. A mathematician would find such an inductive, empirical form of reasoning completely non-mathematical (indeed, one more reason why programming is not mathematics), but: (1) a moment’s thought shows that this empirical argument is essentially correct; (2) it is completely sufficient to explain our current ability to write software and its relative success; and yet (3) it is completely insufficient to support the claim that a programmer’s informal reasoning could be formalized (it’s not even a deductive argument). In other words, this lizard-brain reasoning is surprisingly effective, yet clearly not very helpful from a verification point of view, because catastrophes happen precisely when this reasoning is wrong, which may happen more often than we’d want. We get by thanks to reasoning that is effective enough for the system to function, yet wrong often enough for us to look for other ways.

At this point, the clever but inexperienced verification fan would say, “Aha! That’s precisely why decomposition is so important! Every software component specifies its interface to its environment (the other components), and our programmer can use those specifications to reason mechanically about the feature’s interactions.” Unfortunately, this reasoning is faulty for a number of reasons.

The first is rather technical, but it shows why this whole endeavor ain’t easy. Assuming we were to attach specifications to the syntactic components of the program, the syntactic abstractions offered to us by established programming paradigms, imperative and functional alike — both products of the 1930s and ’40s, when notions of computation and computers were very different from those we have today — do not align with how features are composed. For example — forgive me, my examples are always terrible — suppose that a slideshow program has a save button that opens a dialog asking for a file name, and we want to add a feature whereby, if the file name given is not the one from which the slideshow was loaded and a file by that name already exists, the program displays a dialog warning about a possible overwrite. Adding this feature to a program composed of subroutines will not entail the addition of new components, but a modification of existing ones. Not only would we need to reason about the interaction of the components’ specifications, we would actually need to change existing specifications, including the specifications of all components that depend on the component we’re changing. Simple solutions, like naming the specification of the save operation (say, giving it a nominal type) and reusing it in all components that depend on it, don’t work: doing so would also change the specification of another feature we have, say autosave, and we may not want that specification to change because we do not want the autosave feature to behave in the same way (horrible example, but I hope you get the picture). The complete program may then still conform to its formal specification, but only because of an automatic and unwanted change to that specification. There are modern programming paradigms that try to make the syntactic or structural composition of programs better match our mental composition of features, like behavioral programming, but they’re not mainstream.[1]
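To make the example a bit more concrete, here is a minimal sketch in Python (the names Slideshow, write_slideshow, save_as and autosave are entirely hypothetical, not taken from any real program) of why the new behavior forces a modification of existing components rather than the addition of a new one:

```python
import os
from dataclasses import dataclass
from typing import Callable

@dataclass
class Slideshow:
    original_path: str  # the file the slideshow was loaded from
    content: str

def write_slideshow(show: Slideshow, path: str) -> None:
    """Shared persistence routine; its (implicit) specification is reused by every caller."""
    with open(path, "w") as f:
        f.write(show.content)

def save_as(show: Slideshow, path: str,
            confirm_overwrite: Callable[[str], bool]) -> None:
    """Save-dialog feature: the new overwrite warning has to be added here,
    modifying this existing component and, implicitly, its specification."""
    if path != show.original_path and os.path.exists(path):
        if not confirm_overwrite(path):  # the newly added behavior
            return
    write_slideshow(show, path)

def autosave(show: Slideshow) -> None:
    """Autosave feature: must keep writing silently, so it must not inherit the new check."""
    write_slideshow(show, show.original_path)
```

If we instead pushed the overwrite check into the shared write_slideshow routine (the moral equivalent of naming the save specification and reusing it everywhere), autosave would silently start prompting the user, which is exactly the kind of unwanted specification change described above.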

It is certainly possible that the kind of composition offered by most programming languages is more convenient for writing programs than one that aims to compose by features, but that may not matter, because what we want to know when we verify a program aligns with neither. For example, we may want to know all the conditions under which our slideshow program deletes a file, or to verify that it never deletes a file unless a warning dialog has been displayed. Answering such questions requires crossing both feature and routine boundaries. In fact, the kind of component-level specification that would assist people writing features may be very different from the specification required to guarantee user-facing behavior. We may want to ensure that a file is never deleted without a warning, but a feature implementor may be interested in internal information, such as when a certain directory is written to, or whether an update to an internal database happens before or after some file is deleted. Specifications that assist in writing features may be so detailed as to be equivalent to the program itself. And even if it were reasonable to expect a component to specify every internal behavior that affects user-facing behavior, that may not suffice to safely add new features.
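To give a flavor of what such a user-facing, whole-program requirement might look like (a sketch only; the event names delete and warn are hypothetical), the “no deletion without a prior warning” property could be written as a temporal-logic formula, e.g. in LTL with a past operator:

$$\mathbf{G}\,(\mathit{delete} \rightarrow \mathbf{O}\,\mathit{warn})$$

where G means “always” and O means “once in the past”: whenever a delete occurs, a warn has occurred at or before that point. The formula mentions no subroutine, module, or feature boundary; checking it cuts across whatever decomposition the program happens to have, which is the point made above.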

A bigger difficulty with composition is that it is simply not true that reasoning about the interaction of two or more well-specified components is always easy, because the feasibility of reasoning about programs is not closed under many common forms of composition. For example, the halting problem for a deterministic pushdown automaton is not only decidable but tractable (decidable in polynomial time), yet the halting problem for just two such automata composed with message passing is not only intractable but undecidable. This is a serious theoretical problem for those who present composition as a guarantee of the future success of reasoning about programs. Another important result shows that verifying the interaction of components is, in the worst case, no easier than verifying all the components together as one, meaning it requires the same effort and understanding as writing a non-decomposed program would. But no single person wrote that whole program, and none of the authors of its components has ever had the mental model required to prove their composition.

So we need another story to explain why we believe composition is feasible in practice. That story may go like this: People write each component for a reason, and so they should have a clear enough understanding of at least what it does, and this entails a simple interface specification. I think the premise of this story is suspect, and even if we believe it, I am not sure that the conclusion is a sufficient condition to guarantee a feasible cost of verifying the composition of specifications, for the theoretical reason given above. Reasoning difficulty is so drastically sensitive to changes in the number of components (we see such bifurcations everywhere in complexity theory) that placing the majority of real software in some stable subspace requires a deep theoretical understanding that we do not yet possess, and cannot be waved away with a story, especially one so tenuously connected with observation.

I am not at all pessimistic about the prospects of various tools and techniques for increasing the reliability of software, or our ability to create larger systems with similar reliability. But I do think we need to be cautious when betting on this or that approach, especially since we lack any theoretical framework that could explain why we expect real software to be feasibly analyzable in many circumstances. We already have various tools that have proven effective in some real-world cases, but they’re not silver bullets; far from it.[2]

The story of the person with the clear mental image certainly does apply when what they’re writing is an algorithm or even a high-level system design. Indeed, algorithms and architectures can often be given a manageable, fully-mechanizable mathematical representation. But people write algorithms and high-level designs; people don’t exactly write actual real-world software systems, at least not like in the story.

There’s something a bit ironic about it. The idea of a mathematical system of logic that would be able to answer all questions by mechanical computation began with Leibniz, but it turns out it is very difficult to answer questions about the system itself, not just in theory but also in practice. The most useful formal methods will be those that accept this reality, and answer the challenge by making baby steps, addressing particular problems in an ad hoc, incomplete way.

  1. A similar formalism has been proposed as a basis for formal reasoning about tax code.

  2. I should add that these days many advocates of certain tools and techniques are fully aware that most programmers are wary of promises of silver bullets, and so when describing what they actually do seem to believe is a silver bullet they add, “it’s not a silver bullet.” I propose that when making this empty qualification, advocates be required to explain why what they’re advocating is not a silver bullet. If they can’t, they are probably crypto-silver-bulletists who do not understand well enough the problems they pretend to solve.