Why are we still storing source code in text files?

Whenever I have what seems like an original idea I half expect someone to tell me it was done 10 years ago and point me to the GitHub repo. I’m going to go out on a limb and write it up anyway.

Programming is often collaborative now. People work in parallel on the same code base and periodically merge their changes back together into a single application. Source control tools have evolved to support this workflow, and while they do it very well, merging can still be a messy, manual and error-prone process.

Some code changes are particularly problematic. Renaming a variable can affect multiple lines of text. Moving things around inside text files (blocks of code, functions/methods) makes it difficult for text comparison tools to match up the source lines against the original source file. Source code beautification tools can reformat your entire code base, making the change diff light up like a Christmas tree (without actually changing any code).

So this got me thinking. For a while I wondered if diff/merge tools could be made smarter and more language aware, so that they could ignore formatting that didn’t change the meaning of the code.

But then I decided the real problem was: text files.

What’s wrong with text files?

My beef is this: program code in text format mixes different concerns into one single file.

Diff/merge tools can’t separate these concerns, so they just perform a line-by-line text compare/merge. This leads to more merge conflicts, manual checking and errors than would otherwise occur.

Consider this code I pulled out of a random source file:

int Basic4GLEditor::GetWatchIndex(QListWidgetItem* item)
{
	int index = GetListItemIndex(ui->watchList, item);

	// Note: Last watch item is dummy blank one (can be double clicked to create a new item).
	// Treat it as "no item clicked"
	if (index == ui->watchList->count() - 1)
		index = -1;

	return index;
}

void Basic4GLEditor::InternalValToString(vmValType valType, std::string& result)
{
	int maxChars = VM_DATATOSTRINGMAXCHARS;
	result = vm.ValToString(vm.Reg(), valType, maxChars);
}

There are a number of concerns mixed together:

  1. We have two methods which define how the code is called externally.
  2. We have instructions which tell the computer what to do.
  3. The methods have a particular order.
  4. We have whitespace mixed throughout to indent it and lay it out.
  5. We have internal variable names (“index”, “maxChars”).
  6. We have comments documenting what the functions do.

(And there are concerns like parameter names, but this should be enough to illustrate…)

To illustrate the problem, imagine that Bob checks out the code and renames one of the local variables:

int Basic4GLEditor::GetWatchIndex(QListWidgetItem* item)
{
	int watchIndex = GetListItemIndex(ui->watchList, item);

	// Note: Last watch item is dummy blank one (can be double clicked to create a new item).
	// Treat it as "no item clicked"
	if (watchIndex == ui->watchList->count() - 1)
		watchIndex = -1;

	return watchIndex;
}

At the same time, Mary checks out the original code, refactors it a little and adds some logging:

int Basic4GLEditor::GetWatchIndex(QListWidgetItem* item)
{
	int index = GetListItemIndex(ui->watchList, item);

	// Note: Last watch item is dummy blank one (can be double clicked to create a new item).
	// Treat it as "no item clicked"
	if (index + 1 == ui->watchList->count())
		return -1;

	Log("Watch item clicked: %d", index);

	return index;
}

Committing both sets of changes back to source control will result in a merge conflict between the two lines of code in the if statement, which must be resolved manually. What’s more, the person resolving the conflict will have to recognise that neither Bob’s nor Mary’s code is now correct. The correctly merged code is Bob’s variable rename applied to Mary’s code.

Worse, Mary’s new logging line will be merged in automatically, as it is simply an insert. This causes an even bigger problem: because the line still refers to the local variable by its old name, “index”, it is no longer correct. The merge tool has no way of knowing this, and will quietly insert the new line without protesting. The problem will only be discovered later, when the code is compiled.

An alternative to text files

Let’s try a different format for storing source code and see if it helps us in this scenario. Instead of a plain text file, let’s try organising the code into a key-value table.

First we’ll add the method declarations:

Key                Value
MTHD_00000_NAME    Basic4GLEditor::GetWatchIndex
MTHD_00000_PARAMS  QListWidgetItem* item
MTHD_00000_BODY    INSTR_00000
MTHD_00001_NAME    Basic4GLEditor::InternalValToString
MTHD_00001_PARAMS  vmValType valType, std::string& result
MTHD_00001_BODY    INSTR_00100

Then we’ll add the instructions:

Key               Value
INSTR_00000_CODE  int [VAR_00000] = GetListItemIndex(ui->watchList, item)
INSTR_00000_NEXT  INSTR_00001
INSTR_00001_CODE  if ([VAR_00000] == ui->watchList->count() - 1)
INSTR_00001_BODY  INSTR_00002
INSTR_00001_NEXT  INSTR_00003
INSTR_00002_CODE  [VAR_00000] = -1
INSTR_00003_CODE  return [VAR_00000]
INSTR_00100_CODE  int [VAR_00001] = VM_DATATOSTRINGMAXCHARS
INSTR_00100_NEXT  INSTR_00101
INSTR_00101_CODE  result = vm.ValToString(vm.Reg(), valType, [VAR_00001])

I’ve also taken the liberty of replacing the local variables with [VAR_xxxxx] references. So let’s add them as key-value pairs as well:

Key             Value
VAR_00000_NAME  index
VAR_00001_NAME  maxChars

OK, so this is not a very human-readable or maintainable format, but writing an editor which can read this format and transform it into human-readable source code would not be impossible.

The key of each entry is built from its type, a unique ID, and an attribute. (E.g. VAR_00001_NAME stores the name of the variable with ID 00001.) For simplicity I’ve used an incrementing number for the ID, but a real implementation should use something globally unique (like a GUID).

An important characteristic of this format is that the keys are unordered. To achieve this I’ve linked the instructions together with a NEXT attribute to form a forward chained list. The “if” instruction also has a BODY link, which links to the body of the if statement.

This also means that braces and semicolons are no longer necessary to denote code blocks and terminate instructions, so I’ve left them out.
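As a rough illustration of the editor side, here is a minimal C++ sketch that turns the instruction entries back into readable code. It assumes the table has been loaded into a std::map; the names Store, substituteNames, renderChain and watchIndexStore are invented for this example, and the output style (four-space indent, braces around every body) is just one arbitrary choice.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory form of the article's key-value table.
using Store = std::map<std::string, std::string>;

// Replace each [VAR_xxxxx] reference with the name stored under VAR_xxxxx_NAME.
std::string substituteNames(const Store& store, std::string code) {
    std::string::size_type open = code.find("[VAR_");
    while (open != std::string::npos) {
        const std::string::size_type close = code.find(']', open);
        assert(close != std::string::npos && "unterminated variable reference");
        const std::string id = code.substr(open + 1, close - open - 1);
        const std::string& name = store.at(id + "_NAME");  // throws if unknown
        code.replace(open, close - open + 1, name);
        open = code.find("[VAR_", open + name.size());
    }
    return code;
}

// Follow NEXT links from `id`, recursing into BODY links, and put back the
// indentation, braces and semicolons that the stored format leaves out.
std::string renderChain(const Store& store, std::string id, int depth = 0) {
    const std::string indent(depth * 4, ' ');
    std::string out;
    while (!id.empty()) {
        const std::string code = substituteNames(store, store.at(id + "_CODE"));
        const Store::const_iterator body = store.find(id + "_BODY");
        if (body != store.end()) {
            out += indent + code + " {\n";
            out += renderChain(store, body->second, depth + 1);
            out += indent + "}\n";
        } else {
            out += indent + code + ";\n";
        }
        const Store::const_iterator next = store.find(id + "_NEXT");
        id = (next != store.end()) ? next->second : std::string();
    }
    return out;
}

// The GetWatchIndex entries from the tables above.
Store watchIndexStore() {
    Store s;
    s["INSTR_00000_CODE"] = "int [VAR_00000] = GetListItemIndex(ui->watchList, item)";
    s["INSTR_00000_NEXT"] = "INSTR_00001";
    s["INSTR_00001_CODE"] = "if ([VAR_00000] == ui->watchList->count() - 1)";
    s["INSTR_00001_BODY"] = "INSTR_00002";
    s["INSTR_00001_NEXT"] = "INSTR_00003";
    s["INSTR_00002_CODE"] = "[VAR_00000] = -1";
    s["INSTR_00003_CODE"] = "return [VAR_00000]";
    s["VAR_00000_NAME"] = "index";
    return s;
}
```

Calling renderChain(watchIndexStore(), "INSTR_00000") should give back the body of GetWatchIndex with its semicolons and braces restored. Changing VAR_00000_NAME and rendering again picks up the new name everywhere, without touching a single instruction entry.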

Merging changes

With an unordered key-value store, diffing and merging are quite simple:

  • Any new key is an insert.
  • Any missing key is a delete.
  • Any entry with the same key whose value has changed is an update.

And because keys are unique and the order doesn’t matter, it is trivial to match up entries.
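The three rules above fit in a few lines of code. The following is only a sketch under the assumption that each snapshot of the code is held in a std::map; Op, OpKind, diffStores and applyOps are names made up for the illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of the key-value table.
using Store = std::map<std::string, std::string>;

enum class OpKind { Insert, Delete, Update };

struct Op {
    OpKind kind;
    std::string key;
    std::string value;  // unused for Delete
};

// Compare two snapshots key by key. Each key is looked up directly, so there
// is no line matching step that could pair up the wrong entries.
std::vector<Op> diffStores(const Store& base, const Store& edited) {
    std::vector<Op> ops;
    for (const auto& entry : edited) {
        const auto old = base.find(entry.first);
        if (old == base.end()) {
            ops.push_back(Op{OpKind::Insert, entry.first, entry.second});
        } else if (old->second != entry.second) {
            ops.push_back(Op{OpKind::Update, entry.first, entry.second});
        }
    }
    for (const auto& entry : base) {
        if (edited.find(entry.first) == edited.end()) {
            ops.push_back(Op{OpKind::Delete, entry.first, std::string()});
        }
    }
    return ops;
}

// Replay a list of operations onto a copy of the store.
Store applyOps(Store store, const std::vector<Op>& ops) {
    for (const Op& op : ops) {
        if (op.kind == OpKind::Delete) {
            assert(store.count(op.key) == 1 && "deleting a key that is not there");
            store.erase(op.key);
        } else {
            store[op.key] = op.value;  // Insert and Update both just set the value
        }
    }
    return store;
}
```

A useful sanity check on a sketch like this is the round trip: applying diffStores(base, edited) to base with applyOps should reproduce edited exactly.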

Merging two change sets involves applying the inserts and deletes, then looking for updates to the same keys – which are your merge conflicts.

So let’s consider Bob and Mary’s changes in the context of this format.

Bob’s variable rename is simply:

Operation  Key             Value
UPDATE     VAR_00000_NAME  watchIndex

All the instructions refer to the variable via its ID, which doesn’t change, so the instructions remain unaffected.

Mary’s changes are a bit longer:

Operation  Key               Value
UPDATE     INSTR_00001_CODE  if ([VAR_00000] + 1 == ui->watchList->count())
UPDATE     INSTR_00001_NEXT  INSTR_00004
UPDATE     INSTR_00002_CODE  return -1
INSERT     INSTR_00004_CODE  Log("Watch item clicked: %d", [VAR_00000])
INSERT     INSTR_00004_NEXT  INSTR_00003

Essentially we update 2 lines of code and insert another, which requires changing the “NEXT” links to ensure the new line ends up in the right place.

Merging these two change sets is straightforward. None of the keys conflict with each other, so there are no merge conflicts. And because Mary’s changes and insertion all refer to the local variable via its ID, they remain correct even after the variable is renamed.

The result is a clean merge and a correct program with no manual intervention.
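To make the Bob and Mary scenario concrete, here is a small C++ sketch of the merge. It is one possible reading of the rules rather than a definitive design: Change, ChangeSet, MergeResult and mergeChangeSets are invented names, and the sketch makes the extra assumption that any key changed differently by both sides counts as a conflict, while an identical change on both sides does not.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Store = std::map<std::string, std::string>;

enum class OpKind { Insert, Update, Delete };

struct Change {
    OpKind kind;
    std::string value;  // unused for Delete
};

// One programmer's edits, keyed by the entry they touch.
using ChangeSet = std::map<std::string, Change>;

struct MergeResult {
    Store merged;
    std::vector<std::string> conflicts;  // keys changed differently by both sides
};

void applyChange(Store& store, const std::string& key, const Change& change) {
    if (change.kind == OpKind::Delete) {
        store.erase(key);
        return;
    }
    // With globally unique IDs an insert should never hit an existing key.
    assert(change.kind != OpKind::Insert || store.count(key) == 0);
    store[key] = change.value;
}

MergeResult mergeChangeSets(const Store& base, const ChangeSet& ours,
                            const ChangeSet& theirs) {
    MergeResult result;
    result.merged = base;
    for (const auto& entry : ours) {
        applyChange(result.merged, entry.first, entry.second);
    }
    for (const auto& entry : theirs) {
        const auto clash = ours.find(entry.first);
        if (clash == ours.end()) {
            applyChange(result.merged, entry.first, entry.second);
        } else if (clash->second.kind != entry.second.kind ||
                   clash->second.value != entry.second.value) {
            result.conflicts.push_back(entry.first);  // keep ours, report the key
        }
        // Otherwise both sides made the identical change, already applied.
    }
    return result;
}

// The original GetWatchIndex entries.
Store baseStore() {
    return {
        {"INSTR_00000_CODE", "int [VAR_00000] = GetListItemIndex(ui->watchList, item)"},
        {"INSTR_00000_NEXT", "INSTR_00001"},
        {"INSTR_00001_CODE", "if ([VAR_00000] == ui->watchList->count() - 1)"},
        {"INSTR_00001_BODY", "INSTR_00002"},
        {"INSTR_00001_NEXT", "INSTR_00003"},
        {"INSTR_00002_CODE", "[VAR_00000] = -1"},
        {"INSTR_00003_CODE", "return [VAR_00000]"},
        {"VAR_00000_NAME", "index"},
    };
}

ChangeSet bobsChanges() {
    return {{"VAR_00000_NAME", {OpKind::Update, "watchIndex"}}};
}

ChangeSet marysChanges() {
    return {
        {"INSTR_00001_CODE", {OpKind::Update, "if ([VAR_00000] + 1 == ui->watchList->count())"}},
        {"INSTR_00001_NEXT", {OpKind::Update, "INSTR_00004"}},
        {"INSTR_00002_CODE", {OpKind::Update, "return -1"}},
        {"INSTR_00004_CODE", {OpKind::Insert, "Log(\"Watch item clicked: %d\", [VAR_00000])"}},
        {"INSTR_00004_NEXT", {OpKind::Insert, "INSTR_00003"}},
    };
}
```

Merging bobsChanges() and marysChanges() onto baseStore() should report no conflicts, since the two change sets share no keys. Two different renames of VAR_00000_NAME, on the other hand, would land on the same key and be reported for a human to resolve.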

Is this a silver bullet?

Of course not.

There will still be nasty merge conflicts, and difficult scenarios requiring a programmer to think hard about what each change set is trying to do.

But it does suggest that if we start thinking about other ways to break down the source code storage into a format that more cleanly separates concerns, we will make the merge tool’s job a lot simpler and decrease the likelihood of conflicts.

In this case we separated a variable’s name from a reference to that variable inside instruction code, which made renaming the variable a small isolated change. In fact there’s a good argument for applying this to all identifiers (method names, parameter names, class names etc) so that anything can be renamed cleanly and simply.

And there are other scenarios where this kind of approach helps, such as moving a block of instructions up or down within a function/method. As long as the IDE preserves the IDs of each instruction, the change is essentially just adjusting the linkage between a couple of instructions. If someone else concurrently edits one of the lines in the block you’ve moved, then the merge of your changes will include their edit to the line in its new position. This is something that text-file-based merge tools have difficulty with.
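Moving a single instruction can be sketched as pure relinking. This is a minimal illustration, assuming the same std::map representation and the NEXT/BODY attributes from the tables; findLinkTo, moveAfter and chainOrder are hypothetical helpers, and the instruction IDs in the usage note are made up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Store = std::map<std::string, std::string>;

bool endsWith(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Find the NEXT or BODY key that currently links to `id`, or "" if none does.
// A linear scan keeps the sketch short; a real tool would index the links.
std::string findLinkTo(const Store& store, const std::string& id) {
    for (const auto& entry : store) {
        const bool isLink =
            endsWith(entry.first, "_NEXT") || endsWith(entry.first, "_BODY");
        if (isLink && entry.second == id) return entry.first;
    }
    return std::string();
}

// Move instruction `id` so that it directly follows instruction `target`.
// Only link entries are written; the CODE entries are never touched.
void moveAfter(Store& store, const std::string& id, const std::string& target) {
    assert(id != target && "cannot move an instruction after itself");

    // Unlink: whatever pointed at `id` now points at its old successor.
    const std::string incoming = findLinkTo(store, id);
    const auto oldNext = store.find(id + "_NEXT");
    if (!incoming.empty()) {
        if (oldNext != store.end()) store[incoming] = oldNext->second;
        else store.erase(incoming);
    }

    // Relink: splice `id` between `target` and target's current successor.
    const auto targetNext = store.find(target + "_NEXT");
    if (targetNext != store.end()) store[id + "_NEXT"] = targetNext->second;
    else store.erase(id + "_NEXT");
    store[target + "_NEXT"] = id;
}

// Helper for inspecting the result: list instruction IDs in chain order.
std::vector<std::string> chainOrder(const Store& store, std::string id) {
    std::vector<std::string> order;
    while (!id.empty()) {
        order.push_back(id);
        const auto next = store.find(id + "_NEXT");
        id = (next != store.end()) ? next->second : std::string();
    }
    return order;
}
```

For a chain INSTR_00010, INSTR_00011, INSTR_00012, INSTR_00013, calling moveAfter(store, "INSTR_00011", "INSTR_00012") rewrites three NEXT entries and nothing else. A concurrent edit to INSTR_00011_CODE touches a different key, so the two changes merge without conflict.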

Parting thoughts

Tooling

If this approach (or something similar) were followed through, and a format created that captures everything we want to capture about program source code, it would still need tooling support.

The IDE in particular would need to understand the format and translate back and forth between it and what is displayed on screen to the programmer, and do so in a way that preserves the IDs of instructions across edits, including when code is cut, pasted, copied and deleted.

Source control systems would need to understand the format as well. This should be straightforward to implement internally, but conveying information to the user in a meaningful way would require some thought and design (how would you show a variable rename merge conflict for example?).

Discarded information

I deliberately discarded white space and ordering of methods from my example above.

There are ways to fit them in, I’m sure, but it’s worth stopping to think about whether that’s really necessary.

By removing white-space:

  • You discard a whole heap of information that doesn’t affect the program at all, and all the possible merge conflicts that come with it.
  • You avoid holy wars between programmers trying to establish a common formatting standard. (Each programmer can choose whichever standard they prefer).
  • You also avoid pull requests from someone who’s reformatted your entire code base with their favourite code formatting tool.

The IDE can re-insert white space, new lines, braces and semicolons automatically when translating it back to a human readable format, using whatever style conventions the programmer prefers.

MAKE tools

One advantage of having a format that separates program code information is that it’s easier for MAKE tools to determine when a change actually affects the compiled code. If you’ve ever sat there for a few minutes while the compiler rebuilt all your source files because you changed a comment in a common header file, then you get what I mean.

If comments are cleanly separated from actual code, then the MAKE tools can simply ignore the comment change and recompile only when entries that affect the compiled code are modified.

This also holds true for white-space changes, function ordering etc.

In fact you could arguably say that renaming a variable doesn’t affect the compiled code. There are some caveats to this, such as run time type information and debugging symbols, but it’s not ridiculous to envisage such information being split into two pieces: The debugging/RTTI information references variables via their internal ID, and a separate lookup structure then gives the variable name. In which case renaming a variable would only require rebuilding the ID->name table, which would often be significantly faster than recompiling the affected modules.
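A build tool’s decision could then be as simple as classifying the changed keys. The sketch below is hypothetical throughout: the _COMMENT attribute is not part of the format shown earlier and is assumed only for illustration, and the code takes the optimistic view that a local variable’s name never reaches the compiled output, which ignores the debugging-symbol and RTTI caveats just mentioned.

```cpp
#include <cassert>
#include <string>
#include <vector>

bool endsWith(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide whether a change to this key can alter the compiled output.
// Assumes keys follow the TYPE_ID_ATTRIBUTE layout used in the tables.
bool affectsCompiledCode(const std::string& key) {
    assert(!key.empty());
    // Hypothetical attribute: comments stored separately from code.
    if (endsWith(key, "_COMMENT")) return false;
    // Local variable names are only labels; instructions refer to the ID.
    const bool isLocalVariable = key.compare(0, 4, "VAR_") == 0;
    if (isLocalVariable && endsWith(key, "_NAME")) return false;
    // Everything else (CODE, NEXT, BODY, PARAMS, method names) is treated
    // conservatively as affecting the build.
    return true;
}

// A change set needs a recompile if any one of its keys affects the output.
bool needsRecompile(const std::vector<std::string>& changedKeys) {
    for (const std::string& key : changedKeys) {
        if (affectsCompiledCode(key)) return true;
    }
    return false;
}
```

Under these assumptions Bob’s rename (a single change to VAR_00000_NAME) would not trigger a rebuild, while Mary’s change set would, because it touches CODE and NEXT entries.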
