One big part of SemanticDiff is to make diffs less noisy by hiding irrelevant changes. Typical examples of such changes are line breaks added between function arguments or when a single-quoted string is converted to a double-quoted string. Just the kinds of changes that a code formatter may introduce to make your code more readable.
In most cases this is straightforward, just don’t highlight the added line break. But what about the cases where the answer isn’t so obvious? Let’s look at two Python examples to explain what I mean.
- x = "foo"
+ x = 'foo'
This is the obvious one: The way the string is written changes, but its content remains the same. There is nothing to highlight for SemanticDiff. However, if we change this example just a little bit, things get more interesting:
- x = f"foo {bar}"
+ x = "foo {bar}"
If we just look at the text, the only difference between the two lines is that we have removed the “f” in front of the string literal. By doing this, we have turned the formatted string literal into a normal string literal which has consequences for the content of the string. The {bar}
part is no longer a placeholder for a value and just becomes the text “{bar}”, even though this part of the code hasn’t changed.
To highlight or not to highlight?
The second example raises a question: How should we highlight a case where a change in one part of the code affects the semantics of another part? There are basically three options we can choose from:
Option 1: Highlight only the text that changes
We could treat this like a traditional diff and highlight only the text that changes:
- x = f"foo {bar}"+ x = "foo {bar}"
Pro | Con |
---|---|
The aim is to make diffs less noisy and highlighting unchanged bits of code adds visual noise. | The consequences of a change might get overlooked. |
Option 2: Highlight only the consequences
We only show how the semantic interpretation has changed and ignore the changed parsing instruction:
- x = f"foo {bar}"+ x = "foo {bar}"
Pro | Con |
---|---|
The parsing instruction (f prefix) is not part of the code logic and should be ignored. |
Looks like a bug, we highlight unchanged code. |
Option 3: Highlight both
We can highlight both changes, the visible and “invisible” one:
- x = f"foo {bar}"+ x = "foo {bar}"
Pro | Con |
---|---|
The user has a better chance of finding out why an unchanged piece of code has been highlighted. | Adds the most “visual noise”. |
Each option has its own advantages and disadvantages and there is probably none that everyone would agree on. However, we believe that the purpose of a semantic diff is not only to hide changes, but also to inform developers about easily overlooked changes. We therefore use option 3, but can override this on a case-by-case basis if another option is clearly better.
A better solution?
This example showcased a very simple case, but things can get much more complex. Just consider languages that support macros. If you modify their definition, the change can affect many different places in the code. We therefore think that neither of the above options is good in the long run.
To avoid confusing developers, we need to extend the way diffs are displayed. By adding a third change type (besides added and removed code) that uses a different color (not red/green), we can clearly mark areas of code that have changed semantics even though the text remains the same. Ideally, the diff could also indicate which change caused the semantic difference.
Do you agree or would you choose a different solution?