Like most online manuals, the Krita manual has a contributor’s guide. It’s filled with things like “who is our assumed audience?”, “what is the dialect of English we should use?”, etc. It’s not a perfect guide, and it’s definitely outdated in places, but I think it does its job.
So, sometimes I, as the person who officially maintains the Krita manual, look at other projects’ contributor’s guides. And usually what I find there is…
Style Guides
The purpose of a style guide is to achieve consistency in writing across the whole project. This can make the text easier to read and simpler to translate. In principle, the Krita manual also has a style guide, because it stipulates that you should use American English. But when I find style guides in other projects, they’re often filled with things like “active sentences, not passive sentences” and “use the Oxford comma”.
Active sentences versus passive sentences always gets me. What it means is the difference between “dog bites man” and “man is bitten by dog”. The latter sentence is the one in the passive voice. There’s nothing grammatically incorrect about it. It’s a bit longer, sure, but on the other hand, there’s value in being able to rearrange the sentence like that. For a more Krita-specific example, consider this:
“Pixels are stored in working memory. Working memory is limited by hardware.”
“Working memory stores the pixels. Hardware limits the working memory.”
The first example is two sentences in the passive voice, the latter two in the active. The passive voice example is longer, but it is also easier to read, as it groups the concepts together and introduces new concepts later in the paragraph. Because we grouped the concepts, we can even merge the sentences:
“Pixels are stored in working memory, which is limited by hardware.”
But surely, if so many manuals have this in their guide, maybe there is a reason for it? No, the reason it’s in so many manuals’ style guides, is because other manuals have it there. And the reason other manuals have it there, is because magazines and newspapers have it there. And the reason they have that, is because it is recommended by style guides like The Elements of Style. There is some(?) value for magazines and newspapers in avoiding the passive voice because it tends to result in longer sentences than the active voice, but for electronic manuals, I don’t see the point of worrying about things like these. We have an infinite word count, so maybe we should just use that to make the text easier to read?
The problem with copying style rules like this is compounded by the fact that a lot of people don’t really know how to write. In a lot of those cases, the style guide seems to be there to let the project role-play at being a serious documentation project, if not as a way to ‘look busy’, and it can be very confusing to the person whose text is being proofread. I accepted the need for the active voice in my university papers, because I figured my teachers wanted to help me lower my word count. I stopped accepting it when I discovered they couldn’t actually identify the passive voice and were pointing at paragraphs that needed no work at all.
This kind of insecurity-driven proofreading becomes especially troublesome when you consider that sometimes “incorrect” language is caused by the writer using a dialect. It makes sense to avoid dialect in documentation, as dialects contain specific language features that not everyone may know, but it’s another thing entirely to tell people their dialect is wrong. So in these cases, it’s imperative that the proofreader knows why certain rules are in place, so they can communicate why something should be changed without making the dialect speaker insecure about their language use.
Furthermore, a lot of these style guide rules are filled with linguistic jargon, which is abstract and often derived from Latin. People who are not confident in their writing will find such terms intimidating as well as hard to understand, and this in turn makes them less likely to contribute. In a lot of those cases, we can actually identify the problems in question with a computer program. So maybe we should just do that, and not fill our contributor’s guide with scary linguistic terms?

LanguageTool
Despite my relaxed approach to proofreading, I too have points at which I draw the line. In particular, there are things like typos, missing punctuation, and errant white-spaces. All of these are pretty uncontroversial.
In the past, I’d occasionally run LanguageTool over the text. LanguageTool is a Java-based style and grammar checker licensed under the LGPL 2.1. It has a plugin for LibreOffice, which I used a lot when writing university papers. However, by itself LanguageTool cannot recognize mark-up. To run it over the Krita documentation, I had to first run the text through pandoc to convert it from reStructuredText to plain text, which was then fed to the LanguageTool jar.
I semi-automated this task via a bash script:
#!/bin/sh
# Run this file inside the LanguageTool folder.
# First argument is the documentation folder, second your mother tongue.
for file in $(find "$1" -iname "*.rst");
do
    pandoc -s "$file" -f rst -t plain -o ~/checkfile.txt
    echo "$file"
    echo "$file" >> ~/language_errors.txt
    # Run LanguageTool for en-US, without the multiple-whitespace check and without the m-dash suggestion rule, using the second argument as the mother tongue to check for false friends.
    java -jar languagetool-commandline.jar -l en-US -m "$2" --json -d "WHITESPACE_RULE,DASH_RULE[1]" ~/checkfile.txt >> ~/language_errors.txt
    rm ~/checkfile.txt
done
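Saved as, say, check_docs.sh (the file name, the path and the language code below are just an example), it would be run from inside the LanguageTool folder like so:
sh check_docs.sh ~/docs-krita-org nl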
This worked pretty decently, though there were a lot of false positives (mitigated a bit by turning off some rules). It was also always a bit of a trick to find the precise location of an error, because the conversion to plain text shifted its position.
I had to give up on this hacky method when we started to include Python support, as that meant Python code examples, and there was no way to tell pandoc to strip the code examples. That in turn meant there were just too many false positives to wade through.
There is a way to handle mark-up, though, and that’s by writing a Java wrapper around LanguageTool that parses the marked-up text and then tells LanguageTool which parts are mark-up and which parts can be analyzed as text. I kind of avoided doing this for a while because I had better things to do than to play with regexes, and my Java is very rusty.
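The core of that idea is LanguageTool’s AnnotatedTextBuilder: you hand it the mark-up and the plain text as separate pieces, build an AnnotatedText from it, and the checker then only analyzes the text parts while still reporting positions relative to the original line, mark-up included. A stripped-down sketch of what that looks like (the example string, the class name and the deliberate typo are just for illustration):
import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.markup.AnnotatedText;
import org.languagetool.markup.AnnotatedTextBuilder;
import org.languagetool.rules.RuleMatch;

public class RstCheckSketch {
    public static void main(String[] args) throws Exception {
        // Build an annotated text in which the reStructuredText mark-up is
        // declared as mark-up, so only the prose gets checked.
        AnnotatedTextBuilder builder = new AnnotatedTextBuilder();
        builder.addText("Krita stores pixels in ");
        builder.addMarkup("**");
        builder.addText("working memory");
        builder.addMarkup("**");
        builder.addText(", wich is limited by hardware."); // deliberate typo

        AnnotatedText annotated = builder.build();
        JLanguageTool langTool = new JLanguageTool(new AmericanEnglish());
        List<RuleMatch> matches = langTool.check(annotated);
        for (RuleMatch match : matches) {
            System.out.println(match.getRule().getId() + " at "
                    + match.getFromPos() + "-" + match.getToPos()
                    + ": " + match.getMessage());
        }
    }
}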
One of the things that motivated me to look at it again was the appearance of the code quality widget in the GitLab release notes. Because one of my worries is that notions of “incorrect” language can be used to bully people, I was looking for ways to indicate that everything LanguageTool puts out should be considered a suggestion first and foremost. The code quality widget is just a tiny widget that hangs out underneath the merge request description, says how many extra mistakes the merge request introduces, and is intended to be used with static analysis tools. It doesn’t block the MR, it doesn’t confuse the discussion, and it takes a JSON input, so I figured it’d be the ideal candidate for something as trivial as style mistakes.
So, I started up Eclipse, followed the instructions on using the Java API (with an intermission in which I realized I had never used Maven and needed a quick tutorial), and started writing regular expressions.
Reusing KSyntaxHighlighter?
So, people who know KDE’s many frameworks will know that we have a collection of assorted regexes and the like for a wide variety of mark-up systems and languages, KSyntaxHighlighter, and that it has support for reStructuredText. I had initially hoped I could just write something that takes the rest.xml file and uses it to identify the mark-up for LanguageTool.
Unfortunately, the regex needs of KSyntaxHighlighter are very different from the ones I need for LanguageTool. KSyntax needs to know whether we have entered a certain context based on the mark-up, but it doesn’t really need to identify the mark-up itself. For example, the mark-up for strong in reStructuredText is **strong**. The regular expression to detect this in rest.xml is \*\*[^\s].*\*\*, translated: find a *, another *, a character that is not a space, a sequence of zero or more characters of any kind, another *, and finally a *.
What I ended up needing is "(?<bStart>\*+?)(?<bText>[^\s][^\*]+?)(?<bEnd>\*+?)", translated: find a group of *, name it ‘bStart’, followed by a group that does not start with a space and any number of characters after it that are not a *, name this ‘bText’, followed by a group of *, name this ‘bEnd’.
The bStart/bText/bEnd names allow me to append the groups separately to the AnnotatedTextBuilder:
if (inlineMarkup.group("bStart") != null) {
    builder.addMarkup(line.substring(inlineMarkup.start("bStart"), inlineMarkup.end("bStart")));
    handleReadingMarks(line.substring(inlineMarkup.start("bText"), inlineMarkup.end("bText")));
    builder.addMarkup(line.substring(inlineMarkup.start("bEnd"), inlineMarkup.end("bEnd")));
}
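For context: the inlineMarkup object in that snippet is just a java.util.regex matcher over the line being processed, roughly like this (a simplified sketch, not the exact code, with handleReadingMarks() standing in for the part that hands plain text to the builder):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.languagetool.markup.AnnotatedTextBuilder;

class InlineMarkupSketch {
    // The named-group pattern from above, with the backslashes escaped for Java.
    private static final Pattern STRONG =
            Pattern.compile("(?<bStart>\\*+?)(?<bText>[^\\s][^\\*]+?)(?<bEnd>\\*+?)");

    void annotateLine(String line, AnnotatedTextBuilder builder) {
        Matcher inlineMarkup = STRONG.matcher(line);
        while (inlineMarkup.find()) {
            // ...the if-block shown above goes here...
        }
    }
}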
So I had to abandon the idea of reusing the KSyntaxHighlighter format for this and write my own regexes.
Results
Eventually, I had something that worked. I managed to get it to write the errors it found to a JSON file that should work with the code quality widget. I also implemented an accepted words list, which at the very least took a third off the initial set of errors. Running it on the 5000-word KritaFAQ, it found about 105 errors, most of which are misspelled brand names, but it also found missing commas and errant white-spaces.
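The accepted words list boils down to telling the spelling rules to skip certain words; a minimal sketch of that idea (the class name and the example words are made up for illustration, the real list lives in its own file):
import java.util.Arrays;
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;
import org.languagetool.rules.Rule;
import org.languagetool.rules.spelling.SpellingCheckRule;

public class AcceptedWordsSketch {
    public static void main(String[] args) {
        JLanguageTool langTool = new JLanguageTool(new AmericanEnglish());
        // Tell every active spelling rule to ignore the accepted words.
        for (Rule rule : langTool.getAllActiveRules()) {
            if (rule instanceof SpellingCheckRule) {
                ((SpellingCheckRule) rule).addIgnoreTokens(
                        Arrays.asList("Krita", "reStructuredText", "KritaFAQ"));
            }
        }
    }
}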
A small sample of the error output:
{
  "severity": "info",
  "fingerprint": "docs-krita-org/KritaFAQ.rst:8102:8106",
  "description": "Did you mean <suggestion>Wi-Fi<\/suggestion>? (This is the officially approved term by the Wi-Fi Alliance.) (``wifi``)",
  "check_name": "WIFI[1]",
  "location": {
    "path": "docs-krita-org/KritaFAQ.rst",
    "position": {
      "end": {"line": 176},
      "begin": {"line": 176}
    },
    "lines": {"begin": 176}
  },
  "categories": ["Style"],
  "type": "issue",
  "content": "Type: Other, Category: Possible Typo, Position: 8102-8106 \n\nIt might be that your download got corrupted and is missing files (common with bad wifi and bad internet connection in general), in that case, try to find a better internet connection before trying to download again. \nProblem: Did you mean <suggestion>Wi-Fi<\/suggestion>? (This is the officially approved term by the Wi-Fi Alliance.) \nSuggestion: [Wi-Fi] \nExplanation: null"
},
{
  "severity": "info",
  "fingerprint": "docs-krita-org/KritaFAQ.rst:8379:8388",
  "description": "Possible spelling mistake found. (``harddrive``)",
  "check_name": "MORFOLOGIK_RULE_EN_US",
  "location": {
    "path": "docs-krita-org/KritaFAQ.rst",
    "position": {
      "end": {"line": 177},
      "begin": {"line": 177}
    },
    "lines": {"begin": 177}
  },
  "categories": ["Style"],
  "type": "issue",
  "content": "Type: Other, Category: Possible Typo, Position: 8379-8388 \n\nCheck whether your harddrive is full and reinstall Krita with at least 120 MB of empty space. \nProblem: Possible spelling mistake found. \nSuggestion: [hard drive] \nExplanation: null"
},
{
  "severity": "minor",
  "fingerprint": "docs-krita-org/KritaFAQ.rst:8546:8550",
  "description": "Use a comma before 'and' if it connects two independent clauses (unless they are closely connected and short). (`` and``)",
  "check_name": "COMMA_COMPOUND_SENTENCE[1]",
  "location": {
    "path": "docs-krita-org/KritaFAQ.rst",
    "position": {
      "end": {"line": 177},
      "begin": {"line": 177}
    },
    "lines": {"begin": 177}
  },
  "categories": ["Style"],
  "type": "issue",
  "content": "Type: Other, Category: Punctuation, Position: 8546-8550 \n\nIf not, and the problem still occurs, there might be something odd going on with your device and it's recommended to find a computer expert to diagnose what is the problem.\n \nProblem: Use a comma before 'and' if it connects two independent clauses (unless they are closely connected and short). \nSuggestion: [, and] \nExplanation: null"
}
There are still a number of issues: some mark-up is still not processed, I need to figure out how to calculate the column, and I’m simply unhappy with the command line arguments (they’re positional only right now).
One of the things I am really worrying about is the severity of errors. Like I mentioned before, dialects often get targeted by tools that decide what counts as “incorrect” language, and LanguageTool does have rules that target slang and dialect. Similarly, people tend to take suggestions from computers without question, so I’ll need to introduce some configuration:
- Configuration to turn rules on and off (a possible approach is sketched below this list).
- Uncontroversial errors should be given a higher severity, so that people are less likely to assume that every reported error has to be fixed.
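For the first point, LanguageTool’s Java API already allows disabling rules by their ID (the same IDs as those passed to -d in the bash script above), so a config file with one rule ID per line could be wired up roughly like this (just a sketch, nothing of this is implemented yet):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.languagetool.JLanguageTool;

class RuleConfigSketch {
    // Hypothetical: read rule IDs from a plain text file, one per line, and
    // disable each of them on the JLanguageTool instance.
    static void applyDisabledRules(JLanguageTool langTool, String configPath) throws IOException {
        List<String> ruleIds = Files.readAllLines(Paths.get(configPath));
        for (String ruleId : ruleIds) {
            if (!ruleId.trim().isEmpty()) {
                langTool.disableRule(ruleId.trim());
            }
        }
    }
}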
But that’ll be at a later point…
Now, you might be wondering: “Where is the actual screenshot of this thing working in the GitLab UI?” Well, I haven’t gotten it to work there yet. Partially because the manual doesn’t have CI implemented yet (we’re waiting for KDE’s servers to be ready), and partially because I know nothing about CI, have barely got an idea of Java, and am kinda stuck?
But I can run it for myself now, so I can at least do some fixes myself. I put the code up here; bear in mind that I don’t remember how to use Java at all, so if I am committing Java sins, please be patient with me. Hopefully, if we can get this to work, we can greatly simplify how we handle style and grammar mistakes like these during review, as well as simplify contributor’s guides.