Index: doc/build.md
==================================================================
--- doc/build.md
+++ doc/build.md
@@ -2,15 +2,15 @@
 # How to build Grammalecte

 ## Required ##

 * Python 3.6
+* Firefox Developer
 * Firefox Nightly
 * NodeJS
 * npm
-  * jpm : https://developer.mozilla.org/en-US/Add-ons/SDK/Tools/jpm
-  * web-ext : https://developer.mozilla.org/fr/Add-ons/WebExtensions/Getting_started_with_web-ext
+  * web-ext : `https://developer.mozilla.org/fr/Add-ons/WebExtensions/Getting_started_with_web-ext`
 * Thunderbird

 ## Commands ##

@@ -18,11 +18,10 @@
 `make.py LANG`

 > Generate the LibreOffice extension and the package folder.
 > LANG is the lang code (ISO 639).
-
 > This script uses the file `config.ini` in the folder `gc_lang/LANG`.

 **First build**

 `make.py LANG -js`
@@ -52,17 +51,17 @@

 > Install the LibreOffice extension.

 `-fx --firefox`

-> Launch Firefox Developper (before Firefox 57).
-> Unit tests can be lanched from Firefox, with CTRL+SHIFT+F12.
+> Launch Firefox Developer.
+> Unit tests can be launched from the menu (Tests section).

 `-we --webext`

-> Launch Firefox Nightly (Firefox 57+).
-> Unit tests can be lanched from the menu.
+> Launch Firefox Nightly.
+> Unit tests can be launched from the menu (Tests section).

 `-tb --thunderbird`

 > Launch Thunderbird.

Index: doc/syntax.txt
==================================================================
--- doc/syntax.txt
+++ doc/syntax.txt
@@ -1,26 +1,47 @@
 WRITING RULES FOR GRAMMALECTE

-Note: This documentation is obsolete right now.
+Note: This documentation is a draft. Information may be obsolete.


 # Principles #

 Grammalecte is a bi-passes grammar checker engine. On the first pass, the
-engine checks the text paragraph by paragraph. On the second passe, the engine
+engine checks the text paragraph by paragraph. On the second pass, the engine
 check the text sentence by sentence.

 The command to switch to the second pass is `[++]`.

 In each pass, you can write as many rules as you need.
-A rule is defined by:
+There are two kinds of rules:
+
+* regex rules (triggered by a regular expression)
+* token rules (triggered by a succession of tokens)
+
+A regex rule is defined by:

 * [optional] flags “LCR” for the regex word boundaries and case sensitiveness
 * a regex pattern trigger
-* a list of actions (can’t be empty)
-* [optional] user option name for activating/disactivating the rule
-* [optional] rule name
+* a list of actions
+* [optional] option name (the rule is active only if the option defined by user or config is active)
+* [optional] rule name (named rules can be disabled by user or by config)
+
+A token rule is defined by:
+
+* rule name
+* one or several lists of tokens (triggers)
+* a list of actions (the action is active only if the option defined by user or config is active)
+
+Token rules must be defined within a graph.
+
+Each graph is defined within the second pass with the command:
+
+    @@@@GRAPH: graph_name
+
+A graph ends when another graph is defined or when the following command is found:
+
+    @@@@END_GRAPH

 There is no limit to the number of actions and the type of actions a rule
 can launch. Each action has its own condition to be triggered.

 There are three kind of actions:
@@ -35,14 +56,29 @@

 All these files are simple utf-8 text file.
 UTF-8 is mandatory.

+# Comments #

-# Rule syntax #
+Lines beginning with `#` are comments.

-    __LCR/option(rulename)__ pattern
+
+# End of file #
+
+With the command:
+
+    #END
+
+at the beginning of a line, the parser won’t go further.
+Whatever is written after it is considered a comment.
+
+
+# Regex rule syntax #
+
+    __LCR/option(rulename)__
+        pattern
         <<- condition ->> error_suggestions        # message_error|http://awebsite.net...
         <<- condition ~>> text_rewriting
         <<- condition =>> commands_for_disambiguation
         ...
@@ -139,31 +175,15 @@
 ____
     pattern
         <<- condition ->> replacement
         # message
         <<- condition ->> suggestion
         # message
-        <<- condition
-        ~>> text_rewriting
+        <<- condition ~>> text_rewriting
         <<- =>> disambiguation
 ____
     pattern
         <<- condition ->> replacement
         # message

-
-## Comments ##
-
-Lines beginning with # are comments.
-
-
-## End of file ##
-
-With the command:
-
-    #END
-
-at the beginning of a line, the compiler won’t go further.
-Whatever is written after will be considered as comments.
-

 ## Whitespaces at the border of patterns or suggestions ##

 Example: Recognize double or more spaces and suggests a single space:

@@ -187,10 +207,22 @@
 Example. Back reference in messages.

     (fooo) bar <<- ->> foo # “\1” should be:

+
+## Pattern matching ##
+
+Repeated pattern matching of a single rule continues after the previous match, so
+instead of general multiword patterns, like
+
+    (\w+) (\w+) <<- some_check(\1, \2) ->> \1, \2 # foo
+
+use
+
+    (\w+) <<- some_check(\1, word(1)) ->> \1, # foo
+

 ## Name definitions ##

 Grammalecte supports name definitions to simplify the description of the
 complex rules.

@@ -300,10 +332,70 @@
 You can also call Python expressions.

     __[s]__ Mr. ([a-z]\w+) <<- ~1>> =\1.upper()

+
+# Text preprocessing and multi-pass checking #
+
+On each pass, Lightproof uses rules written in the text preprocessor to modify
+the text internally before checking it.
+
+The text preprocessor is useful to simplify texts and write simpler checking
+rules.
+
+For example, sentences with the same grammar mistake:
+
+    These “cats” are blacks.
+    These cats are “blacks”.
+    These cats are absolutely blacks.
+    These stupid “cats” are all blacks.
+    These unknown cats are as per usual blacks.
+
+Instead of writing complex rules or several rules to find mistakes for all possible
+cases, you can use the text preprocessor to simplify the text.
+
+To remove the chars “”, write:
+
+    [“”] ->> *
+
+The * means: replace the matched text by whitespaces.
+
+Similarly to grammar rules, you can add conditions:
+
+    \w+ly <<- morph(\0, "adverb") ->> *
+
+You can also remove a group reference:
+
+    these (\w+) (\w+) <<- morph(\1, "adjective") and morph(\2, "noun") -1>> *
+    (am|are|is|were|was) (all) <<- -2>> *
+
+With these rules, you get the following sentences:
+
+    These cats are blacks.
+    These cats are blacks .
+    These cats are blacks.
+    These cats are blacks.
+    These cats are blacks.
+
+These grammar mistakes can be detected with one simple rule:
+
+    these +(\w+) +are +(\w+s)
+        <<- morph(\1, "noun") and morph(\2, "plural")
+        -2>> _                  # Adjectives are invariable.
+
+Instead of replacing text with whitespaces, you can replace text with @.
+
+    https?://\S+ ->> @
+
+This is useful if on the first pass you write rules that check successive whitespaces.
+@ characters are automatically removed on the second pass.
+
+You can also replace any text as you wish.
+
+    Mister <<- ->> Mr
+    (Mrs?)[.] <<- ->> \1
+

 # Disambiguation #

 When Grammalecte analyses a word with morph or morphex, before requesting the
 POS tags to the dictionary, it checks if there is a stored marker for the
@@ -389,30 +481,31 @@
 `textarea(regex[, neg_regex])`

 > checks if the full text of the checked area (paragraph or sentence) matches the regex.

-`morph(n, regex[, strict=True][, noword=False])`
+`morph(n, regex[, neg_regex][, no_word=False])`
+
+> checks if all tags of the word in group n match the regex.
+> if neg_regex = "*", returns True only if all morphologies match the regex.
+> if there is no word at position n, returns the value of no_word.
+
+`analyse(n, regex[, neg_regex][, no_word=False])`

 > checks if all tags of the word in group n match the regex.
-> if strict = False, returns True only if one of tags matches the regex.
-> if there is no word at position n, returns the value of noword.
-
-`morphex(n, regex, neg_regex[, noword=False])`
-
-> checks if one of the tags of the word in group n match the regex and
-> if no tags matches the neg_regex.
-> if there is no word at position n, returns the value of noword.
+> if neg_regex = "*", returns True only if all morphologies match the regex.
+> if there is no word at position n, returns the value of no_word.
+

 `option(option_name)`

 > returns True if option_name is activated else False

 Note: the analysis is done on the preprocessed text.


-## Default variables ##
+# Default variables #

 `sCountry`

 > It contains the current country locale of the checked paragraph.

@@ -420,98 +513,26 @@


 # Expressions in the suggestions #

-Suggestions (and warning messages) started by an equal sign are Python string expressions
+Suggestions started by an equal sign are Python string expressions
 extended with possible back references and named definitions:

 Example:

-    foo\w+ ->> = '"' + \0.upper() + '"'  # With uppercase letters and quoation marks
-
-All words beginning with "foo" will be recognized, and the suggestion is
-the uppercase form of the string with ASCII quoation marks: eg. foom ->> "FOOM".
-
-
-
-
-//////////////////////////////// OLD ///////////////////////////////////////
-
-= Text preprocessing and multi-passes checking =
-
-On each pass, Lightproof uses rules written in the text preprocessor to modify
-internally the text before checking the text.
-
-The text preprocessor is useful to simplify texts and write simplier checking
-rules.
-
-For example, sentences with the same grammar mistake:
-
-    These “cats” are blacks.
-    These cats are “blacks”.
-    These cats are absolutely blacks.
-    These stupid “cats” are all blacks.
-    These unknown cats are as per usual blacks.
-
-Instead of writting complex rules or several rules to find mistakes for all possible
-cases, you can use the text preprocessor to simplify the text.
-
-To remove the chars “”, write:
-
-    [“”] ->> *
-
-The * means: replace text by whitespaces.
-
-Similarly to grammar rules, you can add conditions:
-
-    \w+ly <<- morph(\0, "adverb") ->> *
-
-You can also remove a group reference:
-
-    these (\w+) (\w+) <<- morph(\1, "adjective") and morph(\2, "noun") -1>> *
-    (am|are|is|were|was) (all) <<- -2>> *
-
-With these rules, you get the following sentences:
-
-    These cats are blacks.
-    These cats are blacks .
-    These cats are blacks.
-    These cats are blacks.
-    These cats are blacks.
-
-These grammar mistakes can be detected with one simple rule:
-
-    these +(\w+) +are +(\w+s)
-        <<- morph(\1, "noun") and morph(\2, "plural")
-        -2>> _                  # Adjectives are invariable.
-
-Instead of replacing text with whitespaces, you can replace text with @.
-
-    https?://\S+ ->> @
-
-This is useful if at first pass you write rules to check successive whitespaces.
-@ are automatically removed at the second pass.
-
-You can also replace any text as you wish.
-
-    Mister <<- ->> Mr
-    (Mrs?)[.] <<- ->> \1
-
-
-
-With the multi-passes checking and the text preprocessor, it is advised to
-remove or simplify the text which has been checked on the previous pass.
-
-
-
-== Pattern matching ==
-
-Repeating pattern matching of a single rule continues after the previous matching, so
-instead of general multiword patterns, like
-
-(\w+) (\w+) <<- some_check(\1, \2) ->> \1, \2 # foo
-
-use
-
-(\w+) <<- some_check(\1, word(1)) ->> \1, # foo
-
+    <<- ->> = '"' + \1.upper() + '"'  # With uppercase letters and quotation marks
+
+
+# Token rules #
+
+Token rules must be defined within a graph.
+
+## Tokens ##
+
+Tokens can be defined in several ways:
+
+* Value (meaning the text of the token). Examples: `word`, ``, ``, `,`.
+* Lemma: `>lemma`
+* Regex: `~pattern`
+* Regex on morphologies: `@pattern`, `@pattern¬antipattern`.
+* Metatags: `*NAME`. Examples: `*WORD`, `*SIGN`, etc.
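The `doc/syntax.txt` section above explains that a preprocessor rule such as `[“”] ->> *` replaces matched text with whitespace rather than deleting it, so the text keeps its length and later passes still report correct character offsets. As an illustration only, here is a minimal Python sketch of that length-preserving replacement (`preprocess` is a hypothetical helper, not Grammalecte's actual implementation):

```python
import re

def preprocess(text, pattern, repl_char=" "):
    """Replace every match of pattern with repl_char repeated to the
    match's length, so the text keeps its original length/offsets."""
    return re.sub(pattern, lambda m: repl_char * len(m.group(0)), text)

# Strip typographic quotes, as the rule  [“”] ->> *  would do:
cleaned = preprocess("These “cats” are blacks.", r"[“”]")
print(cleaned)                  # each quote becomes one space
print(len(cleaned) == len("These “cats” are blacks."))  # length preserved

# Mask URLs with @, as the rule  https?://\S+ ->> @  would do:
print(preprocess("See https://example.com now", r"https?://\S+", "@"))
```

Because the replacement is the same length as the match, a rule firing on the simplified text on the second pass can still underline the right span in the original paragraph.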