Grammalecte  Check-in [564da6fd9a]

Overview
Comment: [build] rule names are now mandatory [doc] update
SHA3-256: 564da6fd9a651ab71eceba6bfd7d1a5b98c0e57fabc9f194b77649dd659efce6
User & Date: olr on 2020-04-16 17:28:26
Context
2020-04-16
19:43 [fr] adjustments, nr: don/dont and plus tôt/plutôt confusions (check-in: c11b3d912c, user: olr, tags: trunk, fr)
17:28 [build] rule names are now mandatory [doc] update (check-in: 564da6fd9a, user: olr, tags: trunk, build, doc)
16:36 [build][fr] variable sContext for regex rules too [doc] update (check-in: d31122ee4c, user: olr, tags: trunk, fr, build, doc)
Changes

Modified compile_rules.py from [9f8d749ce1] to [ed8f69534c].

Old version (lines 14-28):


dDEFINITIONS = {}
dDECLENSIONS = {}
lFUNCTIONS = []

aRULESET = set()     # set of rule-ids to check if there is several rules with the same id
nRULEWITHOUTNAME = 0

dJSREGEXES = {}

sWORDLIMITLEFT  = r"(?<![\w.,–-])"   # r"(?<![-.,—])\b"  seems slower
sWORDLIMITRIGHT = r"(?![\w–-])"      # r"\b(?!-—)"       seems slower









New version (lines 14-27):


dDEFINITIONS = {}
dDECLENSIONS = {}
lFUNCTIONS = []

aRULESET = set()     # set of rule-ids to check if there is several rules with the same id


dJSREGEXES = {}

sWORDLIMITLEFT  = r"(?<![\w.,–-])"   # r"(?<![-.,—])\b"  seems slower
sWORDLIMITRIGHT = r"(?![\w–-])"      # r"\b(?!-—)"       seems slower


Old version (lines 134-186):

        print(sRegex)
    return 0


def createRule (s, nIdLine, sLang, bParagraph, dOptPriority):
    "returns rule as list [option name, regex, bCaseInsensitive, identifier, list of actions]"
    global dJSREGEXES
    global nRULEWITHOUTNAME

    sLineId = "#" + str(nIdLine) + ("p" if bParagraph else "s")
    sRuleId = sLineId

    #### GRAPH CALL
    if s.startswith("@@@@"):
        if bParagraph:
            print("Error. Graph call can be made only after the first pass (sentence by sentence)")
            exit()
        return ["@@@@", s[4:], sLineId]

    #### OPTIONS
    sOption = False         # False or [a-z0-9]+ name
    nPriority = 4           # Default is 4, value must be between 0 and 9
    tGroups = None          # code for groups positioning (only useful for JavaScript)
    cCaseMode = 'i'         # i: case insensitive,  s: case sensitive,  u: uppercasing allowed
    cWordLimitLeft = '['    # [: word limit, <: no specific limit
    cWordLimitRight = ']'   # ]: word limit, >: no specific limit
    m = re.match("^__(?P<borders_and_case>[\\[<]\\w[\\]>])(?P<option>/[a-zA-Z0-9]+|)(?P<ruleid>\\(\\w+\\)|)(?P<priority>![0-9]|)__ *", s)
    if m:
        cWordLimitLeft = m.group('borders_and_case')[0]
        cCaseMode = m.group('borders_and_case')[1]
        cWordLimitRight = m.group('borders_and_case')[2]
        sOption = m.group('option')[1:]  if m.group('option')  else False
        if m.group('ruleid'):
            sRuleId =  m.group('ruleid')[1:-1]
            if sRuleId in aRULESET:
                print("# Error. Several rules have the same id: " + sRuleId)
                exit()
            aRULESET.add(sRuleId)
        else:
            nRULEWITHOUTNAME += 1
        nPriority = dOptPriority.get(sOption, 4)
        if m.group('priority'):
            nPriority = int(m.group('priority')[1:])
        s = s[m.end(0):]
    else:
        print("# Warning. No option defined at line: " + sLineId)


    #### REGEX TRIGGER
    i = s.find(" <<-")
    if i == -1:
        print("# Error: no condition at line " + sLineId)
        return None
    sRegex = s[:i].strip()







New version (lines 133-182):

        print(sRegex)
    return 0


def createRule (s, nIdLine, sLang, bParagraph, dOptPriority):
    "returns rule as list [option name, regex, bCaseInsensitive, identifier, list of actions]"
    global dJSREGEXES


    sLineId = "#" + str(nIdLine) + ("p" if bParagraph else "s")
    sRuleId = sLineId

    #### GRAPH CALL
    if s.startswith("@@@@"):
        if bParagraph:
            print("Error. Graph call can be made only after the first pass (sentence by sentence)")
            exit()
        return ["@@@@", s[4:], sLineId]

    #### OPTIONS
    sOption = False         # False or [a-z0-9]+ name
    nPriority = 4           # Default is 4, value must be between 0 and 9
    tGroups = None          # code for groups positioning (only useful for JavaScript)
    cCaseMode = 'i'         # i: case insensitive,  s: case sensitive,  u: uppercasing allowed
    cWordLimitLeft = '['    # [: word limit, <: no specific limit
    cWordLimitRight = ']'   # ]: word limit, >: no specific limit
    m = re.match("^__(?P<borders_and_case>[\\[<]\\w[\\]>])(?P<option>/[a-zA-Z0-9]+|)(?P<ruleid>\\(\\w+\\))(?P<priority>![0-9]|)__ *", s)
    if m:
        cWordLimitLeft = m.group('borders_and_case')[0]
        cCaseMode = m.group('borders_and_case')[1]
        cWordLimitRight = m.group('borders_and_case')[2]
        sOption = m.group('option')[1:]  if m.group('option')  else False

        sRuleId =  m.group('ruleid')[1:-1]
        if sRuleId in aRULESET:
            print("# Error. Several rules have the same id: " + sRuleId)
            exit()
        aRULESET.add(sRuleId)


        nPriority = dOptPriority.get(sOption, 4)
        if m.group('priority'):
            nPriority = int(m.group('priority')[1:])
        s = s[m.end(0):]
    else:
        print("# Warning. Rule wrongly shaped at line: " + sLineId)
        exit()

    #### REGEX TRIGGER
    i = s.find(" <<-")
    if i == -1:
        print("# Error: no condition at line " + sLineId)
        return None
    sRegex = s[:i].strip()
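
The effect of the regex change above can be tried in isolation. The following is a minimal sketch, not the code of `compile_rules.py`: the two patterns are copied from the old and new versions of `createRule` (written as raw strings), while the names `HEADER_OLD`, `HEADER_NEW` and `parse_header` are hypothetical helpers introduced only for this illustration.

```python
import re

# Header pattern from the old version: the "(ruleid)" part may be empty.
HEADER_OLD = re.compile(
    r"^__(?P<borders_and_case>[\[<]\w[\]>])(?P<option>/[a-zA-Z0-9]+|)"
    r"(?P<ruleid>\(\w+\)|)(?P<priority>![0-9]|)__ *"
)
# Header pattern from the new version: the "(ruleid)" part is required.
HEADER_NEW = re.compile(
    r"^__(?P<borders_and_case>[\[<]\w[\]>])(?P<option>/[a-zA-Z0-9]+|)"
    r"(?P<ruleid>\(\w+\))(?P<priority>![0-9]|)__ *"
)


def parse_header(line, pattern=HEADER_NEW):
    """Hypothetical helper: split a rule header into its parts, or return None."""
    m = pattern.match(line)
    if not m:
        return None
    flags = m.group("borders_and_case")
    return {
        "left": flags[0],
        "case": flags[1],
        "right": flags[2],
        "option": m.group("option")[1:] or False,
        "rule_id": m.group("ruleid")[1:-1] or None,
        "priority": int(m.group("priority")[1:]) if m.group("priority") else None,
        "rest": line[m.end():],
    }


if __name__ == "__main__":
    # Accepted by the old pattern only: no rule name.
    print(parse_header("__[i]__ foo <<- ->> bar", HEADER_OLD))
    print(parse_header("__[i]__ foo <<- ->> bar", HEADER_NEW))
    # Accepted by both patterns: named rule with option and priority.
    print(parse_header("__[s>/option1(name3)!6__ foo <<- ->> bar"))
```

With the new pattern an unnamed header no longer matches at all, which is why the `else` branch in the new version stops the build instead of printing a warning.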

Old version (lines 560-574):

            pass
        elif bGraph:
            lGraphRule.append([i, sLine])
        # Regex rules
        elif re.match("[  \t]*$", sLine):
            # empty line
            pass
        elif sLine.startswith(("    ", "\t")):
            # rule (continuation)
            lRuleLine[-1][1] += " " + sLine.strip()
        else:
            # new rule
            lRuleLine.append([i, sLine.strip()])

    # generating options files







New version (lines 556-570):

            pass
        elif bGraph:
            lGraphRule.append([i, sLine])
        # Regex rules
        elif re.match("[  \t]*$", sLine):
            # empty line
            pass
        elif sLine.startswith("    "):
            # rule (continuation)
            lRuleLine[-1][1] += " " + sLine.strip()
        else:
            # new rule
            lRuleLine.append([i, sLine.strip()])

    # generating options files
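
The branch shown in this hunk decides whether a source line starts a new rule or continues the previous one. A rough sketch of that grouping follows, under the assumption that line numbers start at 1; `group_rule_lines` is a hypothetical name, and the guard against a continuation line with no preceding rule is an addition that is not in the original code.

```python
def group_rule_lines(lines):
    """Hypothetical helper: merge indented continuation lines into their rule.

    Returns a list of [line_number, rule_text] pairs, mirroring the
    [i, sLine.strip()] entries appended to lRuleLine in the diff.
    """
    rules = []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            # empty line
            continue
        if line.startswith("    ") and rules:
            # rule (continuation): four leading spaces, as in the new version
            rules[-1][1] += " " + line.strip()
        else:
            # new rule
            rules.append([number, line.strip()])
    return rules


if __name__ == "__main__":
    source = [
        "__[s](rulename)__",
        "    foo bar",
        "        <<- ->> foo-bar # Missing hyphen.",
        "",
        "__<s>(other)__ x <<- ->> y # message",
    ]
    for entry in group_rule_lines(source):
        print(entry)
```

A line indented with a tab would start a new rule in this sketch, which matches the new version's test for four leading spaces only.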

Old version (lines 626-641):

        # JavaScript
        sJSCallables += "    {}: function ({})".format(sFuncName, sParams) + " {\n"
        sJSCallables += "        return " + jsconv.py2js(sReturn) + ";\n"
        sJSCallables += "    },\n"

    displayStats(lParagraphRules, lSentenceRules)

    print("Unnamed rules: " + str(nRULEWITHOUTNAME))

    dVars = {
        "fBuildTime": fBuildTime,
        "callables": sPyCallables,
        "callablesJS": sJSCallables,
        "gctests": sGCTests,
        "gctestsJS": sGCTestsJS,
        "paragraph_rules": mergeRulesByOption(lParagraphRules),







New version (lines 622-635):

        # JavaScript
        sJSCallables += "    {}: function ({})".format(sFuncName, sParams) + " {\n"
        sJSCallables += "        return " + jsconv.py2js(sReturn) + ";\n"
        sJSCallables += "    },\n"

    displayStats(lParagraphRules, lSentenceRules)



    dVars = {
        "fBuildTime": fBuildTime,
        "callables": sPyCallables,
        "callablesJS": sJSCallables,
        "gctests": sGCTests,
        "gctestsJS": sGCTestsJS,
        "paragraph_rules": mergeRulesByOption(lParagraphRules),

Modified doc/syntax.txt from [d2fc6c099d] to [22d5e1895e].

Old version (lines 80-94):

Patterns are written with the Python syntax for regular expressions:
http://docs.python.org/library/re.html

There can be one or several actions for each rule, executed following the order they are
written.

Optional: option, rulename, priority, condition, URL

LCR flags mean:

* L: Left boundary for the regex
* C: Case sensitiveness
* R: Right boundary for the regex








New version (lines 80-94):

Patterns are written with the Python syntax for regular expressions:
http://docs.python.org/library/re.html

There can be one or several actions for each rule, executed following the order they are
written.

Optional: option, priority, condition, URL

LCR flags mean:

* L: Left boundary for the regex
* C: Case sensitiveness
* R: Right boundary for the regex

Old version (lines 110-208):

>   `s`     case sensitive

>   `u`     uppercase allowed for lowercase characters

>>          i.e.:  "Word"  becomes  "W[oO][rR][dD]"

Examples:

    __[i]__
    __<s]__
    __[u>__
    __<s>__


User option activating/deactivating is possible with an option name placed
just after the LCR flags, i.e.:

    __[i]/option1__
    __[u]/option2__
    __[s>/option1__
    __<u>/option3__
    __<i>/option3__

Rules can be named:

    __[i]/option1(name1)__
    __[u]/option2(name2)__
    __[s>/option1(name3)__
    __<u>(name4)__
    __<i>(name5)__

Each rule name must be unique.


The LCR flags are also optional. If you don’t set these flags, the default LCR
flags will be:

    __[i]__

Example. Report “foo” in the text and suggest “bar”:

    foo <<- ->> bar         # Use bar instead of foo.

Example. Recognize and suggest missing hyphen and rewrite internally the text
with the hyphen:

    __[s]__
        foo bar
            <<- ->> foo-bar     # Missing hyphen.
            <<- ~>> foo-bar


### Simple-line or multi-line rules

Rules can be broken across multiple lines using leading spaces.
You should use 4 spaces.

Examples:

    __<s>__ pattern <<- condition ->> replacement # message

    __<s>__
        pattern
            <<- condition ->> replacement
            # message
            <<- condition ->> suggestion # message
            <<- condition ~>> text_rewriting
            <<- =>> disambiguation


### Whitespaces at the border of patterns or suggestions

Example: Recognize double or more spaces and suggests a single space:

    __<s>__  "  +" <<- ->> " "      # Remove extra space(s).

Characters `"` protect spaces in the pattern and in the replacement text.


### Pattern groups and back references

It is usually useful to retrieve parts of the matched pattern. We simply use
parenthesis in pattern to get groups with back references.

Example. Suggest a word with correct quotation marks:

    \"(\w+)\" <<- ->> “\1”      # Correct quotation marks.

Example. Suggest the missing space after the signs `!`, `?` or `.`:

    __<i]__  \b([?!.])([A-Z]+) <<- ->> \1 \2     # Missing space?

Example. Back reference in messages.

    (fooo) bar <<- ->> foo      # “\1” should be:


### Group positioning codes for JavaScript:







New version (lines 110-184):

>   `s`     case sensitive

>   `u`     uppercase allowed for lowercase characters

>>          i.e.:  "Word"  becomes  "W[oO][rR][dD]"

Examples: `[i]`, `<s]`, `[u>`, `<s>`
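
To make the `u` case mode concrete, here is a simplified sketch of the expansion described above, where "Word" becomes "W[oO][rR][dD]". The function name `allow_uppercase` is hypothetical, and the sketch assumes a plain literal pattern: it does not handle regex escapes, character classes or other metacharacters, which a real implementation would need to skip.

```python
import re


def allow_uppercase(pattern):
    """Replace each lowercase letter with a class matching both cases.

    Simplified illustration for literal patterns only.
    """
    parts = []
    for ch in pattern:
        upper = ch.upper()
        if ch.islower() and len(upper) == 1 and upper != ch:
            parts.append("[" + ch + upper + "]")
        else:
            parts.append(ch)
    return "".join(parts)


if __name__ == "__main__":
    expanded = allow_uppercase("Word")
    print(expanded)
    # The initial capital stays mandatory; the other letters accept both cases.
    print(bool(re.fullmatch(expanded, "WORD")))
    print(bool(re.fullmatch(expanded, "word")))
```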







User option activating/deactivating is possible with an option name placed
just after the LCR flags, i.e.:

    __[i]/option1(rulename1)__
    __[u]/option2(rulename2)__
    __[s>/option3(rulename3)__
    __<u>(rulename4)__
    __<i>(rulename5)__

Each rule name must be unique.


Example. Recognize and suggest missing hyphen and rewrite internally the text
with the hyphen:

    __[s](rulename)__
        foo bar
            <<- ->> foo-bar     # Missing hyphen.
            <<- ~>> foo-bar


### Simple-line or multi-line rules

Rules can be broken across multiple lines using leading spaces.
You should use 4 spaces.

Examples:

    __<s>(rulename)__ pattern <<- condition ->> replacement # message

    __<s>(rulename)__
        pattern
            <<- condition ->> replacement
            # message
            <<- condition ->> suggestion # message
            <<- condition ~>> text_rewriting
            <<- =>> disambiguation


### Whitespaces at the border of patterns or suggestions

Example: Recognize double or more spaces and suggests a single space:

    __<s>(rulename)__  "  +" <<- ->> " "      # Remove extra space(s).

Characters `"` protect spaces in the pattern and in the replacement text.


### Pattern groups and back references

It is usually useful to retrieve parts of the matched pattern. We simply use
parenthesis in pattern to get groups with back references.

Example. Suggest a word with correct quotation marks:

    \"(\w+)\" <<- ->> “\1”      # Correct quotation marks.

Example. Suggest the missing space after the signs `!`, `?` or `.`:

    \b([?!.])([A-Z]+) <<- ->> \1 \2     # Missing space?

Example. Back reference in messages.

    (fooo) bar <<- ->> foo      # “\1” should be:
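
Since patterns follow Python regular expression syntax, the back references in the three examples above behave like those of the standard `re` module. The sketch below reproduces them with plain `re` calls; the function names are hypothetical, and how Grammalecte itself applies suggestions and messages is not shown in this diff.

```python
import re


def fix_quotes(text):
    """Wrap a quoted word in typographic quotation marks, reusing group 1."""
    return re.sub(r'"(\w+)"', r"“\1”", text)


def add_missing_space(text):
    """Insert a space between a sign (group 1) and the capitals after it (group 2)."""
    return re.sub(r"\b([?!.])([A-Z]+)", r"\1 \2", text)


def build_message(text):
    """Expand a back reference inside a message template."""
    m = re.search(r"(fooo) bar", text)
    return m.expand(r"“\1” should be:") if m else None


if __name__ == "__main__":
    print(fix_quotes('say "hello" now'))
    print(add_missing_space("Stop.Go"))
    print(build_message("a fooo bar here"))
```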


### Group positioning codes for JavaScript:

Old version (lines 226-244):

    " ([?!;])"  @@1


### Pattern matching

When a rule matches several times, each search resumes after the previous match, so instead of general multiword patterns such as

        (\w+) (\w+) <<- some_check(\1, \2) ->> \1, \2 # foo

use

        (\w+) <<- some_check(\1, word(1)) ->> \1, # foo


## TOKEN RULES ##

Token rules must be defined within a graph.

### Token rules syntax







New version (lines 202-220):

    " ([?!;])"  @@1


### Pattern matching

When a rule matches several times, each search resumes after the previous match, so instead of general multiword patterns such as

    (\w+) (\w+) <<- some_check(\1, \2) ->> \1, \2 # foo

use

    (\w+) <<- some_check(\1, word(1)) ->> \1, # foo
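
The reason for this advice can be seen with Python's own non-overlapping matching. In the sketch below, `next_word` is a hypothetical stand-in for the `word(1)` call used in the rule; it only illustrates the idea of looking at the following word without consuming it.

```python
import re

TEXT = "one two three four"


def next_word(text, end):
    """Hypothetical stand-in for word(1): the word following position `end`."""
    m = re.match(r"\s+(\w+)", text[end:])
    return m.group(1) if m else None


def pairs_with_two_groups(text):
    """Two-word pattern: each match consumes both words."""
    return [(m.group(1), m.group(2)) for m in re.finditer(r"(\w+) (\w+)", text)]


def pairs_with_lookup(text):
    """One-word pattern plus a lookup of the following word."""
    return [(m.group(1), next_word(text, m.end())) for m in re.finditer(r"(\w+)", text)]


if __name__ == "__main__":
    # The pair ("two", "three") is never examined here.
    print(pairs_with_two_groups(TEXT))
    # Every adjacent pair is examined here.
    print(pairs_with_lookup(TEXT))
```

The two-word pattern skips every other pair because the second word of one match cannot be the first word of the next.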


## TOKEN RULES ##

Token rules must be defined within a graph.

### Token rules syntax