Grammalecte Check-in [2dbf497b04]

Overview
Comment:[graphspell] tokenizer: add lMorph to <start> and <end> tokens
SHA3-256: 2dbf497b0475e7d7bfb33fc484de7cef295cfbe211c56331e5b39c1ca6f767f2
User & Date: olr on 2018-06-29 22:46:33
Context
2018-06-30 00:19  [core] gc engine: function for testing token value (check-in: 8289f6c423, user: olr, tags: core, rg)
2018-06-29 22:46  [graphspell] tokenizer: add lMorph to <start> and <end> tokens (check-in: 2dbf497b04, user: olr, tags: graphspell, rg)
2018-06-29 22:43  [fr] conversion: regex rules -> graph rules (check-in: e7335f789f, user: olr, tags: fr, rg)
Changes

Modified graphspell/tokenizer.py from [a1211301ce] to [2adea5dc85].

            self.sLang = "default"
        self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[sLang]) )

    def genTokens (self, sText, bStartEndToken=False):
        "generator: tokenize <sText>"
        i = 0
        if bStartEndToken:
-            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0 }
+            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0, "lMorph": ["<start>"] }
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield { "i": i, "sType": m.lastgroup, "sValue": m.group(), "nStart": m.start(), "nEnd": m.end() }
        if bStartEndToken:
            iEnd = len(sText)
-            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd }
+            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd, "lMorph": ["<end>"] }