Overview
| Comment: | [graphspell] tokenizer: add option for <start> and <end> tokens |
|---|---|
| SHA3-256: | 3339da64247027f881ef64d14bb66215 |
| User & Date: | olr on 2018-06-02 13:47:06 |
Context
| Date | Time | Comment | Check-in | User | Tags |
|---|---|---|---|---|---|
| 2018-06-02 | 14:01 | [core] token offset for correct token positioning | 38cd64c0b9 | olr | core, rg |
| 2018-06-02 | 13:47 | [graphspell] tokenizer: add option for <start> and <end> tokens | 3339da6424 | olr | graphspell, rg |
| 2018-06-01 | 10:51 | [core] gc engine update | 102180fb1d | olr | core, rg |
Changes
Modified graphspell/tokenizer.py from [b3cbfe75ea] to [b723a02695].
Old version (lines 40-46):

    def __init__ (self, sLang):
        self.sLang = sLang
        if sLang not in _PATTERNS:
            self.sLang = "default"
        self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[sLang]) )

New version (lines 40-54):

    def __init__ (self, sLang):
        self.sLang = sLang
        if sLang not in _PATTERNS:
            self.sLang = "default"
        self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[sLang]) )

    def genTokens (self, sText, bStartEndToken=False):
        if bStartEndToken:
            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0 }
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield { "i": i, "sType": m.lastgroup, "sValue": m.group(), "nStart": m.start(), "nEnd": m.end() }
        if bStartEndToken:
            iEnd = len(sText)
            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd }
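The sketch below is a self-contained illustration of the `bStartEndToken` option, not the project's actual module. The `_PATTERNS` table is not part of this diff, so the one used here (with its `NUM`, `WORD` and `PUNC` groups) is a hypothetical stand-in. The sketch also departs from the committed code in two places, both marked in comments: it looks up patterns with the resolved `self.sLang` rather than the raw `sLang` argument, and it initialises `i` so that the `<end>` token can still be emitted when the text yields no match.

```python
import re

# Hypothetical stand-in for the module's _PATTERNS table (not shown in the diff).
# Named groups are required because genTokens reports m.lastgroup as the type.
_PATTERNS = {
    "default": (
        r"(?P<NUM>\d+)",
        r"(?P<WORD>\w+)",
        r"(?P<PUNC>[^\w\s])",
    )
}


class Tokenizer:

    def __init__(self, sLang):
        # Assumption: fall back to "default" and use the resolved key for the
        # lookup, so that an unknown language code does not raise KeyError.
        self.sLang = sLang if sLang in _PATTERNS else "default"
        self.zToken = re.compile("(?i)" + "|".join(_PATTERNS[self.sLang]))

    def genTokens(self, sText, bStartEndToken=False):
        # Assumption: pre-set i so the <end> index is defined for empty input.
        i = 0
        if bStartEndToken:
            yield {"i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0}
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield {"i": i, "sType": m.lastgroup, "sValue": m.group(),
                   "nStart": m.start(), "nEnd": m.end()}
        if bStartEndToken:
            iEnd = len(sText)
            yield {"i": i + 1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd}


if __name__ == "__main__":
    oTokenizer = Tokenizer("fr")  # "fr" is unknown here, so "default" is used
    for dToken in oTokenizer.genTokens("Hello, world 42", bStartEndToken=True):
        print(dToken)
```

With the option enabled, the two `INFO` tokens are zero-width: `<start>` sits at offset 0 with index 0, and `<end>` sits at `len(sText)` with the index following the last real token. Real tokens keep their 1-based indices whether or not the option is set.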