Differences From Artifact [b723a02695]:
- File graphspell/tokenizer.py — part of check-in [3339da6424] at 2018-06-02 13:47:06 on branch rg — [graphspell] tokenizer: add option for <start> and <end> tokens (user: olr, size: 2495)
To Artifact [30951f1c9c]:
- File graphspell/tokenizer.py — part of check-in [cca3887aad] at 2018-06-12 11:24:50 on branch rg — [core] text processor: communication between regex rules and graph rules + [graphspell][bug] tokenizer: set i variable to 0, if sentence is empty (user: olr, size: 2509)
    def __init__ (self, sLang):
        self.sLang = sLang
        if sLang not in _PATTERNS:
            # unknown language: fall back to the default pattern set
            self.sLang = "default"
        self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[self.sLang]) )

    def genTokens (self, sText, bStartEndToken=False):
        i = 0
        if bStartEndToken:
            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0 }
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield { "i": i, "sType": m.lastgroup, "sValue": m.group(), "nStart": m.start(), "nEnd": m.end() }
        if bStartEndToken:
            iEnd = len(sText)
            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd }
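The point of the check-in is the `i = 0` initialization: `finditer` yields nothing for an empty sentence, so without it the `<end>` token's `i+1` would raise a NameError. A minimal runnable sketch of the behavior, assuming a stand-in `_PATTERNS` table (the real table in graphspell is larger and per-language):

```python
import re

# Stand-in pattern table; the real _PATTERNS in graphspell defines many
# more named groups per language. This is an assumption for illustration.
_PATTERNS = {
    "default": [ r"(?P<WORD>\w+)", r"(?P<PUNCT>[.,;:!?])" ]
}

class Tokenizer:
    def __init__ (self, sLang):
        self.sLang = sLang
        if sLang not in _PATTERNS:
            self.sLang = "default"
        self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[self.sLang]) )

    def genTokens (self, sText, bStartEndToken=False):
        i = 0  # the fix: keeps i defined when sText yields no match
        if bStartEndToken:
            yield { "i": 0, "sType": "INFO", "sValue": "<start>", "nStart": 0, "nEnd": 0 }
        for i, m in enumerate(self.zToken.finditer(sText), 1):
            yield { "i": i, "sType": m.lastgroup, "sValue": m.group(), "nStart": m.start(), "nEnd": m.end() }
        if bStartEndToken:
            iEnd = len(sText)
            yield { "i": i+1, "sType": "INFO", "sValue": "<end>", "nStart": iEnd, "nEnd": iEnd }

# Empty sentence: only the <start> and <end> INFO tokens are produced,
# and <end> is correctly numbered 1 instead of crashing.
lTokens = list(Tokenizer("xx").genTokens("", bStartEndToken=True))

# Normal sentence, for comparison:
lWords = [ d["sValue"] for d in Tokenizer("xx").genTokens("Hello, world!") ]
# -> ['Hello', ',', 'world', '!']
```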