Overview
Comment: [graphspell] tokenizer: add token index and avoid punctuations aggregation
SHA3-256: be6d99bbdc1bc63d0e212605b25292b4
User & Date: olr on 2018-05-18 13:11:15
Context
2018-05-19
    14:06  [build][core] merge actions in key <rules> + code clarification (check-in: a59fbc32a0, user: olr, tags: core, build, rg)
2018-05-18
    13:11  [graphspell] tokenizer: add token index and avoid punctuations aggregation (check-in: be6d99bbdc, user: olr, tags: graphspell, rg)
2018-05-17
    09:09  [build][core] use 1 instead of empty string for specific tags (check-in: 1895dda13e, user: olr, tags: core, build, rg)
Changes
Modified graphspell-js/tokenizer.js from [bdd895b918] to [9bd60cca8a].
︙
(old lines 14-20, new lines 14-44)

    const aTkzPatterns = {
        // All regexps must start with ^.
        "default": [
            [/^[ \t]+/, 'SPACE'],
            [/^\/(?:~|bin|boot|dev|etc|home|lib|mnt|opt|root|sbin|tmp|usr|var|Bureau|Documents|Images|Musique|Public|Téléchargements|Vidéos)(?:\/[a-zA-Zà-öÀ-Ö0-9ø-ÿØ-ßĀ-ʯfi-st_.()-]+)*/, 'FOLDERUNIX'],
            [/^[a-zA-Z]:\\(?:Program Files(?: \(x86\)|)|[a-zA-Zà-öÀ-Ö0-9ø-ÿØ-ßĀ-ʯfi-st.()]+)(?:\\[a-zA-Zà-öÀ-Ö0-9ø-ÿØ-ßĀ-ʯfi-st_.()-]+)*/, 'FOLDERWIN'],
︙
(old lines 58-64, new lines 58-91)

                this.sLang = "default";
            }
            this.aRules = aTkzPatterns[this.sLang];
        }

        * genTokens (sText) {
            let m;
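The JavaScript pattern table requires every regexp to start with ^, which suggests a tokenizer that repeatedly matches at the head of the remaining text and then cuts off the matched part. The following is a minimal Python sketch of that loop under stated assumptions, not the code of this check-in: the rule table `RULES`, the function name `gen_tokens`, the token fields, the zero-based index, and the skipping of SPACE tokens are all hypothetical.

```python
import re

# Hypothetical rule table: (compiled regex, token type). Every pattern is
# anchored with ^ so it can only match at the head of the remaining text.
RULES = [
    (re.compile(r"^[ \t]+"), "SPACE"),
    (re.compile(r"^\d+"), "NUM"),
    (re.compile(r"^\w+"), "WORD"),
    (re.compile(r"^[.,;:!?]"), "PUNC"),  # one character only, no aggregation
]


def gen_tokens(text):
    """Yield token dicts by matching rules at the head of the remaining text."""
    i_token = 0  # running token index (assumed to start at 0)
    n_pos = 0    # offset of the remaining text within the original string
    while text:
        n_cut = 1  # if no rule matches, skip a single character
        for pattern, s_type in RULES:
            m = pattern.match(text)
            if m:
                if s_type != "SPACE":  # assumption: spaces are not emitted
                    yield {
                        "i": i_token,
                        "sType": s_type,
                        "sValue": m.group(0),
                        "nStart": n_pos,
                        "nEnd": n_pos + m.end(),
                    }
                    i_token += 1
                n_cut = m.end()
                break
        n_pos += n_cut
        text = text[n_cut:]


for token in gen_tokens("Hi, you!?"):
    print(token)
```

Because the PUNC rule matches exactly one character, "!?" comes out as two tokens with consecutive indices rather than one aggregated token.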
Modified graphspell/tokenizer.py from [17f452887e] to [b3cbfe75ea].
(old lines 1-9, new lines 1-31)

    # Very simple tokenizer

    import re

    _PATTERNS = {
        "default": (
            r'(?P<FOLDERUNIX>/(?:bin|boot|dev|etc|home|lib|mnt|opt|root|sbin|tmp|usr|var|Bureau|Documents|Images|Musique|Public|Téléchargements|Vidéos)(?:/[\w.()-]+)*)',
            r'(?P<FOLDERWIN>[a-zA-Z]:\\(?:Program Files(?: [(]x86[)]|)|[\w.()]+)(?:\\[\w.()-]+)*)',
︙
(old lines 41-47, new lines 41-49)

        def __init__ (self, sLang):
            self.sLang = sLang
            if sLang not in _PATTERNS:
                self.sLang = "default"
            self.zToken = re.compile( "(?i)" + '|'.join(sRegex for sRegex in _PATTERNS[sLang]) )

        def genTokens (self, sText):
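The Python side joins named-group patterns into a single alternation, so the token type can be read back from the name of the group that matched. The sketch below illustrates that technique together with a running token index. It is an assumption-laden illustration rather than the body of `genTokens` in this check-in: the reduced `_PATTERNS` table, the token fields, and the zero-based index are hypothetical.

```python
import re

# Hypothetical, much-reduced pattern table; each alternative is a named group.
_PATTERNS = {
    "default": (
        r"(?P<NUM>\d+)",
        r"(?P<WORD>\w+)",
        r"(?P<PUNC>[.,;:!?])",  # single character: "?!" yields two tokens
    )
}


class Tokenizer:
    def __init__(self, sLang):
        # Fall back to "default" and use the resolved key for the lookup.
        self.sLang = sLang if sLang in _PATTERNS else "default"
        self.zToken = re.compile("(?i)" + "|".join(_PATTERNS[self.sLang]))

    def genTokens(self, sText):
        # enumerate() supplies the token index (assumed to start at 0);
        # m.lastgroup is the name of the alternative that matched.
        for i, m in enumerate(self.zToken.finditer(sText)):
            yield {
                "i": i,
                "sType": m.lastgroup,
                "sValue": m.group(),
                "nStart": m.start(),
                "nEnd": m.end(),
            }


for dToken in Tokenizer("fr").genTokens("Wait... 42?!"):
    print(dToken)
```

One design note: the `__init__` shown in the diff looks up `_PATTERNS[sLang]` after the fallback assignment, so an unknown language code would raise `KeyError` there. The sketch indexes with `self.sLang` so the fallback actually takes effect.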