Overview
Comment: | [graphspell] tokenizer: combining diacritics recognition and NFC normalization |
---|---|
Downloads: | Tarball | ZIP archive | SQL archive |
Timelines: | family | ancestors | descendants | both | trunk | graphspell |
Files: | files | file ages | folders |
SHA3-256: |
3ef2bdb736e458181f90ed3aef50d68c |
User & Date: | olr on 2020-04-20 18:02:29 |
Other Links: | manifest | tags |
Context
2020-04-21
| ||
08:11 | [fr] ajustements check-in: 014e846ccc user: olr tags: trunk, fr, v1.9.0 | |
2020-04-20
| ||
18:02 | [graphspell] tokenizer: combining diacritics recognition and NFC normalization check-in: 3ef2bdb736 user: olr tags: trunk, graphspell | |
17:44 | [fr] ajustements check-in: 04d726b729 user: olr tags: trunk, fr | |
Changes
Modified gc_lang/fr/perf_memo.txt from [70dc62f0cc] to [9508c4062f].
︙ | |||
24 25 26 27 28 29 30 | 24 25 26 27 28 29 30 31 32 | + + | 0.5.16 2017.05.12 07:41 4.92201 1.19269 0.80639 0.239147 0.257518 0.266523 0.62111 0.33359 0.0634668 0.00757178 0.6.1 2018.02.12 09:58 5.25924 1.2649 0.878442 0.257465 0.280558 0.293903 0.686887 0.391275 0.0672474 0.00824723 0.6.2 2018.02.19 19:06 5.51302 1.29359 0.874157 0.260415 0.271596 0.290641 0.684754 0.376905 0.0815201 0.00919633 (spelling normalization) 1.0 2018.11.23 10:59 2.88577 0.702486 0.485648 0.139897 0.14079 0.148125 0.348751 0.201061 0.0360297 0.0043535 (x2, with new GC engine) 1.1 2019.05.16 09:42 1.50743 0.360923 0.261113 0.0749272 0.0763827 0.0771537 0.180504 0.102942 0.0182762 0.0021925 (×2, but new processor: AMD Ryzen 7 2700X) 1.2.1 2019.08.06 20:57 1.42886 0.358425 0.247356 0.0704405 0.0754886 0.0765604 0.177197 0.0988517 0.0188103 0.0020243 1.6.0 2020.01.03 20:22 1.38847 0.346214 0.240242 0.0709539 0.0737499 0.0748733 0.176477 0.0969171 0.0187857 0.0025143 (nouveau dictionnaire avec lemmes masculin) 1.9.0 2020.04.20 19:57 1.51183 0.369546 0.25681 0.0734314 0.0764396 0.0785668 0.183922 0.103674 0.0185812 0.002099 (NFC normalization) |
Modified graphspell-js/tokenizer.js from [16e7826100] to [efabea9cdf].
︙ | |||
40 41 42 43 44 45 46 | 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | - + | [/^\[\/?[a-zA-Z]+\]/, 'PSEUDOHTML'], [/^&\w+;(?:\w+;|)/, 'HTMLENTITY'], [/^(?:l|d|n|m|t|s|j|c|ç|lorsqu|puisqu|jusqu|quoiqu|qu|presqu|quelqu)['’´‘′`ʼ]/i, 'WORD_ELIDED'], [/^\d\d?[h:]\d\d(?:[m:]\d\ds?|)\b/, 'HOUR'], [/^\d+(?:ers?\b|res?\b|è[rm]es?\b|i[èe][mr]es?\b|de?s?\b|nde?s?\b|ès?\b|es?\b|ᵉʳˢ?|ʳᵉˢ?|ᵈᵉ?ˢ?|ⁿᵈᵉ?ˢ?|ᵉˢ?)/, 'WORD_ORDINAL'], [/^\d+(?:[.,]\d+|)/, 'NUM'], [/^[&%‰€$+±=*/<>⩾⩽#|×¥£§¢¬÷@-]/, 'SIGN'], |
︙ | |||
70 71 72 73 74 75 76 | 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 | - + | while (sText) { let iCut = 1; for (let [zRegex, sType] of this.aRules) { if (sType !== "SPACE" || bWithSpaces) { try { if ((m = zRegex.exec(sText)) !== null) { iToken += 1; |
︙ |
Modified graphspell/tokenizer.py from [a2c42f5f3e] to [81da836011].
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | + + | """ Very simple tokenizer using regular expressions """ import re from unicodedata import normalize _PATTERNS = { "default": ( r'(?P<FOLDERUNIX>/(?:bin|boot|dev|etc|home|lib|mnt|opt|root|sbin|tmp|usr|var|Bureau|Documents|Images|Musique|Public|Téléchargements|Vidéos)(?:/[\w.()-]+)*)', r'(?P<FOLDERWIN>[a-zA-Z]:\\(?:Program Files(?: [(]x86[)]|)|[\w.()]+)(?:\\[\w.()-]+)*)', r'(?P<PUNC>[][,.;:!?…«»“”‘’"(){}·–—¿¡])', r'(?P<WORD_ACRONYM>[A-Z][.][A-Z][.](?:[A-Z][.])*)', |
︙ | |||
32 33 34 35 36 37 38 | 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | - + - + | r'(?P<HTML><\w+.*?>|</\w+ *>)', r'(?P<PSEUDOHTML>\[/?\w+\])', r"(?P<WORD_ELIDED>(?:l|d|n|m|t|s|j|c|ç|lorsqu|puisqu|jusqu|quoiqu|qu|presqu|quelqu)['’´‘′`ʼ])", r'(?P<WORD_ORDINAL>\d+(?:ers?|res?|è[rm]es?|i[èe][mr]es?|de?s?|nde?s?|ès?|es?|ᵉʳˢ?|ʳᵉˢ?|ᵈᵉ?ˢ?|ⁿᵈᵉ?ˢ?|ᵉˢ?)\b)', r'(?P<HOUR>\d\d?[h:]\d\d(?:[m:]\d\ds?|)\b)', r'(?P<NUM>\d+(?:[.,]\d+|))', r'(?P<SIGN>[&%‰€$+±=*/<>⩾⩽#|×¥£¢§¬÷@-])', |