Differences From Artifact [01eee1eb89]:
- File gc_lang/fr/build_data.py, part of check-in [685f9128f0] at 2017-11-02 11:01:33 on branch Lexicographe: [fr] restructured the data to avoid confusion with the <:> token (user: olr, size: 13982)
To Artifact [1e628c0406]:
- File gc_lang/fr/build_data.py, part of check-in [1dac73beb8] at 2017-11-03 19:43:18 on branch Lexicographe: [build][fr] build_data for locution: __END__ to skip what come after (user: olr, size: 14031)
︙
@@ -315,14 +315,16 @@
  def makeLocutions (sp, bJS=False):
      "compile list of locutions in JSON"
      print("> Locutions ", end="")
      print("(Python et JavaScript)" if bJS else "(Python seulement)")
      dLocGraph = {}
      oTokenizer = tkz.Tokenizer("fr")
      for sLine in itertools.chain(readFile(sp+"/data/locutions.txt"), readFile(sp+"/data/locutions_vrac.txt")):
+         if sLine == "__END__":
+             break
          dCur = dLocGraph
          sLoc, sTag = sLine.split("\t")
          for oToken in oTokenizer.genTokens(sLoc.strip()):
              sWord = oToken["sValue"]
              if sWord not in dCur:
                  dCur[sWord] = {}
              dCur = dCur[sWord]
︙
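
The hunk above adds an `__END__` sentinel check inside a loop over `itertools.chain(...)`. The following is a minimal sketch of that pattern, not the project's code: the name `build_locution_graph` is hypothetical, a plain whitespace split stands in for `tkz.Tokenizer`, the sample lines are invented, and the tag is left unused because its handling lies outside the visible hunk. One behavior worth noting is that `break` ends the whole chained iteration, so a sentinel in the first source also skips every later source.

```python
import itertools


def build_locution_graph(*line_sources):
    """Build a nested-dict word graph from 'locution<TAB>tag' lines.

    Hypothetical stand-in for the loop in makeLocutions: reading stops at
    the first "__END__" line found in any of the chained sources.
    """
    graph = {}
    for line in itertools.chain(*line_sources):
        if line == "__END__":
            # Exits the chained iterator, so later sources are skipped too.
            break
        node = graph
        locution, _tag = line.split("\t")  # tag handling is not shown in the hunk
        # Assumption: whitespace split replaces the real tokenizer.
        for word in locution.strip().split():
            if word not in node:
                node[word] = {}
            node = node[word]
    return graph


# Invented sample data, for illustration only.
first_source = ["tout de suite\tTAG", "__END__", "never read\tTAG"]
second_source = ["au fur\tTAG"]

graph = build_locution_graph(first_source, second_source)
print(graph)
```

If each file should honor its own sentinel independently, the check would have to be applied per source before chaining; whether that is wanted here is not stated in the check-in.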