Overview
Comment: | [graphspell][py] dawg: ability to build lexicon directly from a list of tuples |
---|---|
Downloads: | Tarball | ZIP archive | SQL archive |
Timelines: | family | ancestors | descendants | both | graphspell | multid |
Files: | files | file ages | folders |
SHA3-256: |
c65e57833800132efb1e40769607685a |
User & Date: | olr on 2018-02-27 18:07:44 |
Original Comment: | [graphspell][py] ability to build lexicon directly from a list of tuples |
Other Links: | branch diff | manifest | tags |
Context
2018-02-27
| ||
20:50 | [graphspell][py] dawg: API modifications + add function to get dictionary as JSON check-in: 8a0391b163 user: olr tags: graphspell, multid | |
18:07 | [graphspell][py] dawg: ability to build lexicon directly from a list of tuples check-in: c65e578338 user: olr tags: graphspell, multid | |
17:23 | [lo] update: lexicon editor check-in: 1d5fe44fe8 user: olr tags: lo, multid | |
Changes
Modified graphspell/dawg.py from [96443fe4a2] to [059d031769].
︙ | ︙ | |||
23 24 25 26 27 28 29 | def readFile (spf): print(" < Read lexicon: " + spf) if os.path.isfile(spf): with open(spf, "r", encoding="utf-8") as hSrc: for sLine in hSrc: sLine = sLine.strip() if sLine and not sLine.startswith("#"): | | | | > | > > | | 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | def readFile (spf): print(" < Read lexicon: " + spf) if os.path.isfile(spf): with open(spf, "r", encoding="utf-8") as hSrc: for sLine in hSrc: sLine = sLine.strip() if sLine and not sLine.startswith("#"): yield sLine.split("\t") else: raise OSError("# Error. File not found or not loadable: " + spf) class DAWG: """DIRECT ACYCLIC WORD GRAPH""" # This code is inspired from Steve Hanov’s DAWG, 2011. (http://stevehanov.ca/blog/index.php?id=115) # We store suffix/affix codes and tags within the graph after the “real” word. # A word is a list of numbers [ c1, c2, c3 . . . cN, iAffix, iTags] # Each arc is an index in self.lArcVal, where are stored characters, suffix/affix codes for stemming and tags. # Important: As usual, the last node (after ‘iTags’) is tagged final, AND the node after ‘cN’ is ALSO tagged final. def __init__ (self, src, cStemming, sLangCode, sLangName="", sDicName=""): print("===== Direct Acyclic Word Graph - Minimal Acyclic Finite State Automaton =====") cStemming = cStemming.upper() if cStemming == "A": funcStemmingGen = st.defineAffixCode elif cStemming == "S": funcStemmingGen = st.defineSuffixCode elif cStemming == "N": funcStemmingGen = st.noStemming else: raise ValueError("# Error. Unknown stemming code: {}".format(cStemming)) lEntry = [] lChar = ['']; dChar = {}; nChar = 1; dCharOccur = {} lAff = []; dAff = {}; nAff = 0; dAffOccur = {} lTag = []; dTag = {}; nTag = 0; dTagOccur = {} nErr = 0 # read lexicon if type(src) is str: iterable = readFile(src) else: iterable = src for sFlex, sStem, sTag in iterable: addWordToCharDict(sFlex) # chars for c in sFlex: if c not in dChar: dChar[c] = nChar lChar.append(c) nChar += 1 |
︙ | ︙ | |||
93 94 95 96 97 98 99 | lWord = [ [dChar[c] for c in sFlex] + [iAff+nChar] + [iTag+nChar+nAff] for sFlex, iAff, iTag in lEntry ] lEntry = None # Dictionary of arc values occurrency, to sort arcs of each node dValOccur = dict( [ (dChar[c], dCharOccur[c]) for c in dChar ] \ + [ (dAff[aff]+nChar, dAffOccur[aff]) for aff in dAff ] \ + [ (dTag[tag]+nChar+nAff, dTagOccur[tag]) for tag in dTag ] ) | < < < < | | 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | lWord = [ [dChar[c] for c in sFlex] + [iAff+nChar] + [iTag+nChar+nAff] for sFlex, iAff, iTag in lEntry ] lEntry = None # Dictionary of arc values occurrency, to sort arcs of each node dValOccur = dict( [ (dChar[c], dCharOccur[c]) for c in dChar ] \ + [ (dAff[aff]+nChar, dAffOccur[aff]) for aff in dAff ] \ + [ (dTag[tag]+nChar+nAff, dTagOccur[tag]) for tag in dTag ] ) self.sFileName = src if type(src) is str else "[None]" self.sLangCode = sLangCode self.sLangName = sLangName self.sDicName = sDicName self.nEntry = len(lWord) self.aPreviousEntry = [] DawgNode.resetNextId() self.oRoot = DawgNode() |
︙ | ︙ |