U
M±d2ã@s°ddlZddlmZddlmZddlmZddlmZm    Z    m
Z
m Z m Z ddl mZddlmZmZmZmZdd    lmZdd
lmZdd lmZmZmZmZmZee
ed œd d„Zee
edœdd„Z eƒee
ed œdd„ƒZ!eƒee
ed œdd„ƒZ"eedee e#e#fdœdd„ƒZ$d.e
ee#e
edœdd„Z%ee
ee&dœdd „Z'ee
ed!œd"d#„Z(e
eed$œd%d&„Z)eed$œd'd(„Z*ed)dd/ee&e eed+œd,d-„ƒZ+dS)0éN)ÚIncrementalDecoder)ÚCounter)Ú    lru_cache)rÚDictÚListÚOptionalÚTupleé)Ú FREQUENCIES)ÚKO_NAMESÚLANGUAGE_SUPPORTED_COUNTÚTOO_SMALL_SEQUENCEÚZH_NAMES)Ú is_suspiciously_successive_range)ÚCoherenceMatches)Úis_accentuatedÚis_latinÚis_multi_byte_encodingÚis_unicode_range_secondaryÚ unicode_range)Ú    iana_nameÚreturncs¶t|ƒrtdƒ‚t d |¡¡j}|dd}i‰d‰tddƒD]^}| t|gƒ¡}|r>t    |ƒ}|dkrhq>t
|ƒd    kr>|ˆkr„dˆ|<ˆ|d
7<ˆd
7‰q>t ‡‡fd d „ˆDƒƒS) zF
    Return associated unicode ranges in a single byte code page.
    z.Function not supported on multi-byte code pagez encodings.{}Úignore)Úerrorsré@éÿNFr    cs g|]}ˆ|ˆdkr|‘qS)g333333Ã?©)Ú.0Úcharacter_range©Úcharacter_countZ seen_rangesrúLd:\z\workplace\vscode\pyvenv\venv\Lib\site-packages\charset_normalizer/cd.pyÚ
<listcomp>3sþz*encoding_unicode_range.<locals>.<listcomp>) rÚIOErrorÚ    importlibÚ import_moduleÚformatrÚrangeÚdecodeÚbytesrrÚsorted)rÚdecoderÚpÚiÚchunkrrrr!Úencoding_unicode_ranges0ÿ
 
 þÿr/)Ú primary_rangercCs>g}t ¡D],\}}|D]}t|ƒ|kr| |¡q qq |S)z>
    Return inferred languages used with a unicode range.
    )r
ÚitemsrÚappend)r0Ú    languagesÚlanguageÚ
charactersÚ    characterrrr!Úunicode_range_languages;s 
r7cCs<t|ƒ}d}|D]}d|kr|}q&q|dkr4dgSt|ƒS)zœ
    Single-byte encoding language association. Some code page are heavily linked to particular language(s).
    This function does the correspondence.
    NZLatinú Latin Based)r/r7)rZunicode_rangesr0Zspecified_rangerrr!Úencoding_languagesJsr9cCs`| d¡s&| d¡s&| d¡s&|dkr,dgS| d¡s>|tkrDdgS| d¡sV|tkr\d    gSgS)
    Multi-byte encoding language association. Some code page are heavily linked to particular language(s).
    This function does the correspondence.
    Zshift_Ú
iso2022_jpZeuc_jÚcp932ÚJapaneseÚgbÚChineseÚ
iso2022_krÚKorean)Ú
startswithrr )rrrr!Úmb_encoding_languages^sÿþýürB)Úmaxsize)r4rcCsBd}d}t|D](}|s$t|ƒr$d}|rt|ƒdkrd}q||fS)zg
    Determine main aspects from a supported language if it contains accents and if is pure Latin.
    FT)r
rr)r4Útarget_have_accentsÚtarget_pure_latinr6rrr!Úget_target_featuresss  rFF)r5Úignore_non_latinrc s¬g}tdd„ˆDƒƒ}t ¡D]l\}}t|ƒ\}}|r@|dkr@q|dkrN|rNqt|ƒ}t‡fdd„|Dƒƒ}    |    |}
|
dkr| ||
f¡qt|dd„d    d
}d d„|DƒS) zE
    Return associated languages associated to given characters.
    css|]}t|ƒVqdS©N)r)rr6rrr!Ú    <genexpr>Œsz%alphabet_languages.<locals>.<genexpr>Fcsg|]}|ˆkr|‘qSrr)rÚc©r5rr!r"šsz&alphabet_languages.<locals>.<listcomp>gš™™™™™É?cSs|dS©Nr    r©Úxrrr!Ú<lambda>¢óz$alphabet_languages.<locals>.<lambda>T©ÚkeyÚreversecSsg|] }|d‘qS)rr)rZcompatible_languagerrr!r"¤s)Úanyr
r1rFÚlenr2r*) r5rGr3Zsource_have_accentsr4Zlanguage_charactersrDrEr Zcharacter_match_countÚratiorrKr!Úalphabet_languages„s"   ÿrW)r4Úordered_charactersrcCs¦|tkrtd |¡ƒ‚d}tt|ƒ}t|ƒ}tt|ƒ}|dk}t|td|ƒƒD]D\}}||krfqRt| |¡}    ||}
t||
ƒ} |dkr¢t    | |    ƒdkr¢qR|dkrÈt    | |    ƒ|dkrÈ|d7}qRt|d|    …} t||    d    …} |d|…}||d    …}tt|ƒt| ƒ@ƒ}tt|ƒt| ƒ@ƒ}t| ƒdkrJ|dkrJ|d7}qRt| ƒdkrl|dkrl|d7}qR|t| ƒd
ksŽ|t| ƒd
krR|d7}qRqR|t|ƒS) aN
    Determine if a ordered characters list (by occurrence from most appearance to rarest) match a particular language.
    The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
    Beware that is function is not strict on the match in order to ease the detection. (Meaning close match is 1.)
    z{} not availableréFéTér    Ngš™™™™™Ù?)
r
Ú
ValueErrorr&ÚsetrUÚzipr'ÚindexÚintÚabs)r4rXZcharacter_approved_countZFREQUENCIES_language_setZordered_characters_countZ target_language_characters_countZlarge_alphabetr6Zcharacter_rankZcharacter_rank_in_languageZexpected_projection_ratioZcharacter_rank_projectionZcharacters_before_sourceZcharacters_after_sourceZcharacters_beforeZcharacters_afterZbefore_match_countZafter_match_countrrr!Úcharacters_popularity_compare§st  ÿÿ ÿþÿ
ÿþÿÿ  ÿÿÿþrb)Údecoded_sequencercCs”i}|D]~}| ¡dkrqt|ƒ}|dkr,qd}|D]}t||ƒdkr4|}qPq4|dkr\|}||krr| ¡||<q||| ¡7<qt| ¡ƒS)a
    Given a decoded text sequence, return a list of str. Unicode range / alphabet separation.
    Ex. a text containing English/Latin with a bit a Hebrew will return two items in the resulting list;
    One containing the latin letters and the other hebrew.
    FN)ÚisalpharrÚlowerÚlistÚvalues)rcZlayersr6rZlayer_target_rangeZdiscovered_rangerrr!Úalpha_unicode_split÷s, ÿÿ rh)Úresultsrcsfi‰|D]8}|D].}|\}}|ˆkr0|gˆ|<qˆ| |¡qq‡fdd„ˆDƒ}t|dd„ddS)z‹
    This function merge results previously given by the function coherence_ratio.
    The return type is the same as coherence_ratio.
    cs.g|]&}|ttˆ|ƒtˆ|ƒdƒf‘qS)rZ)ÚroundÚsumrU)rr4©Zper_language_ratiosrr!r",súþþz*merge_coherence_ratios.<locals>.<listcomp>cSs|dSrLrrMrrr!rO7rPz(merge_coherence_ratios.<locals>.<lambda>TrQ)r2r*)riÚresultZ
sub_resultr4rVÚmergerrlr!Úmerge_coherence_ratioss
 
ø rocs„tƒ‰|D]6}|\}}| dd¡}|ˆkr2gˆ|<ˆ| |¡q
t‡fdd„ˆDƒƒr€g}ˆD]}| |tˆ|ƒf¡q`|S|S)u³
    We shall NOT return "English—" in CoherenceMatches because it is an alternative
    of "English". This function only keeps the best match and remove the em-dash in it.
    u—Úc3s|]}tˆ|ƒdkVqdS)r    N)rU)rÚe©Z index_resultsrr!rIJsz/filter_alt_coherence_matches.<locals>.<genexpr>)ÚdictÚreplacer2rTÚmax)rirmr4rVZ
no_em_nameZfiltered_resultsrrrr!Úfilter_alt_coherence_matches:s rvi皙™™™™¹?)rcÚ    thresholdÚ lg_inclusionrcCsðg}d}d}|dk    r| d¡ng}d|kr8d}| d¡t|ƒD]˜}t|ƒ}| ¡}    tdd„|    Dƒƒ}
|
tkrpq@d    d
„|    Dƒ} |pŠt| |ƒD]J} t| | ƒ} | |kr¦qŒn| d kr¶|d 7}|     | t
| d ƒf¡|dkrŒq@qŒq@t t |ƒdd„ddS)z¨
    Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers.
    A layer = Character extraction by alphabets/ranges.
    FrNú,r8Tcss|]\}}|VqdSrHr©rrJÚorrr!rIlsz"coherence_ratio.<locals>.<genexpr>cSsg|] \}}|‘qSrrr{rrr!r"qsz#coherence_ratio.<locals>.<listcomp>gš™™™™™é?r    rZr[cSs|dSrLrrMrrr!rO…rPz!coherence_ratio.<locals>.<lambda>rQ) ÚsplitÚremoverhrÚ most_commonrkr rWrbr2rjr*rv)rcrxryrirGZsufficient_match_countZlg_inclusion_listZlayerZsequence_frequenciesrr Zpopular_character_orderedr4rVrrr!Úcoherence_ratioUsD    
 ÿÿÿr€)F)rwN),r$ÚcodecsrÚ collectionsrÚ    functoolsrÚtypingZ TypeCounterrrrrZassetsr
Zconstantr r r rZmdrÚmodelsrÚutilsrrrrrÚstrr/r7r9rBÚboolrFrWÚfloatrbrhrorvr€rrrr!Ú<module>sN          'ÿþ $þ P'ÿþ