Module References
indictrans.transliterator — Transliterator
class indictrans.Transliterator(source='hin', target='eng', decode='viterbi', build_lookup=False, rb=True)
Transliterator for Indic scripts, including English and Urdu.
Parameters:
- source : str, default: 'hin'
  Source language (3-letter ISO 639 code).
- target : str, default: 'eng'
  Target language (3-letter ISO 639 code).
- decode : str, default: 'viterbi'
  Decoding algorithm, either 'viterbi' or 'beamsearch'.
- build_lookup : bool, default: False
  Whether to build a lookup table. Speeds up transliteration when the input text contains repeated words.
- rb : bool, default: True
  Whether to use the rule-based system or the ML system for transliteration. This choice applies only to Indic-to-Indic transliteration; if True, the rule-based system is used.
Examples
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', build_lookup=True)
>>> hin = '''कांग्रेस पार्टी अध्यक्ष सोनिया गांधी, तमिलनाडु की मुख्यमंत्री
... जयललिता और रिज़र्व बैंक के गवर्नर रघुराम राजन के बीच एक
... समानता है. ये सभी अलग-अलग कारणों से भारतीय जनता पार्टी के
... राज्यसभा सांसद सुब्रमण्यम स्वामी के निशाने पर हैं. उनके
... जयललिता और सोनिया गांधी के पीछे पड़ने का कारण कथित
... भ्रष्टाचार है.'''
>>> eng = trn.transform(hin)
>>> print(eng)
congress party adhyaksh sonia gandhi, tamilnadu kii mukhyamantri jayalalita our reserve baink ke governor raghuram rajan ke beech ek samanta hai. ye sabi alag-alag carnon se bharatiya janata party ke rajyasabha saansad subramanyam swami ke nishane par hain. unke jayalalita our sonia gandhi ke peeche padane ka kaaran kathith bhrashtachar hai.
Methods
- convert
indictrans.base.BaseTransliterator — BaseTransliterator
class indictrans.base.BaseTransliterator(source, target, decoder, build_lookup=False)
Base class for transliterators.
Attributes:
- vectorizer_ : instance
  OneHotEncoder instance for converting categorical features to one-hot features.
- classes_ : dict
  Dictionary mapping unique ids to tags ({id: tag}).
- coef_ : array
  HMM coefficient array.
- intercept_init_ : array
  HMM intercept array for the first layer of the trellis.
- intercept_trans_ : array
  HMM intercept/transition array for the middle layers of the trellis.
- intercept_final_ : array
  HMM intercept array for the last layer of the trellis.
- wx_process : method
  wx2utf/utf2wx method of the WX instance.
- nu : instance
  UrduNormalizer instance for normalizing Urdu scripts.
Methods
- convert_to_wx(text): Converts Indic scripts to WX.
- load_models(): Loads transliteration models.
- predict(word[, k_best]): Given an encoded word matrix and HMM parameters, predicts the output sequence (target word).
- top_n_trans(text[, k_best]): Returns the k-best transliterations using beamsearch decoding.
- transliterate(text[, k_best]): Returns the single best transliteration using viterbi decoding.
- base_fit
- load_mappings
convert_to_wx(text)
  Converts Indic scripts to WX.

load_models()
  Loads transliteration models.

predict(word, k_best=5)
  Given an encoded word matrix and HMM parameters, predicts the output sequence (target word).

top_n_trans(text, k_best=5)
  Returns the k-best transliterations using beamsearch decoding.
  Parameters:
  - k_best : int, default: 5, optional
    Number of best transliterations returned by the beamsearch decoder.

transliterate(text, k_best=None)
  Returns the single best transliteration using viterbi decoding.
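The predict method runs Viterbi decoding over a first-order HMM trellis parameterized by the coef_, intercept_init_, intercept_trans_ and intercept_final_ arrays listed above. A minimal sketch of that decoding step, using made-up toy scores rather than the library's trained models, might look like:

```python
import numpy as np

def viterbi_decode(emissions, trans, init, final):
    """Single best state sequence through a first-order HMM trellis.

    emissions : (T, K) per-position scores (e.g. features times coef_)
    trans     : (K, K) transition scores (cf. intercept_trans_)
    init      : (K,)  scores added at the first layer (cf. intercept_init_)
    final     : (K,)  scores added at the last layer (cf. intercept_final_)
    """
    T, K = emissions.shape
    score = init + emissions[0]
    back = np.zeros((T, K), dtype=int)  # back-pointers; row 0 unused
    for t in range(1, T):
        # cand[i, j] = score of being in state i at t-1 and moving to j at t
        cand = score[:, None] + trans + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    score = score + final
    # follow back-pointers from the best final state
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

em = np.array([[1., 0.], [0., 1.], [1., 0.]])  # toy 2-state, 3-step scores
print(viterbi_decode(em, np.zeros((2, 2)), np.zeros(2), np.zeros(2)))  # [0, 1, 0]
```

With zero transition and intercept scores the decoder simply follows the per-position maxima, which makes the toy output easy to check by hand.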
indictrans._utils.WX — WXConverter
class indictrans._utils.WX(order=u'utf2wx', lang=u'hin')
WX converter for UTF-to-WX conversion of Indic scripts and vice versa.
Parameters:
- lang : str, default: 'hin'
  Input script.
- order : str, default: 'utf2wx'
  Order of conversion.
Examples
>>> from indictrans import WX
>>> wxc = WX(lang='hin', order='utf2wx')
>>> hin_utf = u'''बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले
... अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर
... सवाल उठाए हैं.'''
>>> hin_wx = wxc.utf2wx(hin_utf)
>>> print(hin_wx)
bIjepI ke sAMsaxa subramaNyama svAmI ne kuCa hI xina pahale apanI hI sarakAra ko kaTaGare meM KadZA karawe hue jIdIpI AMkadZoM para savAla uTAe hEM.
>>> wxc = WX(lang='hin', order='wx2utf')
>>> hin_utf_ = wxc.wx2utf(hin_wx)
>>> print(hin_utf_)
बीजेपी के सांसद सुब्रमण्यम स्वामी ने कुछ ही दिन पहले अपनी ही सरकार को कठघरे में खड़ा करते हुए जीडीपी आंकड़ों पर सवाल उठाए हैं.
>>> wxc = WX(lang='mal', order='utf2wx')
>>> mal_utf = u'''വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്ക്ക് അനുകൂലമായ
... രീതിയിലാണ് ബി എസ് ഇയില് വ്യാപാരം നടക്കുന്നത്.'''
>>> mal_wx = wxc.utf2wx(mal_utf)
>>> print(mal_wx)
vipaNiyileV SuBApwiviSvAsakkArAya kAlYakalYkk anukUlamAya rIwiyilAN bi eVs iyil vyApAraM natakkunnaw.
>>> wxc = WX(lang='mal', order='wx2utf')
>>> mal_utf_ = wxc.wx2utf(mal_wx)
>>> print(mal_utf_)
വിപണിയിലെ ശുഭാപ്തിവിശ്വാസക്കാരായ കാളകള്ക്ക് അനുകൂലമായ രീതിയിലാണ് ബി എസ് ഇയില് വ്യാപാരം നടക്കുന്നത്.
Methods
- iscii2unicode(iscii): Convert ISCII to Unicode.
- iscii2wx(my_string): Convert ISCII to WX.
- normalize(text): Performs some common normalization (byte order mark, word joiner, etc.).
- unicode2iscii(unicode_): Convert Unicode to ISCII.
- utf2wx(unicode_): Convert a UTF string to WX-Roman.
- wx2iscii(my_string): Convert WX to ISCII.
- wx2utf(wx): Convert WX-Roman to UTF.

Other methods: fit, initialize_utf2wx_hash, initialize_wx2utf_hash, iscii2unicode_ben, iscii2unicode_guj, iscii2unicode_hin, iscii2unicode_kan, iscii2unicode_mal, iscii2unicode_ori, iscii2unicode_pan, iscii2unicode_tam, iscii2unicode_tel, map_EY, map_EY2, map_OY, map_OY2, map_Z, map_ZeV, map_ZoV, map_a, map_eV, map_eV2, map_lY, map_lYY, map_nY, map_oV, map_oV2, map_q, map_rY, unicode2iscii_ben, unicode2iscii_guj, unicode2iscii_hin, unicode2iscii_kan, unicode2iscii_mal, unicode2iscii_ori, unicode2iscii_pan, unicode2iscii_tam, unicode2iscii_tel
iscii2unicode(iscii)
  Convert ISCII to Unicode.

iscii2wx(my_string)
  Convert ISCII to WX.

normalize(text)
  Performs some common normalization, which includes:
  - byte order mark, word joiner, etc. removal
  - ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER removal

unicode2iscii(unicode_)
  Convert Unicode to ISCII.

utf2wx(unicode_)
  Convert a UTF string to WX-Roman.

wx2iscii(my_string)
  Convert WX to ISCII.

wx2utf(wx)
  Convert WX-Roman to UTF.
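At its core, utf2wx is a character-level mapping from an Indic script to WX-Roman. The idea can be sketched with a hand-picked map covering just the five Devanagari characters of the word बीजेपी from the doctest above; the real converter goes through ISCII and applies many contextual rules that this toy version ignores:

```python
# Toy Devanagari -> WX-Roman character map (a tiny illustrative subset;
# the real WX converter maps via ISCII and handles contextual rules).
DEV2WX = {
    'ब': 'b',   # letter BA
    'ी': 'I',   # vowel sign II
    'ज': 'j',   # letter JA
    'े': 'e',   # vowel sign E
    'प': 'p',   # letter PA
}

def toy_utf2wx(text):
    """Map each covered Devanagari character to its WX-Roman letter."""
    return ''.join(DEV2WX.get(ch, ch) for ch in text)

print(toy_utf2wx('बीजेपी'))  # bIjepI, matching the doctest output above
```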
indictrans._utils.OneHotEncoder — OneHotEncoder
class indictrans._utils.OneHotEncoder
Transforms categorical features to continuous numeric features.
Examples
>>> from one_hot_encoder import OneHotEncoder
>>> enc = OneHotEncoder()
>>> sequences = [list('bat'), list('cat'), list('rat')]
>>> enc.fit(sequences)
<one_hot_encoder.OneHotEncoder instance at 0x7f346d71c200>
>>> enc.transform(sequences, sparse=False).astype(int)
array([[0, 1, 0, 1, 1],
       [1, 0, 0, 1, 1],
       [0, 0, 1, 1, 1]])
>>> enc.transform(list('cat'), sparse=False).astype(int)
array([[1, 0, 0, 1, 1]])
>>> enc.transform(list('bat'), sparse=True)
<1x5 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
Methods
- fit(X): Fit OneHotEncoder to X.
- transform(X[, sparse]): Transform X using one-hot encoding.
fit(X)
  Fit OneHotEncoder to X.
  Parameters:
  - X : array-like, shape [n_samples, n_features]
    Input array of type int.
  Returns:
  - self

transform(X, sparse=True)
  Transform X using one-hot encoding.
  Parameters:
  - X : array-like, shape [n_samples, n_features]
    Input array of categorical features.
  - sparse : bool, default: True
    Return a sparse matrix if True, else return an array.
  Returns:
  - X_out : sparse matrix if sparse=True, else a 2-d array, dtype=int
    Transformed input.
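The doctest above shows the essential behaviour: each character position gets its own one-hot sub-vector, and the sub-vectors are concatenated (positions 1 and 2 of 'bat'/'cat'/'rat' only ever see 'a' and 't', hence the constant trailing columns). A minimal sketch of that scheme, assuming sorted column ordering, which the library's own internal ordering need not match:

```python
import numpy as np

def fit_positions(sequences):
    """Collect, per character position, the sorted set of symbols seen there.

    Sorted ordering is an assumption of this sketch; the library's column
    ordering may differ, so exact columns won't match its doctest output.
    """
    length = len(sequences[0])
    return [sorted({seq[i] for seq in sequences}) for i in range(length)]

def one_hot(seq, positions):
    """Concatenate one one-hot sub-vector per position."""
    out = []
    for i, alphabet in enumerate(positions):
        vec = [0] * len(alphabet)
        if seq[i] in alphabet:          # unseen symbols encode as all-zeros
            vec[alphabet.index(seq[i])] = 1
        out.extend(vec)
    return np.array(out)

positions = fit_positions([list('bat'), list('cat'), list('rat')])
# position 0 -> ['b', 'c', 'r']; positions 1 and 2 -> ['a'] and ['t']
print(one_hot(list('bat'), positions))  # [1 0 0 1 1] under sorted ordering
```

Note that the feature count (here 3 + 1 + 1 = 5) matches the 1x5 matrices in the doctest, even though the column order differs.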
indictrans._utils.UrduNormalizer — UrduNormalizer
class indictrans._utils.UrduNormalizer
Normalizer for Urdu scripts. Normalizes different Unicode canonical equivalences to a single code point.
Examples
>>> from indictrans import UrduNormalizer
>>> text = u'''ﺎﻧ کﻭ ﻍیﺮﻗﺎﻧﻮﻧی ﺝگہ کﺱ ﻥے ﺩی؟
... ﻝﻭگﻭں کﻭ ﻖﺘﻟ کیﺍ ﺝﺍﺭ ہﺍ ہے ۔
... ﺏڑے ﻡﺎﻣﻭں ﺎﻧ ﺪﻧﻭں ﻢﺤﻟہ ﺥﺩﺍﺩﺍﺩ ﻡیں ﺭہﺕے ﺕھے۔
... ﻉﻭﺎﻣی یﺍ ﻑﻼﺣی ﺥﺪﻣﺎﺗ ﺍیک ﺎﻟگ ﺩﺎﺋﺭہ ﻊﻤﻟ ہے۔'''
>>> nu = UrduNormalizer()
>>> print(nu.normalize(text))
ان کو غیرقانونی جگہ کس نے دی؟ لوگوں کو قتل کیا جار ہا ہے ۔ بڑے ماموں ان دنوں محلہ خداداد میں رہتے تھے۔ عوامی یا فلاحی خدمات ایک الگ دائرہ عمل ہے۔
Methods
- cnorm(text): Normalize NO_BREAK_SPACE, SOFT_HYPHEN, WORD_JOINER, H_SPACE, ZERO_WIDTH[SPACE, NON_JOINER, JOINER], MARK[LEFT_TO_RIGHT, RIGHT_TO_LEFT, BYTE_ORDER, BYTE_ORDER_2].
- normalize(text): Normalize text.

cnorm(text)
  Normalize NO_BREAK_SPACE, SOFT_HYPHEN, WORD_JOINER, H_SPACE, ZERO_WIDTH[SPACE, NON_JOINER, JOINER], MARK[LEFT_TO_RIGHT, RIGHT_TO_LEFT, BYTE_ORDER, BYTE_ORDER_2].

normalize(text)
  Normalize text.
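A rough approximation of this kind of normalization is available in the standard library: unicodedata.normalize with 'NFKC' folds Arabic presentation forms (as seen in the garbled input above) back to their base code points, and the invisible characters that cnorm targets can be stripped by hand. This is only a sketch under those assumptions; the library uses its own mapping tables, and NFKC compatibility folding is broader than canonical normalization.

```python
import re
import unicodedata

# A subset of the invisible characters cnorm targets: NBSP, SOFT_HYPHEN,
# WORD_JOINER, ZERO_WIDTH[SPACE, NON_JOINER, JOINER], LRM, RLM, BOM.
INVISIBLE = re.compile('[\u00a0\u00ad\u2060\u200b\u200c\u200d\u200e\u200f\ufeff]')

def rough_normalize(text):
    """Fold presentation forms to base code points and drop invisibles."""
    text = unicodedata.normalize('NFKC', text)
    return INVISIBLE.sub('', text)

# U+FEE7 (ARABIC LETTER NOON, initial form) folds to plain U+0646 NOON
print(rough_normalize('\ufee7') == '\u0646')  # True
```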
indictrans.trunk.StructuredPerceptron — StructuredPerceptron
class indictrans.trunk.StructuredPerceptron(lr_exp=0.1, n_iter=15, random_state=None, verbose=0)
Structured perceptron for sequence classification.
The implementation is based on the averaged structured perceptron algorithm of M. Collins.
Parameters:
- lr_exp : float, default: 0.1
  The exponent used for inverse scaling of the learning rate. Given iteration number t, the effective learning rate is 1. / (t ** lr_exp).
- n_iter : int, default: 15
  Maximum number of epochs of the structured perceptron algorithm.
- random_state : int, RandomState instance or None, optional (default=None)
  If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- verbose : int, default: 0 (quiet mode)
  Verbosity mode.
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
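The inverse-scaling schedule from the lr_exp parameter is easy to check numerically; with the default lr_exp=0.1 the learning rate decays quite slowly across iterations:

```python
def effective_lr(t, lr_exp=0.1):
    """Effective learning rate at iteration t: 1. / (t ** lr_exp)."""
    return 1.0 / (t ** lr_exp)

for t in (1, 10, 100):
    print(t, round(effective_lr(t), 3))  # 1.0, 0.794, 0.631
```

A larger exponent (e.g. lr_exp=1.0, a plain 1/t schedule) would shrink updates much faster; the small default keeps late epochs contributing meaningfully to the averaged weights.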
Methods
- fit(X, y): Fit the model to the given set of sequences.
- predict(X): Predict output sequences for input sequences in X.
fit(X, y)
  Fit the model to the given set of sequences.
  Parameters:
  - X : {array-like, sparse matrix}, shape (n_sequences, sequence_length, n_features)
    Feature matrix of training sequences.
  - y : list of arrays, shape (n_sequences, sequence_length)
    Target labels.
  Returns:
  - self : object
    Returns self.

predict(X)
  Predict output sequences for input sequences in X.
  Parameters:
  - X : {array-like, sparse matrix}, shape (n_sequences, sequence_length, n_features)
    Feature matrix of test sequences.
  Returns:
  - y : array, shape (n_sequences, sequence_length)
    Labels per sequence in X.