Stemming unstructured text in NLTK

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm only interested in the "play" stem. Here is the code I'm working with:

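import nltk
from nltk import stem
from nltk.corpus import stopwords

# Read the raw text and split it on whitespace.
with open('tupac_original.txt') as f:
    text = f.read()
tup = nltk.Text(text.split())
# Lowercase everything and keep only purely alphabetic tokens.
lowtup = [w.lower() for w in tup if w.isalpha()]
# Build the stopword set once instead of reloading the list for every token.
stops = set(stopwords.words('english'))
tupclean = [w for w in lowtup if w not in stops]
# Strip the suffixes that mark the spelling variants.
tupstem = stem.RegexpStemmer('az$|as$|a$')
[tupstem.stem(i) for i in tupclean]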

The result of the above is:

['like', 'ed', 'young', 'black', 'like'...]

I'm trying to clean up a .txt file (lowercase everything, remove stopwords, etc.), normalize multiple spellings of a word into one, and do a frequency distribution/count. I know how to do FreqDist, but any suggestions as to where I'm going wrong with the stemming?

2013-09-26 user2221429

Isn't stemming the normalization you're looking for? You say you're having problems... what have you tried? – Spaceghost

What is your expected output? Depending on your task, you might need a lemmatizer instead of a stemmer, see http://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers – alvas

There are several well-known stemmers pre-coded in NLTK, see http://nltk.org/api/nltk.stem.html; an example is shown below.

>>> from nltk import stem 
>>> porter = stem.porter.PorterStemmer() 
>>> lancaster = stem.lancaster.LancasterStemmer() 
>>> snowball = stem.snowball.EnglishStemmer() 
>>> tokens = ['player', 'playa', 'playas', 'pleyaz'] 
>>> [porter.stem(i) for i in tokens] 
['player', 'playa', 'playa', 'pleyaz'] 
>>> [lancaster.stem(i) for i in tokens] 
['play', 'play', 'playa', 'pleyaz'] 
>>> [snowball.stem(i) for i in tokens] 
[u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is some sort of regex stemmer:

>>> from nltk import stem 
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$') 
>>> [rxstem.stem(i) for i in tokens] 
['play', 'play', 'play', 'pley']
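
Note that the list comprehension returns every token, stemmed or not, which is probably why the unrelated words still show up. A minimal sketch of filtering down to the stems of interest and counting them, using only the standard library: `re.sub` stands in for `RegexpStemmer.stem`, and `collections.Counter` for `FreqDist` (which is itself a `Counter` subclass). The token list, `regex_stem`, and `targets` are made-up names for illustration, not part of the original code.

```python
import re
from collections import Counter

# Same suffix pattern as the RegexpStemmer example above.
SUFFIXES = re.compile(r'er$|a$|as$|az$')

def regex_stem(word):
    """Strip a matching suffix, roughly what RegexpStemmer.stem does."""
    return SUFFIXES.sub('', word)

# Hypothetical cleaned tokens standing in for tupclean.
tokens = ['like', 'playa', 'young', 'playas', 'black', 'player', 'pleyaz']

# Stem every token, then keep only the stems we care about (assumed set).
targets = {'play', 'pley'}
stems = [regex_stem(w) for w in tokens]
play_counts = Counter(s for s in stems if s in targets)
print(play_counts)
```

If "pley" should count as "play" too, one option is to map it explicitly before counting rather than widening the regex, since a broader pattern will also clip unrelated words.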

2013-09-27 07:23:01 alvas

I've edited my question. I tried your regexStem and got multiple tokens. Not sure where I'm going wrong. – user2221429

Change the last line to '[tupstem.stem(i) for i in tupclean if "pl" in tupclean and "y" in tupstem.stem(i)]'. In linguistics, vowel shifts happen, and assuming the diphthongs remain as well as the onset, the consonant cluster "pl" will also be present in the spelling. – alvas
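
One thing to watch with that filter: a test against `tupclean` itself asks whether the list contains the token "pl", not whether each word contains those letters. A rough sketch of a per-token version of the same idea follows, assuming the "pl" onset means the word starts with "pl". The token list and `looks_like_play` are hypothetical, and `re.sub` stands in for `tupstem.stem`.

```python
import re

# Hypothetical cleaned tokens standing in for tupclean.
tokens = ['like', 'playa', 'apply', 'playas', 'pleyaz', 'plan']

# Suffix pattern from the question.
suffixes = re.compile(r'az$|as$|a$')

def looks_like_play(token):
    """True if the token starts with 'pl' and its stem contains 'y'."""
    stemmed = suffixes.sub('', token)
    return token.startswith('pl') and 'y' in stemmed

# Keep the stems of tokens that pass the per-token check.
play_variants = [suffixes.sub('', t) for t in tokens if looks_like_play(t)]
print(play_variants)
```

This is still a heuristic: any other word that starts with "pl" and has a "y" in it will pass as well, so it is worth eyeballing the surviving tokens before counting.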

Tried this, but it didn't really do what I was hoping it would. Thanks anyway! – user2221429
