How to show the corpus text in the R tm package?

I am completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain-text corpus in the R tm package?

I loaded 323 plain-text files into a corpus:

src <- DirSource("Korpora/technologie") 
corpus <- Corpus(src)

But when I call the corpus with:

corpus[[1]]

I always get output like this instead of the corpus text itself:

<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 144 
Content: chars: 141 
Content: chars: 224 
Content: chars: 75 
Content: chars: 105

How can I display the text of the corpus?

Thanks!

UPDATE, reproducible example: I tried with the built-in sample corpus:

> data("crude") 
> crude 
<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 20 
> crude[1] 
<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 1 
> crude[[1]] 
<<PlainTextDocument>> 
Metadata: 15 
Content: chars: 527

How can I print the text of the documents?

UPDATE 2: session info:

> sessionInfo() 
R version 3.1.3 (2015-03-09) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows 7 x64 (build 7601) Service Pack 1 

locale: 
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C     
[5] LC_TIME=German_Germany.1252  

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] tm_0.6-1 NLP_0.1-7 

loaded via a namespace (and not attached): 
[1] parallel_3.1.3 slam_0.1-32 tools_3.1.3


2015-05-25 Azrael

Welcome to SO. Please provide a minimal reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – lukeA

You can try converting the corpus text into a data frame and then access the text you need from the data frame itself. I used the built-in "crude" sample data (from the tm package) as an example.

data("crude") 
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,] 
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
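A slightly more explicit variant of the same idea, as a minimal sketch: it assumes a recent tm version in which `as.character()` and `meta()` work on a single document, and the column names `id` and `text` are only illustrative choices.

```r
library(tm)
data("crude")  # built-in sample corpus with 20 documents

# One string per document; paste() guards against multi-line content vectors.
texts <- vapply(crude,
                function(doc) paste(as.character(doc), collapse = "\n"),
                character(1))

# Document ids taken from the per-document metadata.
ids <- vapply(crude,
              function(doc) as.character(meta(doc, "id")),
              character(1))

df <- data.frame(id = unname(ids), text = unname(texts),
                 stringsAsFactors = FALSE)

# Show the text of the first document.
cat(df$text[1], "\n")
```

Keeping the id next to the text makes it easier to look a document up later than relying on row order alone.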


2015-05-25 10:58:29

OK, that works, thanks. Maybe there will be a bugfix for tm in the future ;) – Azrael

> inspect(crude[1]) 
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>> 

$`reut-00001.xml` 
<<PlainTextDocument (metadata: 15)>> 
Diamond Shamrock Corp said that 
effective today it had cut its contract prices for crude oil by 
1.50 dlrs a barrel. 
    The reduction brings its posted price for West Texas 
Intermediate to 16.00 dlrs a barrel, the copany said. 
    "The price reduction today was made in the light of falling 
oil product prices and a weak crude oil market," a company 
spokeswoman said. 
    Diamond is the latest in a line of U.S. oil companies that 
have cut its contract, or posted, prices over the last two days 
citing weak oil markets. 
Reuter


2015-05-25 09:53:07 Ricky

Sorry, that did not work:
> inspect(crude[1]) <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 1 $`reut-00001.xml` <<PlainTextDocument>> Metadata: 15 Content: chars: 527 – Azrael

That is interesting, it works fine on mine. Can you try crude[1]$content? – Ricky

Same result. I use RStudio; maybe that is the problem, or did I miss some setting in RStudio?
UPDATE: Same in the plain R console – Azrael

I can confirm that as of tm 0.6-1 inspect no longer prints the document text. You can pair tm with the qdap package, which I maintain, to easily convert the corpus to a data.frame as follows:

library(qdap)
as.data.frame(crude)

To make it behave more like the old inspect, you can use:

as.data.frame(crude) %>% with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))

This looks like:

Diamond Shamrock Corp said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel. The reduction brings its posted price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said. "The price reduction today was made in the light of falling oil product prices and a weak crude oil market," a company spokeswoman said. Diamond is the latest in a line of U.S. oil companies that have cut its contract, or posted, prices over the last two days citing weak oil markets. Reuter OPEC may be forced to meet before a scheduled June session to readdress its production cutting agreement if the organization wants to halt the current slide in oil prices, oil industry analysts said. "The movement to higher oil prices was never to be as easy as OPEC thought. They may need an emergency meeting to sort out the problems," said Daniel Yergin, director of Cambridge Energy Research Associates, CERA. Analysts and oil industry sources said the problem OPEC faces is excess oil supply in world oil markets. "OPEC's problem is not a price problem but a production issue and must be addressed in that way," said Paul Mlotok, oil analyst with Salomon Brothers Inc. He said the market's earlier optimism about OPE . . .


2015-05-25 14:27:17

This works for me to print the text content, with the latest version of tm:

corpus[[1]]$content

Note: this is more or less what Ricky suggested in the comment above. Sorry, I wanted to post this as a comment, but my reputation is only 25 (a minimum of 50 reputation is needed to comment).
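For comparison, here is a minimal sketch of three ways to reach the same text on the built-in crude corpus. It assumes a recent tm version in which the `content()` accessor and `as.character()` are defined for plain-text documents; the variable names are only illustrative.

```r
library(tm)    # also attaches NLP, which provides the content() generic
data("crude")  # built-in sample corpus with 20 documents

# [[ ]] extracts a single document; [ ] would return a sub-corpus instead.
doc <- crude[[1]]

txt_field    <- doc$content        # direct access to the stored field
txt_accessor <- content(doc)       # accessor function
txt_char     <- as.character(doc)  # coercion to a character vector

# Print the document text, honoring its embedded line breaks.
writeLines(txt_char)
```

The accessor and the coercion do not depend on how the document object is laid out internally, so they may be the safer choice if the structure changes between tm versions.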


2015-06-30 03:30:59 silo

This works. Does anyone know why this has to be added? The brackets used to work on their own, without adding $content.

Here is a simple, direct way to view the text of a corpus:

strwrap(corpus[[1]])

For the crude data this outputs:

[1] "Diamond Shamrock Corp said that effective today it had cut its contract"  
[2] "prices for crude oil by 1.50 dlrs a barrel. The reduction brings its posted" 
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said." 
[4] "\"The price reduction today was made in the light of falling oil product"  
[5] "prices and a weak crude oil market,\" a company spokeswoman said. Diamond is" 
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"  
[7] "posted, prices over the last two days citing weak oil markets. Reuter"


2016-03-25 10:58:42


From the tm vignette, this works:

writeLines(as.character(doc.corpus[[8]]))

Where '8' is the number of the element you want.
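The writeLines call above can be wrapped in a loop to print several documents in one go. This is a minimal sketch on the built-in crude corpus rather than on doc.corpus; the separator line and the choice of the first three documents are arbitrary.

```r
library(tm)
data("crude")  # built-in sample corpus with 20 documents

# Print the first three documents, each preceded by its document id.
for (i in seq_len(3)) {
  doc <- crude[[i]]
  cat("=====", as.character(meta(doc, "id")), "=====\n")
  writeLines(as.character(doc))
  cat("\n")
}
```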


2016-07-14 00:09:34

We can get the content of every element in the corpus.

data("crude") 
out <- sapply(crude, function(x){x$content}) 
out 

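# optionally export all texts to a single file (the directory must already exist);
# writeCorpus() expects a corpus and one file name per document, so writeLines() is used here
writeLines(out, file.path("outputdir", "corpus.txt"))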


2016-08-19 11:22:44 Selva
