malaytextr: An R package to process Malay text data. It offers a number of functions/datasets for analyzing and working with text data in the Malay language.
Install the latest version of this package by entering the following in R:
install.packages("malaytextr")
Or you can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("zahiernasrudin/malaytextr")
There is a data frame of Malay root words that can be used as a dictionary:
malayrootwords
# A tibble: 4,365 x 2
   `Col Word` `Root Word`
   <chr>      <chr>      
 1 ad         ada        
 2 ak         aku        
 3 akn        akan       
 4 ank        anak       
 5 ap         apa        
 6 awl        awal       
 7 bg         bagi       
 8 bkn        bukan      
 9 blm        belum      
10 bnjr       banjir     
# ... with 4,355 more rows
stem_malay() first looks for the root word in a dictionary, for which the
malayrootwords data frame can be used. It then removes the “extra suffix”,
the “prefix” and lastly the “suffix”.
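The order described above (dictionary lookup, then "extra suffix", prefix and suffix removal) can be illustrated with a minimal base R sketch. This is not the package's implementation: the affix lists, the helper names strip_first() and stem_sketch(), the min_len guard and the toy dictionary are all assumptions made for illustration, and the real stem_malay() rules may differ.

```r
# Hypothetical affix lists; the rules actually used by malaytextr are not
# listed in this README, so these are illustrative assumptions only.
extra_suffixes <- c("nya", "lah", "kah")
prefixes <- c("meng", "mem", "men", "me", "peng", "pem", "pen", "pe",
              "ber", "ter", "di", "ke")
suffixes <- c("kan", "an", "i")

# Remove the first matching affix from the start or end of a word,
# but only if at least `min_len` characters would remain.
strip_first <- function(word, affixes, where = c("start", "end"), min_len = 3) {
  where <- match.arg(where)
  for (affix in affixes) {
    pattern <- if (where == "start") paste0("^", affix) else paste0(affix, "$")
    candidate <- sub(pattern, "", word)
    if (candidate != word && nchar(candidate) >= min_len) {
      return(candidate)
    }
  }
  word
}

# Stem a single word: dictionary lookup first, then affix removal in the
# order extra suffix, prefix, suffix.
# `dictionary` is a named character vector (names = observed forms,
# values = root words), a simplification of the malayrootwords data frame.
stem_sketch <- function(word, dictionary) {
  if (word %in% names(dictionary)) {
    return(dictionary[[word]])
  }
  word <- strip_first(word, extra_suffixes, "end")
  word <- strip_first(word, prefixes, "start")
  strip_first(word, suffixes, "end")
}

# Toy dictionary built from two rows of the table shown earlier
toy_dictionary <- c(bnjr = "banjir", blm = "belum")

words <- c("bnjr", "banyaknya", "terkedu")
stemmed <- vapply(words, stem_sketch, character(1),
                  dictionary = toy_dictionary, USE.NAMES = FALSE)
print(data.frame(word = words, stemmed = stemmed))
```

With these assumed lists, "bnjr" is resolved by the dictionary alone, while "banyaknya" and "terkedu" are handled by affix removal. Words that need more careful rules are exactly where a naive sketch like this diverges from the package, so use stem_malay() for real work.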
Note: the column ‘Root Word’ is now returned instead of ‘root_word’.
To stem the word “banyaknya”, pass it as word. The result is a data frame with the word “banyaknya” and the stemmed word “banyak”:
stem_malay(word = "banyaknya", dictionary = malayrootwords)
'Root Word' is now returned instead of 'root_word'
   Col Word Root Word
1 banyaknya    banyak
To stem words in a data frame, pass the data frame as word and name the text column in col_feature1:
x <- data.frame(text = c("banyaknya","sangat","terkedu", "pengetahuan"))
stem_malay(word = x, 
          dictionary = malayrootwords, 
          col_feature1 = "text")
  
'Root Word' is now returned instead of 'root_word'
     Col Word Root Word
1   banyaknya    banyak
2      sangat    sangat
3     terkedu      kedu
4 pengetahuan      tahu
remove_url() removes all URLs found in a string:
x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
remove_url(x)
[1] "test "               "another one  to try"
There is a data frame of Malay stop words:
malaystopwords
# A tibble: 512 x 1
   stopwords
   <chr>    
 1 ada      
 2 sampai   
 3 sana     
 4 itu      
 5 sangat   
 6 saya     
 7 jadi     
 8 se       
 9 agak     
10 jangan   
# ... with 502 more rows
This lexicon includes words that have been labelled as positive or negative:
sentiment_general
# A tibble: 1,424 × 2
   Word      Sentiment
   <chr>     <chr>    
 1 aduan     Negative 
 2 agresif   Negative 
 3 amaran    Negative 
 4 anarki    Negative 
 5 ancaman   Negative 
 6 aneh      Negative 
 7 antagonis Negative 
 8 azab      Negative 
 9 babi      Negative 
10 bahaya    Negative 
# … with 1,414 more rows
This dataset is a development version that aims to standardize Malay words that have multiple variations or spellings:
normalized
# A tibble: 153 × 2
   `Col Word` `Normalized Word`
   <chr>      <chr>            
 1 ad         ada              
 2 ak         aku              
 3 akn        akan             
 4 ank        anak             
 5 ap         apa              
 6 awl        awal             
 7 bg         bagi             
 8 bkn        bukan            
 9 blm        belum            
10 bnjr       banjir           
# … with 143 more rows
To report a bug, please file an issue on GitHub.
MIT License