| Type: | Package |
| Title: | Big Data Preprocessing Architecture |
| Version: | 3.1.0 |
| Description: | Provide a tool to easily build customized data flows to pre-process large volumes of information from different sources. To this end, 'bdpar' allows to (i) easily use and create new functionalities and (ii) develop new data source extractors according to the user needs. Additionally, the package provides by default a predefined data flow to extract and pre-process the most relevant information (tokens, dates, ... ) from some textual sources (SMS, Email, YouTube comments). |
| Date: | 2023-12-11 |
| License: | GPL-3 |
| URL: | https://github.com/miferreiro/bdpar |
| BugReports: | https://github.com/miferreiro/bdpar/issues |
| Depends: | R (≥ 3.5.0) |
| Imports: | digest, parallel, R6, rlist, tools, utils |
| Suggests: | cld2, knitr, rex, rjson, rmarkdown, stringi, stringr, testthat (≥ 2.3.1), tuber |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.2.3 |
| SystemRequirements: | Python (>= 2.7 or >= 3.6) |
| Encoding: | UTF-8 |
| NeedsCompilation: | no |
| Collate: | 'AbbreviationPipe.R' 'bdpar.log.R' 'wrapper.R' 'Bdpar.R' 'BdparOptions.R' 'Connections.R' 'ContractionPipe.R' 'DefaultPipeline.R' 'DynamicPipeline.R' 'ExtractorEml.R' 'ExtractorFactory.R' 'ExtractorSms.R' 'ExtractorYtbid.R' 'File2Pipe.R' 'FindEmojiPipe.R' 'FindEmoticonPipe.R' 'FindHashtagPipe.R' 'FindUrlPipe.R' 'FindUserNamePipe.R' 'GenericPipe.R' 'GenericPipeline.R' 'GuessDatePipe.R' 'GuessLanguagePipe.R' 'Instance.R' 'InterjectionPipe.R' 'MeasureLengthPipe.R' 'ResourceHandler.R' 'SlangPipe.R' 'StopWordPipe.R' 'StoreFileExtPipe.R' 'TargetAssigningPipe.R' 'TeeCSVPipe.R' 'ToLowerCasePipe.R' 'bdpar.Options.R' 'bdparData.R' 'eml.R' 'emojisData.R' 'operator-pipe.R' 'runPipeline.R' 'zzz.R' |
| Packaged: | 2023-12-12 17:32:47 UTC; Maite |
| Author: | Miguel Ferreiro-Díaz [aut, cre], David Ruano-Ordás [aut, ctr], Tomás R. Cotos-Yañez [aut, ctr], José Ramón Méndez Reboredo [aut, ctr], University of Vigo [cph] |
| Maintainer: | Miguel Ferreiro-Díaz <miguel.ferreiro.diaz@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-12-12 18:00:10 UTC |
Class to find and/or replace the abbreviations on the data field of an Instance
Description
AbbreviationPipe class is responsible for detecting
the existing abbreviations in the data field of each Instance.
Identified abbreviations are stored inside the abbreviation field of
Instance class. Moreover, if needed, it can perform inline
abbreviation replacement.
Details
AbbreviationPipe class requires the resource files (in json format)
containing the correspondence between abbreviations and meaning. To this end,
the language of the text indicated in the propertyLanguageName should
be contained in the resource file name (i.e. abbrev.xxx.json, where xxx is the
value defined in the propertyLanguageName). The location of the
resources should be defined in the "resources.abbreviations.path"
field of bdpar.Options variable.
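As a rough illustration of the naming convention described here, the following sketch builds the expected resource file name from a language code. The helper abbrev_resource_file and the "en" language code are hypothetical and are used only for illustration; they are not part of bdpar.

```r
# Hypothetical helper (not part of bdpar): builds the path of the resource
# file expected for a given language, following the abbrev.xxx.json convention.
abbrev_resource_file <- function(resources_path, language) {
  file.path(resources_path, paste0("abbrev.", language, ".json"))
}

# Assumed example values: a temporary folder and the "en" language code.
resources_path <- tempdir()
resource_file <- abbrev_resource_file(resources_path, "en")
print(basename(resource_file))
```

In bdpar itself, the folder would be supplied through the "resources.abbreviations.path" field of bdpar.Options or through the resourcesAbbreviationsPath argument of the constructor.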
Note
AbbreviationPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> AbbreviationPipe
Methods
Public methods
Inherited methods
Method new()
Creates an AbbreviationPipe object.
Usage
AbbreviationPipe$new(
propertyName = "abbreviation",
propertyLanguageName = "language",
alwaysBeforeDeps = list("GuessLanguagePipe"),
notAfterDeps = list(),
replaceAbbreviations = TRUE,
resourcesAbbreviationsPath = NULL
)
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- propertyLanguageName: A character value. Name of the language property.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
- replaceAbbreviations: A logical value. Indicates if the abbreviations are replaced or not.
- resourcesAbbreviationsPath: A character value. Path of resource files (in json format) containing the correspondence between abbreviations and meaning.
Method pipe()
Preprocesses the Instance to obtain/replace
the abbreviations. The abbreviations found in the data are added to the
list of properties of the Instance.
Usage
AbbreviationPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findAbbreviation()
Checks if the abbreviation is in the data.
Usage
AbbreviationPipe$findAbbreviation(data, abbreviation)
Arguments
Returns
A logical value depending on whether the
abbreviation is in the data.
Method replaceAbbreviation()
Replaces the abbreviation in the data with the extendedAbbreviation.
Usage
AbbreviationPipe$replaceAbbreviation(abbreviation, extendedAbbreviation, data)
Arguments
Returns
The data with the abbreviations replaced.
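The find and replace steps can be pictured with plain base R string functions. This is only a sketch under stated assumptions: find_abbreviation and replace_abbreviation are hypothetical stand-ins, and literal (fixed) matching is chosen for simplicity; the matching rules actually used by AbbreviationPipe may differ.

```r
# Hypothetical stand-ins for findAbbreviation() and replaceAbbreviation();
# literal (fixed) matching is an assumption made to keep the sketch simple.
find_abbreviation <- function(data, abbreviation) {
  grepl(abbreviation, data, fixed = TRUE)
}

replace_abbreviation <- function(abbreviation, extended_abbreviation, data) {
  gsub(abbreviation, extended_abbreviation, data, fixed = TRUE)
}

data <- "Please reply asap"
found <- find_abbreviation(data, "asap")
replaced <- replace_abbreviation("asap", "as soon as possible", data)
print(found)
print(replaced)
```

Literal matching also hits abbreviations embedded in longer words, so a real implementation would likely need some form of word-boundary handling.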
Method getPropertyLanguageName()
Gets the name of property language.
Usage
AbbreviationPipe$getPropertyLanguageName()
Returns
Value of name of property language.
Method getResourcesAbbreviationsPath()
Gets the path of abbreviations resources.
Usage
AbbreviationPipe$getResourcesAbbreviationsPath()
Returns
Value of path of abbreviations resources.
Method setResourcesAbbreviationsPath()
Sets the path of abbreviations resources.
Usage
AbbreviationPipe$setResourcesAbbreviationsPath(path)
Arguments
- path: A character value. The new value of the path of abbreviations resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
AbbreviationPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.Options, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to manage the preprocess of the files throughout the flow of pipes
Description
Bdpar class provides the static variables required
to perform the whole data flow process. To this end, Bdpar is
in charge of (i) initializing the objects that handle the connections to APIs
(Connections) and the json resources (ResourceHandler),
and (ii) executing the flow of pipes (inherited from GenericPipeline class)
passed as argument.
Details
If a pipe defined in the workflow needs some type of configuration, it can be defined through the bdpar.Options variable, which has different methods to support the functionality of different pipes.
Static variables
- connections: (Connections) object that handles the connections with YouTube and Twitter.
- resourceHandler: (ResourceHandler) object that handles the json resources files.
Methods
Public methods
Method new()
Creates a Bdpar object. Initializes the static variables: connections and resourceHandler.
Usage
Bdpar$new()
Method execute()
Preprocesses files through the indicated flow of pipes.
Usage
Bdpar$execute( path, extractors = ExtractorFactory$new(), pipeline = DefaultPipeline$new(), cache = TRUE, verbose = FALSE, summary = FALSE )
Arguments
- path: A character value. The path where the files to be processed are located.
- extractors: An ExtractorFactory value. Class which implements the createInstance method to choose which type of Instance is created.
- pipeline: A GenericPipeline value. Subclass of GenericPipeline, which implements the execute method. By default, it is the DefaultPipeline pipeline.
- cache: (logical) flag indicating if the status of the instances will be stored after each pipe. This avoids re-running previously executed tasks, provided the order and configuration of the pipes and pipeline are the same as what is stored in the cache.
- verbose: (logical) flag indicating whether messages, warnings and errors are printed.
- summary: (logical) flag indicating if a summary of the pipeline execution is provided or not.
Details
To parallelize the execution, indicate the number of cores to be used through bdpar.Options$set("numCores", numCores).
Returns
The list of Instances that have been preprocessed.
Method clone()
The objects of this class are cloneable with this method.
Usage
Bdpar$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.Options, Connections,
DefaultPipeline, DynamicPipeline,
GenericPipeline, Instance,
ExtractorFactory, ResourceHandler,
runPipeline
Examples
## Not run:
#If any configuration is needed, set it through:
#bdpar.Options$set(key, value)
#If the key is not initialized, add it through:
#bdpar.Options$add(key, value)
#To parallelize the execution, set the number of cores through:
#bdpar.Options$set("numCores", numCores)
#To change the behavior of the log, use:
#bdpar.Options$configureLog(console = TRUE, threshold = "INFO", file = NULL)
#Folder with the files to preprocess
path <- system.file("example",
                    package = "bdpar")
#Object which decides how the instances are created
extractors <- ExtractorFactory$new()
#Object which indicates the pipes' flow
pipeline <- DefaultPipeline$new()
objectBdpar <- Bdpar$new()
#Starting file preprocessing...
objectBdpar$execute(path = path,
                    extractors = extractors,
                    pipeline = pipeline,
                    cache = FALSE,
                    verbose = FALSE,
                    summary = TRUE)
## End(Not run)
Class to manage the connections with YouTube
Description
The Connections class establishes the connections with
the YouTube API and controls the number of requests
that have been made to it.
Details
The YouTube keys must be indicated through fields of the bdpar.Options variable:
[youtube]
- bdpar.Options$set("youtube.app.id", <<app_id>>)
- bdpar.Options$set("youtube.app.password", <<app_password>>)
Note
Fields of unused connections will be automatically ignored by the platform.
Methods
Public methods
Method new()
Creates a Connections object.
Usage
Connections$new()
Method startConnectionWithYoutube()
Function able to establish the connection with YouTube.
Usage
Connections$startConnectionWithYoutube()
Method addNumRequestToYoutube()
Function that increases by one the number of requests to YouTube.
Usage
Connections$addNumRequestToYoutube()
Method checkRequestToYoutube()
Handles the connection with YouTube.
Usage
Connections$checkRequestToYoutube()
Method getNumRequestMaxToYoutube()
Gets the maximum number of requests allowed by the YouTube API.
Usage
Connections$getNumRequestMaxToYoutube()
Returns
The maximum number of requests to YouTube.
Method clone()
The objects of this class are cloneable with this method.
Usage
Connections$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
Class to find and/or replace the contractions on the data field of an Instance
Description
ContractionPipe class is responsible for detecting
the existing contractions in the data field of each Instance.
Identified contractions are stored inside the contraction field of
Instance class. Moreover, if needed, it can perform inline
contraction replacement.
Details
ContractionPipe class requires the resource files (in json format)
containing the correspondence between contractions and meaning. To this end,
the language of the text indicated in the propertyLanguageName should
be contained in the resource file name (i.e. contr.xxx.json, where xxx is the
value defined in the propertyLanguageName). The location of the
resources should be defined in the "resources.contractions.path"
field of bdpar.Options variable.
Note
ContractionPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> ContractionPipe
Methods
Public methods
Inherited methods
Method new()
Creates a ContractionPipe object.
Usage
ContractionPipe$new(
propertyName = "contractions",
propertyLanguageName = "language",
alwaysBeforeDeps = list("GuessLanguagePipe"),
notAfterDeps = list(),
replaceContractions = TRUE,
resourcesContractionsPath = NULL
)
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- propertyLanguageName: A character value. Name of the language property.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
- replaceContractions: A logical value. Indicates if the contractions are replaced or not.
- resourcesContractionsPath: A character value. Path of resource files (in json format) containing the correspondence between contractions and meaning.
Method pipe()
Preprocesses the Instance to obtain/replace
the contractions. The contractions found in the data are added to the
list of properties of the Instance.
Usage
ContractionPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findContraction()
Checks if the contraction is in the data.
Usage
ContractionPipe$findContraction(data, contraction)
Arguments
Returns
A logical value depending on whether the
contraction is in the data.
Method replaceContraction()
Replaces the contraction in the data with the extendedContraction.
Usage
ContractionPipe$replaceContraction(contraction, extendedContraction, data)
Arguments
Returns
The data with the contractions replaced.
Method getPropertyLanguageName()
Gets the name of property language.
Usage
ContractionPipe$getPropertyLanguageName()
Returns
Value of name of property language.
Method getResourcesContractionsPath()
Gets the path of contractions resources.
Usage
ContractionPipe$getResourcesContractionsPath()
Returns
Value of path of contractions resources.
Method setResourcesContractionsPath()
Sets the path of contractions resources.
Usage
ContractionPipe$setResourcesContractionsPath(path)
Arguments
- path: A character value. The new value of the path of contractions resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
ContractionPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class implementing a default pipelining process.
Description
This DefaultPipeline class inherits from the
GenericPipeline class. Includes the execute method which
provides a default pipelining implementation.
Details
The default flow is:
instance %>|%
  TargetAssigningPipe$new() %>|%
  StoreFileExtPipe$new() %>|%
  GuessDatePipe$new() %>|%
  File2Pipe$new() %>|%
  MeasureLengthPipe$new(propertyName = "length_before_cleaning_text") %>|%
  FindUserNamePipe$new() %>|%
  FindHashtagPipe$new() %>|%
  FindUrlPipe$new() %>|%
  FindEmoticonPipe$new() %>|%
  FindEmojiPipe$new() %>|%
  GuessLanguagePipe$new() %>|%
  ContractionPipe$new() %>|%
  AbbreviationPipe$new() %>|%
  SlangPipe$new() %>|%
  ToLowerCasePipe$new() %>|%
  InterjectionPipe$new() %>|%
  StopWordPipe$new() %>|%
  MeasureLengthPipe$new(propertyName = "length_after_cleaning_text") %>|%
  TeeCSVPipe$new()
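To give an intuition of how such a chain can work, the sketch below defines a toy infix operator in base R. It is not the %>|% operator of bdpar: the operator name %then% and the two list-based "pipes" are hypothetical, and the only idea borrowed from the flow above is that the left-hand value is handed to the pipe function of the right-hand object.

```r
# Hypothetical infix operator, loosely modelled on the idea of a pipe chain:
# the left-hand value is handed to the pipe() function of the right-hand object.
`%then%` <- function(instance, pipe_object) {
  pipe_object$pipe(instance)
}

# Two toy "pipes" represented as plain lists holding a pipe() function.
to_lower <- list(pipe = function(instance) tolower(instance))
measure_length <- list(pipe = function(instance) nchar(instance))

# Custom infix operators are evaluated left to right.
result <- "Hello World" %then% to_lower %then% measure_length
print(result)
```

Because user-defined infix operators in R are left-associative, each step receives the output of the previous one, which is what makes a linear flow readable as a single expression.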
Inherit
This class inherits from GenericPipeline and implements the
execute abstract function.
Super class
bdpar::GenericPipeline -> DefaultPipeline
Methods
Public methods
Method new()
Creates a DefaultPipeline object.
Usage
DefaultPipeline$new()
Method execute()
Function that implements the flow of the
GenericPipes.
Usage
DefaultPipeline$execute(instance)
Arguments
Returns
The preprocessed Instance.
Method get()
Gets a list containing the set of
GenericPipes of the pipeline.
Usage
DefaultPipeline$get()
Returns
The set of GenericPipes contained in the pipeline.
Method print()
Prints pipeline representation. (Override print function)
Usage
DefaultPipeline$print(...)
Arguments
- ...: Further arguments passed to or from other methods.
Method toString()
Returns a character representing the pipeline
Usage
DefaultPipeline$toString()
Returns
DefaultPipeline character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
DefaultPipeline$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.log, Instance,
DynamicPipeline, GenericPipeline,
GenericPipe, %>|%
Class implementing a dynamic pipelining process
Description
This DynamicPipeline class inherits from the
GenericPipeline class. Includes the execute method
which provides a dynamic pipelining implementation.
Inherit
This class inherits from GenericPipeline and implements the
execute abstract function.
Super class
bdpar::GenericPipeline -> DynamicPipeline
Methods
Public methods
Method new()
Creates a DynamicPipeline object.
Usage
DynamicPipeline$new(pipeline = NULL)
Arguments
- pipeline: A list of GenericPipe objects. Initializes the flow of GenericPipe.
Method add()
Adds a GenericPipe or a
GenericPipe list to the pipeline.
Usage
DynamicPipeline$add(pipe, pos = NULL)
Arguments
- pipe: A GenericPipe object or a list of GenericPipe objects.
- pos: A (numeric) value. The value of the position to add. If it is NULL, GenericPipe is appended to the pipeline.
Method removeByPos()
Removes GenericPipes by the position on the
pipeline.
Usage
DynamicPipeline$removeByPos(pos)
Arguments
- pos: A (numeric) value. The value of the position to remove.
Method removeByPipe()
Removes GenericPipes by its name on the
pipeline.
Usage
DynamicPipeline$removeByPipe(pipe.name)
Arguments
- pipe.name: A (character) value. The GenericPipes name to remove.
Method removeAll()
Removes all GenericPipes included on pipeline.
Usage
DynamicPipeline$removeAll()
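The add and remove operations can be pictured on a plain R list. The sketch below is a toy model only: pipes are represented by their names, the helper functions are hypothetical, and the assumption that a new element ends up at index pos is an illustrative choice, not documented bdpar behavior.

```r
# Toy model of a dynamic pipeline: an ordered list of pipe names.
# The position semantics below are an assumption, not bdpar's documented behavior.
add_pipe <- function(pipeline, pipe, pos = NULL) {
  if (is.null(pos)) {
    return(append(pipeline, list(pipe)))         # no position: append at the end
  }
  append(pipeline, list(pipe), after = pos - 1)  # new element ends up at index pos
}

remove_by_pos <- function(pipeline, pos) pipeline[-pos]

remove_by_name <- function(pipeline, pipe_name) {
  Filter(function(pipe) !identical(pipe, pipe_name), pipeline)
}

pipeline <- list("TargetAssigningPipe", "File2Pipe")
pipeline <- add_pipe(pipeline, "StoreFileExtPipe", pos = 2)
pipeline <- add_pipe(pipeline, "ToLowerCasePipe")
pipeline <- remove_by_pos(pipeline, 4)
pipeline <- remove_by_name(pipeline, "File2Pipe")
print(unlist(pipeline))
```

With the real class, the equivalent calls would be DynamicPipeline$add(pipe, pos), DynamicPipeline$removeByPos(pos) and DynamicPipeline$removeByPipe(pipe.name), applied to GenericPipe objects rather than names.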
Method execute()
Function that implements the flow of the
GenericPipes.
Usage
DynamicPipeline$execute(instance)
Arguments
- instance: An (Instance) value. The Instance that is going to be processed.
Method get()
Gets a list containing the set of GenericPipes
of the pipeline.
Usage
DynamicPipeline$get()
Returns
The set of GenericPipes contained in the pipeline.
Method print()
Prints pipeline representation. (Override print function)
Usage
DynamicPipeline$print(...)
Arguments
- ...: Further arguments passed to or from other methods.
Method toString()
Returns a character representing the pipeline
Usage
DynamicPipeline$toString()
Returns
DynamicPipeline character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
DynamicPipeline$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.log, Instance,
DefaultPipeline, GenericPipeline,
GenericPipe, %>|%
Class to handle email files with eml extension
Description
This class inherits from the Instance class and
implements the functions of extracting the text and the date from an eml type
file.
Details
When the email is multipart, the part to choose is indicated
through the "extractorEML.mpaPartSelected"
field of the bdpar.Options variable.
Note
To be able to use this class it is necessary to have Python installed.
Inherit
This class inherits from Instance and implements the
obtainSource and obtainDate abstract functions.
Super class
bdpar::Instance -> ExtractorEml
Methods
Public methods
Inherited methods
- bdpar::Instance$addBanPipes()
- bdpar::Instance$addFlowPipes()
- bdpar::Instance$addProperties()
- bdpar::Instance$checkCompatibility()
- bdpar::Instance$getBanPipes()
- bdpar::Instance$getData()
- bdpar::Instance$getDate()
- bdpar::Instance$getFlowPipes()
- bdpar::Instance$getNamesOfProperties()
- bdpar::Instance$getPath()
- bdpar::Instance$getProperties()
- bdpar::Instance$getSource()
- bdpar::Instance$getSpecificProperty()
- bdpar::Instance$invalidate()
- bdpar::Instance$isInstanceValid()
- bdpar::Instance$isSpecificProperty()
- bdpar::Instance$setData()
- bdpar::Instance$setDate()
- bdpar::Instance$setProperties()
- bdpar::Instance$setSource()
- bdpar::Instance$setSpecificProperty()
Method new()
Creates an ExtractorEml object.
Usage
ExtractorEml$new(path, PartSelectedOnMPAlternative = NULL)
Arguments
- path: A character value. Path of the eml file.
- PartSelectedOnMPAlternative: A character value. Configuration to read the eml files. If it is NULL, checks whether it is defined in the "extractorEML.mpaPartSelected" field of the bdpar.Options variable.
Method obtainDate()
Obtains the date of the eml file. Calls the function read_emails and obtains the date of the file indicated in the path and then transforms it into the generic date format, that is "%a %b %d %H:%M:%S %Z %Y" (Example: "Thu May 02 06:52:36 UTC 2013").
Usage
ExtractorEml$obtainDate()
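The generic date format mentioned in the description of obtainDate can be tried out with base R alone. The sketch below is independent of bdpar and of read_emails; the timestamp is taken from the example string in the description, and switching the time locale to "C" is an assumption made so that weekday and month abbreviations come out in English.

```r
# Weekday and month abbreviations depend on the locale, so this sketch
# switches the time locale to "C" to get English names.
invisible(Sys.setlocale("LC_TIME", "C"))

# The generic date format described for obtainDate().
generic_format <- "%a %b %d %H:%M:%S %Z %Y"

# Assumed input: the timestamp used in the documentation's own example.
timestamp <- as.POSIXct("2013-05-02 06:52:36", tz = "UTC")
formatted <- format(timestamp, generic_format, tz = "UTC")
print(formatted)
```

Under these assumptions the printed string should match the example given in the description of obtainDate.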
Method obtainSource()
Obtains the source of the eml file. Calls the function read_emails and obtains the source of the file indicated in the path. In addition, it initializes the data with the initial source.
Usage
ExtractorEml$obtainSource()
Method getPartSelectedOnMPAlternative()
Gets the value of the PartSelectedOnMPAlternative variable.
Usage
ExtractorEml$getPartSelectedOnMPAlternative()
Returns
Value of PartSelectedOnMPAlternative variable.
Method setPartSelectedOnMPAlternative()
Sets the value of the PartSelectedOnMPAlternative variable.
Usage
ExtractorEml$setPartSelectedOnMPAlternative(PartSelectedOnMPAlternative)
Arguments
- PartSelectedOnMPAlternative: A character value. The new value of PartSelectedOnMPAlternative variable.
Method toString()
Returns a character representing the instance
Usage
ExtractorEml$toString()
Returns
Instance character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
ExtractorEml$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.Options, ExtractorSms,
ExtractorYtbid, Instance
Class to handle the creation of Instance types
Description
ExtractorFactory class builds the appropriate
Instance object according to the file extension. In the case
of not finding the registered extension, the default extractor will be used
if it has been previously configured.
Methods
Public methods
Method new()
Creates an ExtractorFactory object.
Usage
ExtractorFactory$new()
Method registerExtractor()
Adds an extractor to the list of extensions. If the extension is an empty string (""), the indicated extractor will be the default when there is no extractor associated with an extension.
Usage
ExtractorFactory$registerExtractor(extensions, extractor)
Arguments
- extensions: A character array. The names of the extension option.
- extractor: An Object value. The extractor of the new extension.
Method setExtractor()
Modifies the extractor of one extension.
Usage
ExtractorFactory$setExtractor(extension, extractor)
Arguments
- extension: A character value. The name of the extension option.
- extractor: An Object value. The value of the new extractor.
Method setDefaultExtractor()
Modifies the default extractor. Assign a NULL value to disable the default extractor.
Usage
ExtractorFactory$setDefaultExtractor(defaultExtractor)
Arguments
- defaultExtractor: An Object value. The value of the default extractor.
Method removeExtractor()
Removes a specific extractor through the extension.
Usage
ExtractorFactory$removeExtractor(extension)
Arguments
- extension: A character value. The name of the extension to remove.
Method getAllExtractors()
Gets the list of extractors.
Usage
ExtractorFactory$getAllExtractors()
Returns
Value of extractors.
Method getDefaultExtractor()
Gets the default extractor.
Usage
ExtractorFactory$getDefaultExtractor()
Returns
Value of default extractor.
Method isSpecificExtractor()
Checks if an extractor exists for a specific extension.
Usage
ExtractorFactory$isSpecificExtractor(extension)
Arguments
- extension: A character value. The name of the extension to check.
Returns
Value of extractors.
Method createInstance()
Builds the Instance object according to the
file extension. In the case of not finding the registered extension, the
default extractor will be used if it has been previously configured.
Usage
ExtractorFactory$createInstance(path)
Arguments
Returns
The Instance corresponding object according to the
file extension.
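The extension-based selection with a default fallback can be sketched with a plain named list. Everything below is illustrative: the registry holds extractor names instead of classes, and choose_extractor is a hypothetical helper rather than the createInstance method of bdpar. The three extension mappings follow the class descriptions in this manual (eml, tsms, ytbid).

```r
# Toy registry mapping file extensions to extractor names.
extractors <- list(eml = "ExtractorEml",
                   tsms = "ExtractorSms",
                   ytbid = "ExtractorYtbid")

# Hypothetical helper (not bdpar's createInstance): returns the extractor
# registered for the extension of `path`, otherwise the default (may be NULL).
choose_extractor <- function(path, extractors, default_extractor = NULL) {
  extension <- tools::file_ext(path)
  if (nzchar(extension) && extension %in% names(extractors)) {
    return(extractors[[extension]])
  }
  default_extractor
}

chosen_eml <- choose_extractor("mail/message.eml", extractors)
chosen_txt <- choose_extractor("notes/readme.txt", extractors,
                               default_extractor = "ExtractorSms")
chosen_none <- choose_extractor("notes/readme.txt", extractors)
print(chosen_eml)
print(chosen_txt)
print(is.null(chosen_none))
```

The default is kept as a separate argument here because an empty string cannot be used to look up a list element by name in R.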
Method reset()
Resets the list of extractors to the default state.
Usage
ExtractorFactory$reset()
Method print()
Prints pipeline representation. (Override print function)
Usage
ExtractorFactory$print(...)
Arguments
- ...: Further arguments passed to or from other methods.
Method clone()
The objects of this class are cloneable with this method.
Usage
ExtractorFactory$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
ExtractorEml, ExtractorSms,
Instance
Class to handle SMS files with tsms extension
Description
This class inherits from the Instance class and
implements the functions of extracting the text and the date of a tsms type file.
Details
Due to the fact that the creation date of the message can not be extracted from the text of an SMS, the date will be initialized to empty.
Inherit
This class inherits from Instance and implements the
obtainSource and obtainDate abstract functions.
Super class
bdpar::Instance -> ExtractorSms
Methods
Public methods
Inherited methods
- bdpar::Instance$addBanPipes()
- bdpar::Instance$addFlowPipes()
- bdpar::Instance$addProperties()
- bdpar::Instance$checkCompatibility()
- bdpar::Instance$getBanPipes()
- bdpar::Instance$getData()
- bdpar::Instance$getDate()
- bdpar::Instance$getFlowPipes()
- bdpar::Instance$getNamesOfProperties()
- bdpar::Instance$getPath()
- bdpar::Instance$getProperties()
- bdpar::Instance$getSource()
- bdpar::Instance$getSpecificProperty()
- bdpar::Instance$invalidate()
- bdpar::Instance$isInstanceValid()
- bdpar::Instance$isSpecificProperty()
- bdpar::Instance$setData()
- bdpar::Instance$setDate()
- bdpar::Instance$setProperties()
- bdpar::Instance$setSource()
- bdpar::Instance$setSpecificProperty()
Method new()
Creates an ExtractorSms object.
Usage
ExtractorSms$new(path)
Arguments
- path: A character value. Path of the tsms file.
Method obtainDate()
Obtains the date of the SMS file.
Usage
ExtractorSms$obtainDate()
Method obtainSource()
Obtains the source of the SMS file. Reads the file indicated in the path. In addition, it initializes the data field with the initial source.
Usage
ExtractorSms$obtainSource()
Method toString()
Returns a character representing the instance
Usage
ExtractorSms$toString()
Returns
Instance character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
ExtractorSms$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
ExtractorEml, ExtractorYtbid,
Instance
Class to handle comments of YouTube files with ytbid extension
Description
This class inherits from the Instance class and
implements the functions of extracting the text and the date of a ytbid type file.
Details
YouTube connection is handled through the Connections class
which loads the YouTube API credentials from the bdpar.Options object.
Additionally, to increase the processing speed, each YouTube query is stored
in a cache to avoid the execution of duplicated queries. To enable this option,
the cache location should be set in the "cache.youtube.path" field of the
bdpar.Options variable. This variable has to be the
path where the comments are stored, and it must contain two folders named
"_spam_" and "_ham_".
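A cache location with the required layout can be prepared with base R before running the pipeline. This is a minimal sketch: the folder name youtube_cache inside the session's temporary directory is an assumed example location, and the final call to bdpar.Options is left as a comment so the snippet runs without the package.

```r
# Assumed location: a folder inside the session's temporary directory.
cache_path <- file.path(tempdir(), "youtube_cache")

# Create the two folders required inside the cache path.
for (folder in c("_spam_", "_ham_")) {
  dir.create(file.path(cache_path, folder),
             recursive = TRUE, showWarnings = FALSE)
}

print(list.dirs(cache_path, full.names = FALSE, recursive = FALSE))

# With bdpar loaded, the location could then be registered with:
# bdpar.Options$set("cache.youtube.path", cache_path)
```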
Inherit
This class inherits from Instance and implements the
obtainSource and obtainDate abstract functions.
Super class
bdpar::Instance -> ExtractorYtbid
Methods
Public methods
Inherited methods
- bdpar::Instance$addBanPipes()
- bdpar::Instance$addFlowPipes()
- bdpar::Instance$addProperties()
- bdpar::Instance$checkCompatibility()
- bdpar::Instance$getBanPipes()
- bdpar::Instance$getData()
- bdpar::Instance$getDate()
- bdpar::Instance$getFlowPipes()
- bdpar::Instance$getNamesOfProperties()
- bdpar::Instance$getPath()
- bdpar::Instance$getProperties()
- bdpar::Instance$getSource()
- bdpar::Instance$getSpecificProperty()
- bdpar::Instance$invalidate()
- bdpar::Instance$isInstanceValid()
- bdpar::Instance$isSpecificProperty()
- bdpar::Instance$setData()
- bdpar::Instance$setDate()
- bdpar::Instance$setProperties()
- bdpar::Instance$setSource()
- bdpar::Instance$setSpecificProperty()
Method new()
Creates an ExtractorYtbid object.
Usage
ExtractorYtbid$new(path, cachePath = NULL)
Arguments
- path: A character value. Path of the ytbid file.
- cachePath: A character value. Path of the cache location. If it is NULL, checks whether it is defined in the "cache.youtube.path" field of the bdpar.Options variable.
Method obtainId()
Obtains the ID of the specific YouTube comment. Reads the ID from the file indicated in the variable path.
Usage
ExtractorYtbid$obtainId()
Method getId()
Gets the ID of a specific YouTube comment.
Usage
ExtractorYtbid$getId()
Returns
Value of the YouTube comment ID.
Method obtainDate()
Obtains the date from a specific comment ID. If the comment has been previously cached, the comment date is loaded from the cache path. Otherwise, the request is performed using the YouTube API and the date is then formatted to the established standard.
Usage
ExtractorYtbid$obtainDate()
Method obtainSource()
Obtains the source from a specific comment ID. If the comment has previously been cached, the source is loaded from the cache path. Otherwise, the request is performed using the YouTube API.
Usage
ExtractorYtbid$obtainSource()
Method toString()
Returns a character representing the instance
Usage
ExtractorYtbid$toString()
Returns
Instance character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
ExtractorYtbid$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
bdpar.Options, Connections,
ExtractorEml, ExtractorSms,
Instance
Class to obtain the source field of an Instance
Description
Obtains the source using the method implemented by the
subclass of Instance.
Note
File2Pipe will automatically invalidate the
Instance whenever the obtained source is empty or not in UTF-8 format.
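The two conditions in this note (empty source, source not in UTF-8) can be illustrated with base R. The function is_usable_source is a hypothetical helper and not the check implemented by File2Pipe; it only shows one way such a test could be written with nzchar and validUTF8.

```r
# Hypothetical check (not bdpar's implementation): a source is usable when it
# is a single non-empty string that is valid UTF-8.
is_usable_source <- function(source) {
  is.character(source) && length(source) == 1 &&
    nzchar(source) && validUTF8(source)
}

valid_text <- "Hello, world"
# The byte 0xff never occurs in well-formed UTF-8.
invalid_text <- rawToChar(as.raw(c(0x66, 0xff, 0x6f)))

results <- c(is_usable_source(valid_text),
             is_usable_source(""),
             is_usable_source(invalid_text))
print(results)
```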
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> File2Pipe
Methods
Public methods
Inherited methods
Method new()
Creates a File2Pipe object.
Usage
File2Pipe$new(
propertyName = "source",
alwaysBeforeDeps = list("TargetAssigningPipe"),
notAfterDeps = list()
)
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to obtain the
source.
Usage
File2Pipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method clone()
The objects of this class are cloneable with this method.
Usage
File2Pipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or replace the emoji on the data field of an Instance
Description
This class is responsible for detecting the existing emojis in the
data field of each Instance. Identified emojis are
stored inside the emoji field of Instance class.
Moreover, if required, it can perform inline emoji replacement.
Details
FindEmojiPipe uses the emoji list provided by data(emojisData).
Note
FindEmojiPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> FindEmojiPipe
Methods
Public methods
Inherited methods
Method new()
Creates a FindEmojiPipe object.
Usage
FindEmojiPipe$new( propertyName = "Emojis", alwaysBeforeDeps = list(), notAfterDeps = list(), replaceEmojis = TRUE )
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
- replaceEmojis: A logical value. Indicates if the emojis are replaced.
- propertyLanguageName: A character value. Name of the language property.
Method pipe()
Preprocesses the Instance to obtain/replace
the emojis. The emojis found in the data are added to the
list of properties of the Instance.
Usage
FindEmojiPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findEmoji()
Checks if the emoji is in the data.
Usage
FindEmojiPipe$findEmoji(data, emoji)
Arguments
Returns
A logical value depending on whether the
emoji is in the data.
Method replaceEmoji()
Replaces the emoji in the data with the extendedEmoji.
Usage
FindEmojiPipe$replaceEmoji(emoji, extendedEmoji, data)
Arguments
Returns
The data with the emojis replaced.
Method clone()
The objects of this class are cloneable with this method.
Usage
FindEmojiPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or remove the emoticons on the data field of an Instance
Description
This class is responsible for detecting the emoticons present in the
data field of each Instance. Identified emoticons are
stored in the emoticon field of the Instance class.
If required, it can also remove emoticons inline.
Details
The regular expression indicated in the emoticonPattern
variable is used to identify emoticons.
Note
FindEmoticonPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> FindEmoticonPipe
Public fields
emoticonPattern: A character value. The regular expression to detect emoticons.
Methods
Public methods
Inherited methods
Method new()
Creates a FindEmoticonPipe object.
Usage
FindEmoticonPipe$new(
propertyName = "emoticon",
alwaysBeforeDeps = list(),
notAfterDeps = list("FindHashtagPipe"),
removeEmoticons = TRUE
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
removeEmoticons: A logical value. Indicates if the emoticons are removed.
propertyLanguageName: A character value. Name of the language property.
Method pipe()
Preprocesses the Instance to obtain/remove
the emoticons. The emoticons found in the data are added to the
list of properties of the Instance.
Usage
FindEmoticonPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findEmoticon()
Finds the emoticons in the data.
Usage
FindEmoticonPipe$findEmoticon(data)
Arguments
data: A character value. The text in which to search for emoticons.
Returns
The list with emoticons found.
Method removeEmoticon()
Removes the emoticons in the data.
Usage
FindEmoticonPipe$removeEmoticon(data)
Arguments
data: A character value. The text from which emoticons will be removed.
Returns
The data with the emoticons removed.
Method clone()
The objects of this class are cloneable with this method.
Usage
FindEmoticonPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or remove the hashtags on the data field of an Instance
Description
This class is responsible for detecting the hashtags present in the
data field of each Instance. Identified hashtags are
stored in the hashtag field of the Instance class.
If required, it can also remove hashtags inline.
Details
The regular expression indicated in the hashtagPattern
variable is used to identify hashtags.
Note
FindHashtagPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> FindHashtagPipe
Public fields
hashtagPattern: A character value. The regular expression to detect hashtags.
Methods
Public methods
Inherited methods
Method new()
Creates a FindHashtagPipe object.
Usage
FindHashtagPipe$new( propertyName = "hashtag", alwaysBeforeDeps = list(), notAfterDeps = list(), removeHashtags = TRUE )
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
removeHashtags: A logical value. Indicates if the hashtags are removed.
propertyLanguageName: A character value. Name of the language property.
Method pipe()
Preprocesses the Instance to obtain/remove
the hashtags. The hashtags found in the data are added to the
list of properties of the Instance.
Usage
FindHashtagPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findHashtag()
Finds the hashtags in the data.
Usage
FindHashtagPipe$findHashtag(data)
Arguments
data: A character value. The text in which to search for hashtags.
Returns
The list with hashtags found.
Method removeHashtag()
Removes the hashtags in the data.
Usage
FindHashtagPipe$removeHashtag(data)
Arguments
data: A character value. The text from which hashtags will be removed.
Returns
The data with the hashtags removed.
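The find/remove pair can be approximated in base R. This is only a minimal sketch and not the package's implementation: the regular expression and the helper names find_hashtags and remove_hashtags are assumptions made for illustration, and the actual hashtagPattern field may differ.

```r
# Hypothetical pattern; the real hashtagPattern used by bdpar may differ.
hashtag_pattern <- "#[[:alnum:]_]+"

# Return every hashtag found in a single character value.
find_hashtags <- function(data) {
  regmatches(data, gregexpr(hashtag_pattern, data))[[1]]
}

# Delete hashtags and collapse the whitespace left behind.
remove_hashtags <- function(data) {
  cleaned <- gsub(hashtag_pattern, " ", data)
  trimws(gsub("[[:space:]]+", " ", cleaned))
}

text <- "Loving the #rstats community #bdpar"
tags <- find_hashtags(text)
clean <- remove_hashtags(text)
```

If the cleaned text ended up empty, that would correspond to the case in which the Note above says the Instance is invalidated.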
Method clone()
The objects of this class are cloneable with this method.
Usage
FindHashtagPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or remove the URLs on the data field of an Instance
Description
This class is responsible for detecting the URLs present in the
data field of each Instance. Identified URLs are
stored in the URLs field of the Instance class.
If required, it can also remove URLs inline.
Details
The regular expressions indicated in the URLPatterns
variable are used to identify URLs.
Note
FindUrlPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> FindUrlPipe
Public fields
Methods
Public methods
Inherited methods
Method new()
Creates a FindUrlPipe object.
Usage
FindUrlPipe$new(
propertyName = "URLs",
alwaysBeforeDeps = list(),
notAfterDeps = list("FindUrlPipe"),
removeUrls = TRUE,
URLPatterns = list(self$URLPattern, self$EmailPattern),
namesURLPatterns = list("UrlPattern", "EmailPattern")
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
removeUrls: A logical value. Indicates if the URLs are removed.
URLPatterns: A list value. The regular expressions used to find URLs.
namesURLPatterns: A list value. The names of the regular expressions.
propertyLanguageName: A character value. Name of the language property.
Method pipe()
Preprocesses the Instance to obtain/remove
the URLs. The URLs found in the data are added to the
list of properties of the Instance.
Usage
FindUrlPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findUrl()
Finds the URLs in the data.
Usage
FindUrlPipe$findUrl(pattern, data)
Arguments
Returns
The list with URLs found.
Method removeUrl()
Removes the URL in the data.
Usage
FindUrlPipe$removeUrl(pattern, data)
Arguments
Returns
The data with URLs removed.
Method putNamesURLPattern()
Sets the names to URL patterns result.
Usage
FindUrlPipe$putNamesURLPattern(resultOfURLPatterns)
Arguments
resultOfURLPatterns: A list value. The list with URLs found.
Returns
The URLs found with the names of URL pattern.
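The idea of running several patterns and labelling each result with its pattern name can be sketched in base R. The two regular expressions and the helper find_matches are hypothetical stand-ins chosen for illustration; they are not the package's URLPattern or EmailPattern.

```r
# Hypothetical patterns for illustration only.
url_patterns <- list("https?://[^[:space:]]+", "[[:alnum:]._-]+@[[:alnum:].-]+")
names_url_patterns <- list("UrlPattern", "EmailPattern")

# Return all matches of one pattern in one character value.
find_matches <- function(pattern, data) {
  regmatches(data, gregexpr(pattern, data))[[1]]
}

text <- "Mail me at ana@example.org or visit https://example.org/docs"

# One result per pattern, then attach the pattern names to the results.
result <- lapply(url_patterns, find_matches, data = text)
names(result) <- unlist(names_url_patterns)
```

Keeping the names in a separate list, as the constructor does, lets the patterns and their labels be replaced independently through the setters.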
Method getURLPatterns()
Gets the URL patterns.
Usage
FindUrlPipe$getURLPatterns()
Returns
Value of URL patterns.
Method setURLPatterns()
Sets the URL patterns.
Usage
FindUrlPipe$setURLPatterns(URLPatterns)
Arguments
URLPatterns: A list value. The new value of the URL patterns.
Method getNamesURLPatterns()
Gets the names of URLs.
Usage
FindUrlPipe$getNamesURLPatterns()
Returns
Value of names of URLs.
Method setNamesURLPatterns()
Sets the names of URLs.
Usage
FindUrlPipe$setNamesURLPatterns(namesURLPatterns)
Arguments
namesURLPatterns: A list value. The new value of the names of URLs.
Method clone()
The objects of this class are cloneable with this method.
Usage
FindUrlPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or remove the users on the data field of an Instance
Description
This class is responsible for detecting the user names present in the
data field of each Instance. Identified user names are
stored in the userName field of the Instance class.
If required, it can also remove user names inline.
Details
The regular expressions indicated in the userPattern
variable are used to identify user names.
Note
FindUserNamePipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> FindUserNamePipe
Public fields
userPattern: A character value. The regular expression to detect user names.
Methods
Public methods
Inherited methods
Method new()
Creates a FindUserNamePipe object.
Usage
FindUserNamePipe$new( propertyName = "userName", alwaysBeforeDeps = list(), notAfterDeps = list(), removeUser = TRUE )
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
removeUser: A logical value. Indicates if the user names are removed.
propertyLanguageName: A character value. Name of the language property.
Method pipe()
Preprocesses the Instance to obtain/remove
the user names. The user names found in the data are added to the
list of properties of the Instance.
Usage
FindUserNamePipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findUserName()
Finds the user names in the data.
Usage
FindUserNamePipe$findUserName(data)
Arguments
data: A character value. The text in which to search for user names.
Returns
The list with user names found.
Method removeUserName()
Removes the user names in the data.
Usage
FindUserNamePipe$removeUserName(data)
Arguments
data: A character value. The text from which user names will be removed.
Returns
The data with the user names removed.
Method clone()
The objects of this class are cloneable with this method.
Usage
FindUserNamePipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Abstract super class that handles the management of the Pipes
Description
Provides the required methods to successfully handle each
GenericPipe class.
Methods
Public methods
Method new()
Creates a GenericPipe object.
Usage
GenericPipe$new(propertyName, alwaysBeforeDeps, notAfterDeps)
Arguments
Method pipe()
Abstract method to preprocess the Instance.
Usage
GenericPipe$pipe(instance)
Arguments
Returns
The preprocessed Instance.
Method getPropertyName()
Gets the name of the property.
Usage
GenericPipe$getPropertyName()
Returns
Value of name of property.
Method getAlwaysBeforeDeps()
Gets the alwaysBefore dependencies.
Usage
GenericPipe$getAlwaysBeforeDeps()
Returns
Value of dependencies always before.
Method getNotAfterDeps()
Gets the notAfter dependencies.
Usage
GenericPipe$getNotAfterDeps()
Returns
Value of dependencies not after.
Method setPropertyName()
Changes the value of property's name.
Usage
GenericPipe$setPropertyName(propertyName)
Arguments
propertyName: A character value. The new value of the property's name.
Method setAlwaysBeforeDeps()
Changes the value of dependencies always before.
Usage
GenericPipe$setAlwaysBeforeDeps(alwaysBeforeDeps)
Arguments
alwaysBeforeDeps: A list value. The new value of the dependencies always before.
Method setNotAfterDeps()
Changes the value of dependencies not after.
Usage
GenericPipe$setNotAfterDeps(notAfterDeps)
Arguments
notAfterDeps: A list value. The new value of the dependencies not after.
Method hash()
Generates an identifier for the pipe based on its fields.
Usage
GenericPipe$hash(algo = "md5")
Arguments
algo: Algorithm to be applied. Options: "md5", "sha1", "crc32", "sha256", "sha512", "xxhash32", "xxhash64", "murmur32", "spookyhash"
Method clone()
The objects of this class are cloneable with this method.
Usage
GenericPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.log,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Abstract super class implementing the pipelining process
Description
Abstract super class to establish the flow of Pipes.
Methods
Public methods
Method new()
Creates a GenericPipeline object.
Usage
GenericPipeline$new()
Method execute()
Function that implements the flow of the
GenericPipes.
Usage
GenericPipeline$execute(instance)
Arguments
Returns
The preprocessed Instance.
Method get()
Gets a list containing the set of GenericPipes
of the pipeline.
Usage
GenericPipeline$get()
Returns
The set of GenericPipes contained in the pipeline.
Method toString()
Returns a character representing the pipeline.
Usage
GenericPipeline$toString()
Details
This function provides a place to define a character
representation of the structure of a pipeline.
Returns
GenericPipeline character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
GenericPipeline$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
bdpar.log, DefaultPipeline,
DynamicPipeline, Instance,
GenericPipe, %>|%
Class to obtain the date field of an Instance
Description
Obtains the date using the method implemented by the
subclass of Instance.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> GuessDatePipe
Methods
Public methods
Inherited methods
Method new()
Creates a GuessDatePipe object.
Usage
GuessDatePipe$new(
propertyName = "date",
alwaysBeforeDeps = list("TargetAssigningPipe"),
notAfterDeps = list()
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to obtain the date.
Usage
GuessDatePipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method clone()
The objects of this class are cloneable with this method.
Usage
GuessDatePipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to guess the language of an Instance
Description
This class guesses the language using the language detector of the cld2 library. It creates the language property, which indicates the language of the text.
Note
The Pipe will invalidate the Instance if the language of the data
cannot be detected.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> GuessLanguagePipe
Methods
Public methods
Inherited methods
Method new()
Creates a GuessLanguagePipe object.
Usage
GuessLanguagePipe$new(
propertyName = "language",
alwaysBeforeDeps = list("StoreFileExtPipe", "TargetAssigningPipe"),
notAfterDeps = list()
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to obtain the
language of the data.
Usage
GuessLanguagePipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method getLanguage()
Guesses the language of data.
Usage
GuessLanguagePipe$getLanguage(data)
Arguments
data: A character value. The text whose language is guessed.
Returns
The guessed language. Format: see ISO 639-3:2007.
Method clone()
The objects of this class are cloneable with this method.
Usage
GuessLanguagePipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
SlangPipe, StopWordPipe,
StoreFileExtPipe, TargetAssigningPipe,
TeeCSVPipe, ToLowerCasePipe
Abstract super class that handles the management of the Instances
Description
Provides the required methods to successfully handle each
Instance class.
Methods
Public methods
Method new()
Creates an Instance object.
Usage
Instance$new(path)
Arguments
path: A character value. Path of the file.
Method obtainDate()
Abstract function responsible for obtaining the date of the
Instance.
Usage
Instance$obtainDate()
Method obtainSource()
Abstract function responsible for determining the source of
the Instance.
Usage
Instance$obtainSource()
Method getDate()
Gets the date.
Usage
Instance$getDate()
Returns
Value of date.
Method getSource()
Gets the source.
Usage
Instance$getSource()
Returns
Value of source.
Method getPath()
Gets the path.
Usage
Instance$getPath()
Returns
Value of path.
Method getData()
Gets the data.
Usage
Instance$getData()
Returns
Value of data.
Method getProperties()
Gets the properties
Usage
Instance$getProperties()
Returns
Value of properties.
Method setSource()
Modifies the source value.
Usage
Instance$setSource(source)
Arguments
source: A character value. The new value of source.
Method setData()
Modifies the data value.
Usage
Instance$setData(data)
Arguments
data: A character value. The new value of data.
Method setDate()
Modifies the date value.
Usage
Instance$setDate(date)
Arguments
date: A character value. The new value of date.
Method setProperties()
Modifies the properties value.
Usage
Instance$setProperties(properties)
Arguments
properties: A list value. The new list of properties.
Method addProperties()
Adds a property to the list of the properties.
Usage
Instance$addProperties(propertyValue, propertyName)
Arguments
propertyValue: An Object value. The value of the new property.
propertyName: A character value. The name of the new property.
Method getSpecificProperty()
Obtains a specific property.
Usage
Instance$getSpecificProperty(propertyName)
Arguments
propertyName: A character value. The name of the property to obtain.
Returns
The value of the specific property.
Method isSpecificProperty()
Checks for the existence of a specific property.
Usage
Instance$isSpecificProperty(propertyName)
Arguments
propertyName: A character value. The name of the property to check.
Returns
A logical value indicating whether the specific property exists in the list of properties.
Method setSpecificProperty()
Modifies the value of a single property.
Usage
Instance$setSpecificProperty(propertyName, propertyValue)
Arguments
propertyName: A character value. The name of the property.
propertyValue: An Object value. The new value of the property.
Method getNamesOfProperties()
Gets the names of all properties.
Usage
Instance$getNamesOfProperties()
Returns
The names of properties.
Method isInstanceValid()
Checks if the Instance is valid.
Usage
Instance$isInstanceValid()
Returns
Value of isValid flag.
Method invalidate()
Forces the invalidation of a specific Instance.
Usage
Instance$invalidate()
Method getFlowPipes()
Gets the list of the flow of GenericPipe.
Usage
Instance$getFlowPipes()
Returns
Names of the GenericPipe used.
Method addFlowPipes()
Adds the name of a GenericPipe to the list of the flow of GenericPipe.
Usage
Instance$addFlowPipes(namePipe)
Arguments
namePipe: A character value. Name of the new GenericPipe to be added to the GenericPipeline.
Method getBanPipes()
Gets an array containing all the banned
GenericPipes.
Usage
Instance$getBanPipes()
Returns
Value of ban GenericPipe array.
Method addBanPipes()
Adds the name of the Pipe to the array that keeps track
of the GenericPipes that have run-after restrictions.
Usage
Instance$addBanPipes(namePipe)
Arguments
namePipe: A character value. GenericPipe name to be introduced into the ban array.
Method checkCompatibility()
Checks compatibility between GenericPipes.
Usage
Instance$checkCompatibility(namePipe, alwaysBefore)
Arguments
namePipe: A character value. The name of the GenericPipe whose compatibility is checked.
alwaysBefore: A list value. GenericPipes that the Instance had to go through.
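One plausible reading of this check, given the flow and ban lists described above, is sketched below in base R. The function check_compatibility and its explicit flow_pipes and ban_pipes arguments are assumptions made for illustration; in the package these lists are kept inside the Instance, and the real logic may differ.

```r
# A pipe is treated as compatible when every alwaysBefore dependency has
# already run and the pipe itself has not been banned by an earlier pipe.
check_compatibility <- function(name_pipe, always_before, flow_pipes, ban_pipes) {
  all_deps_ran <- all(unlist(always_before) %in% flow_pipes)
  not_banned <- !(name_pipe %in% ban_pipes)
  all_deps_ran && not_banned
}

# Pipes the Instance has already gone through (hypothetical state).
flow_pipes <- c("TargetAssigningPipe", "StoreFileExtPipe")
# Pipes banned by the notAfterDeps of an earlier pipe (hypothetical state).
ban_pipes <- c("FindHashtagPipe")

ok_language <- check_compatibility(
  "GuessLanguagePipe",
  list("StoreFileExtPipe", "TargetAssigningPipe"),
  flow_pipes, ban_pipes
)
ok_hashtag <- check_compatibility("FindHashtagPipe", list(), flow_pipes, ban_pipes)
ok_interjection <- check_compatibility(
  "InterjectionPipe", list("GuessLanguagePipe"), flow_pipes, ban_pipes
)
```

Here only the first call passes: the second pipe is in the ban list, and the third is missing its GuessLanguagePipe dependency.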
Method toString()
Returns a character representing the instance
Usage
Instance$toString()
Returns
Instance character representation
Method clone()
The objects of this class are cloneable with this method.
Usage
Instance$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
ExtractorEml, ExtractorSms,
ExtractorYtbid
Class to find and/or remove the interjections on the data field of an Instance
Description
InterjectionPipe class is responsible for detecting
the existing interjections in the data field of each Instance.
Identified interjections are stored inside the interjection field of
Instance class. If needed, it can also remove
interjections inline.
Details
InterjectionPipe class requires the resource files (in json format)
containing the list of interjections. To this end, the language of the text
indicated in the propertyLanguageName should be contained in the
resource file name (i.e., interj.xxx.json, where xxx is the value defined in the
propertyLanguageName). The location of the resources should be
defined in the "resources.interjections.path" field of
bdpar.Options variable.
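The naming convention can be illustrated with a small base R helper. The function interjection_resource, the directory name, and the language value "en" are hypothetical; only the interj.xxx.json file name shape comes from the description above.

```r
# Build the expected resource file path for a given language value.
interjection_resource <- function(resources_path, language) {
  file.path(resources_path, paste0("interj.", language, ".json"))
}

# Hypothetical directory and language value.
path <- interjection_resource("resources/interjections", "en")
```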
Note
InterjectionPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> InterjectionPipe
Methods
Public methods
Inherited methods
Method new()
Creates an InterjectionPipe object.
Usage
InterjectionPipe$new(
propertyName = "interjection",
propertyLanguageName = "language",
alwaysBeforeDeps = list("GuessLanguagePipe"),
notAfterDeps = list(),
removeInterjections = TRUE,
resourcesInterjectionsPath = NULL
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
propertyLanguageName: A character value. Name of the language property.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
removeInterjections: A logical value. Indicates if the interjections are removed or not.
resourcesInterjectionsPath: A character value. Path of resource files (in json format) containing the interjections.
Method pipe()
Preprocesses the Instance to obtain/remove
the interjections. The interjections found in the data are added to the
list of properties of the Instance.
Usage
InterjectionPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findInterjection()
Checks if the interjection is in the data.
Usage
InterjectionPipe$findInterjection(data, interjection)
Arguments
Returns
A logical value depending on whether the
interjection is in the data.
Method removeInterjection()
Removes the interjection in the data.
Usage
InterjectionPipe$removeInterjection(interjection, data)
Arguments
Returns
The data with the interjections removed.
Method getPropertyLanguageName()
Gets the name of property language.
Usage
InterjectionPipe$getPropertyLanguageName()
Returns
Value of name of property language.
Method getResourcesInterjectionsPath()
Gets the path of interjections resources.
Usage
InterjectionPipe$getResourcesInterjectionsPath()
Returns
Value of path of interjections resources.
Method setResourcesInterjectionsPath()
Sets the path of interjections resources.
Usage
InterjectionPipe$setResourcesInterjectionsPath(path)
Arguments
path: A character value. The new value of the path of interjections resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
InterjectionPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to obtain the length of the data field of an Instance
Description
This class is responsible for obtaining the length of the data
field of each Instance. It creates the length property,
which indicates the length of the text. The property's name can be customized
through the class constructor.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> MeasureLengthPipe
Methods
Public methods
Inherited methods
Method new()
Creates a MeasureLengthPipe object.
Usage
MeasureLengthPipe$new( propertyName = "length", alwaysBeforeDeps = list(), notAfterDeps = list(), nchar_conf = TRUE )
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
nchar_conf: A logical value. Indicates if the pipe uses nchar or object.size.
Method pipe()
Preprocesses the Instance to obtain the
length of data.
Usage
MeasureLengthPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method getLength()
Obtains the length of the data.
Usage
MeasureLengthPipe$getLength(data, nchar_conf = TRUE)
Arguments
Returns
The length of the data.
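The choice between nchar and object.size controlled by nchar_conf can be sketched in base R as follows. The helper get_length is an illustrative assumption, not the package's code: it counts characters when nchar_conf is TRUE and otherwise reports the memory footprint of the R object in bytes.

```r
# Measure a character value either in characters or in bytes of memory.
get_length <- function(data, nchar_conf = TRUE) {
  if (nchar_conf) {
    nchar(data)
  } else {
    as.numeric(utils::object.size(data))
  }
}

n_chars <- get_length("hello")
n_bytes <- get_length("hello", nchar_conf = FALSE)
```

The two measures are not interchangeable: object.size includes the overhead of the R object itself, so it is larger than the character count even for short strings.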
Method clone()
The objects of this class are cloneable with this method.
Usage
MeasureLengthPipe$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
GenericPipe, ResourceHandler,
SlangPipe, StopWordPipe,
StoreFileExtPipe, TargetAssigningPipe,
TeeCSVPipe, ToLowerCasePipe
Class that handles different types of resources
Description
Class that handles different types of resources.
Details
This class stores the resources needed by the
GenericPipes so that they do not have to be read repeatedly from
file. File resources of type json are read and stored in memory.
Methods
Public methods
Method new()
Creates a ResourceHandler object.
Usage
ResourceHandler$new()
Method isLoadResource()
Checks, from the resource path, whether the resource has already been loaded. If so, the list for the requested resource is returned. Otherwise, the resource is added to the list of resources and then returned. If the resource file does not exist, NULL is returned.
Usage
ResourceHandler$isLoadResource(pathResource)
Arguments
pathResource: A character value. The resource file path.
Returns
The list of resources, if the resource file exists.
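The load-once behaviour described for isLoadResource can be sketched with an environment used as a cache. Everything named here (resource_cache, load_resource, the reader argument) is hypothetical, and readLines stands in for the JSON reader so that the sketch depends only on base R.

```r
# Environment used as an in-memory cache keyed by file path.
resource_cache <- new.env()

# Return the cached resource, load it on first use, or NULL if the file is absent.
load_resource <- function(path_resource, reader = readLines) {
  if (exists(path_resource, envir = resource_cache, inherits = FALSE)) {
    return(get(path_resource, envir = resource_cache))
  }
  if (!file.exists(path_resource)) {
    return(NULL)
  }
  assign(path_resource, reader(path_resource), envir = resource_cache)
  get(path_resource, envir = resource_cache)
}

tmp <- tempfile(fileext = ".txt")
writeLines(c("oh", "wow"), tmp)
first <- load_resource(tmp)    # read from disk and cached
unlink(tmp)
second <- load_resource(tmp)   # served from the cache, the file is gone
absent <- load_resource(file.path(tempdir(), "does-not-exist.json"))
```

Because the cache is consulted before the file system, the second call still succeeds after the file has been deleted, which is the point of keeping resources in memory.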
Method getResources()
Gets the resources variable.
Usage
ResourceHandler$getResources()
Returns
The value of resources variable.
Method setResources()
Sets the resources variable.
Usage
ResourceHandler$setResources(resources)
Arguments
resources: The new value of resources.
Method getNamesResources()
Gets the names of resources.
Usage
ResourceHandler$getNamesResources()
Returns
Value of names of resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
ResourceHandler$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
Class to find and/or replace the slangs on the data field of an Instance
Description
SlangPipe class is responsible for detecting
the existing slangs in the data field of each Instance.
Identified slangs are stored inside the slang field of
Instance class. If needed, it can also replace
slangs inline.
Details
SlangPipe class requires the resource files (in json format)
containing the correspondence between slangs and meaning. To this end,
the language of the text indicated in the propertyLanguageName should
be contained in the resource file name (i.e., slang.xxx.json, where xxx is the
value defined in the propertyLanguageName). The location of the
resources should be defined in the "resources.slangs.path" field of
bdpar.Options variable.
Note
SlangPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> SlangPipe
Methods
Public methods
Inherited methods
Method new()
Creates a SlangPipe object.
Usage
SlangPipe$new(
propertyName = "langpropname",
propertyLanguageName = "language",
alwaysBeforeDeps = list("GuessLanguagePipe"),
notAfterDeps = list(),
replaceSlangs = TRUE,
resourcesSlangsPath = NULL
)
Arguments
propertyName: A character value. Name of the property associated with the GenericPipe.
propertyLanguageName: A character value. Name of the language property.
alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
replaceSlangs: A logical value. Indicates if the slangs are replaced or not.
resourcesSlangsPath: A character value. Path of resource files (in json format) containing the correspondence between slangs and meaning.
Method pipe()
Preprocesses the Instance to obtain/replace
the slangs. The slangs found in the data are added to the
list of properties of the Instance.
Usage
SlangPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findSlang()
Checks if the slang is in the data.
Usage
SlangPipe$findSlang(data, slang)
Arguments
Returns
A logical value depending on whether the
slang is in the data.
Method replaceSlang()
Replaces the slang in the data with the extendedSlang.
Usage
SlangPipe$replaceSlang(slang, extendedSlang, data)
Arguments
Returns
The data with the slangs replaced.
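The find and replace steps described above can be pictured with base R string functions. This is only a minimal sketch, assuming whole-word, case-insensitive matching; the helpers find_slang, replace_slang and slang_pattern and the sample slang table are hypothetical and are not the package's implementation.

```r
# Hypothetical stand-in for the contents of a slang.xxx.json resource
slangs <- c(u = "you", l8r = "later")

# Case-insensitive, whole-word pattern; \Q...\E quotes the term literally
slang_pattern <- function(slang) paste0("(?i)\\b\\Q", slang, "\\E\\b")

# TRUE when the slang occurs in the data as a whole word
find_slang <- function(data, slang) {
  grepl(slang_pattern(slang), data, perl = TRUE)
}

# Replace every whole-word occurrence of the slang with its meaning
replace_slang <- function(slang, extended_slang, data) {
  gsub(slang_pattern(slang), extended_slang, data, perl = TRUE)
}

data <- "see u l8r"
found <- character(0)
for (slang in names(slangs)) {
  if (find_slang(data, slang)) {
    found <- c(found, slang)
    data <- replace_slang(slang, slangs[[slang]], data)
  }
}
```

In this sketch, found plays the role of the property stored in the Instance and data holds the text after inline replacement. A meaning that contains backslashes would need extra escaping before being passed to gsub.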
Method getPropertyLanguageName()
Gets the name of the language property.
Usage
SlangPipe$getPropertyLanguageName()
Returns
The name of the language property.
Method getResourcesSlangsPath()
Gets the path of slangs resources.
Usage
SlangPipe$getResourcesSlangsPath()
Returns
The path of the slangs resources.
Method setResourcesSlangsPath()
Sets the path of slangs resources.
Usage
SlangPipe$setResourcesSlangsPath(path)
Arguments
- path: A character value. The new value of the path of slangs resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
SlangPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, ResourceHandler,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to find and/or remove the stop words on the data field of an Instance
Description
The StopWordPipe class is responsible for detecting
the stop words present in the data field of each Instance.
Identified stop words are stored inside the stopWord field of
the Instance. If needed, it can also perform inline
stop word removal.
Details
The StopWordPipe class requires resource files (in json format)
containing the list of stop words. To this end, the language of the text
indicated in propertyLanguageName must be contained in the
resource file name (i.e. xxx.json, where xxx is the value defined in
propertyLanguageName). The location of the resources should be
defined in the "resources.stopwords.path" field of the
bdpar.Options variable.
Note
StopWordPipe will automatically invalidate the
Instance whenever the obtained data is empty.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> StopWordPipe
Methods
Public methods
Inherited methods
Method new()
Creates a StopWordPipe object.
Usage
StopWordPipe$new(
propertyName = "stopWord",
propertyLanguageName = "language",
alwaysBeforeDeps = list("GuessLanguagePipe"),
notAfterDeps = list("AbbreviationPipe"),
removeStopWords = TRUE,
resourcesStopWordsPath = NULL
)
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- propertyLanguageName: A character value. Name of the language property.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
- removeStopWords: A logical value. Indicates if the stop words are removed or not.
- resourcesStopWordsPath: A character value. Path of resource files (in json format) containing the stop words.
Method pipe()
Preprocesses the Instance to obtain/remove
the stop words. The stop words found in the data are added to the
list of properties of the Instance.
Usage
StopWordPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method findStopWord()
Checks if the stop word is in the data.
Usage
StopWordPipe$findStopWord(data, stopWord)
Arguments
Returns
A logical value depending on whether the
stop word is in the data.
Method removeStopWord()
Removes the stop word in the data.
Usage
StopWordPipe$removeStopWord(stopWord, data)
Arguments
Returns
The data with the stop words removed.
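As an illustration of the detection and removal steps, together with the Note on empty data, the following base R sketch removes whole-word stop words and then tidies the leftover whitespace. The helpers find_stop_word and remove_stop_word and the sample word list are hypothetical; the matching rules of the real pipe may differ.

```r
# Hypothetical stand-in for the contents of an xxx.json stop word resource
stop_words <- c("the", "a", "of")

# TRUE when the stop word occurs in the data as a whole word (case-insensitive)
find_stop_word <- function(data, stop_word) {
  grepl(paste0("(?i)\\b\\Q", stop_word, "\\E\\b"), data, perl = TRUE)
}

# Remove the stop word, then collapse the whitespace left behind
remove_stop_word <- function(stop_word, data) {
  out <- gsub(paste0("(?i)\\b\\Q", stop_word, "\\E\\b"), "", data, perl = TRUE)
  trimws(gsub("\\s+", " ", out))
}

data <- "The price of a ticket"
found <- Filter(function(w) find_stop_word(data, w), stop_words)
for (w in found) data <- remove_stop_word(w, data)

# Text made only of stop words ends up empty, the case in which
# the Instance would be invalidated
only_stop_words <- remove_stop_word("the", "The")
is_empty <- !nzchar(only_stop_words)
```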
Method getPropertyLanguageName()
Gets the name of the language property.
Usage
StopWordPipe$getPropertyLanguageName()
Returns
The name of the language property.
Method getResourcesStopWordsPath()
Gets the path of stop words resources.
Usage
StopWordPipe$getResourcesStopWordsPath()
Returns
The path of the stop words resources.
Method setResourcesStopWordsPath()
Sets the path of stop words resources.
Usage
StopWordPipe$setResourcesStopWordsPath(path)
Arguments
- path: A character value. The new value of the path of stop words resources.
Method clone()
The objects of this class are cloneable with this method.
Usage
StopWordPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, ResourceHandler,
SlangPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe,
ToLowerCasePipe
Class to get the file's extension field of an Instance
Description
Gets the extension of a file. Creates the extension property, which indicates the extension of the file.
Note
StoreFileExtPipe will automatically invalidate the
Instance if it is not able to find the
extension from the path field.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> StoreFileExtPipe
Methods
Public methods
Inherited methods
Method new()
Creates a StoreFileExtPipe object.
Usage
StoreFileExtPipe$new( propertyName = "extension", alwaysBeforeDeps = list(), notAfterDeps = list() )
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to obtain the
extension of Instance.
Usage
StoreFileExtPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method obtainExtension()
Gets the extension of the path.
Usage
StoreFileExtPipe$obtainExtension(path)
Arguments
- path: A character value. The path of the file to get the extension.
Returns
Extension of the path.
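One plausible way to picture this step is with tools::file_ext from base R. The tools package is among the package's imports, but whether obtainExtension relies on this function is an assumption, and the sample paths are invented for illustration.

```r
# tools::file_ext() returns the text after the last dot, or "" when there is none
paths <- c("data/_ham_/message1.eml", "data/_spam_/sms_07.tsms", "data/README")
extensions <- tools::file_ext(paths)

# Hypothetical validity check mirroring the Note: a path without an
# extension would lead to an invalidated Instance
is_valid <- nzchar(extensions)
```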
Method clone()
The objects of this class are cloneable with this method.
Usage
StoreFileExtPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, TargetAssigningPipe,
TeeCSVPipe, ToLowerCasePipe
Class to get the target field of the Instance
Description
This class searches the path of the Instance
for its target.
Details
The targets to search for are configured through the class constructor: targetsName holds the strings that are searched for within the path, and targets holds the values that the property can take.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> TargetAssigningPipe
Methods
Public methods
Inherited methods
Method new()
Creates a TargetAssigningPipe object.
Usage
TargetAssigningPipe$new(
targets = list("ham", "spam"),
targetsName = list("_ham_", "_spam_"),
propertyName = "target",
alwaysBeforeDeps = list(),
notAfterDeps = list()
)
Arguments
- targets: A list value. Name of the targets property.
- targetsName: A list value. The name of folders.
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to obtain the
target.
Usage
TargetAssigningPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method getTarget()
Gets the target from a path.
Usage
TargetAssigningPipe$getTarget(path)
Arguments
- path: A character value. The path to analyze.
Returns
The target of the path.
Method checkTarget()
Checks if the target is in the path.
Usage
TargetAssigningPipe$checkTarget(target, path)
Arguments
Returns
If the target is found, returns the target; otherwise returns "".
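The lookup described by getTarget and checkTarget can be sketched in base R as follows, using the default targets and targetsName values from the constructor. The helper names are hypothetical, check_target takes the marker string as an extra argument for clarity, and returning "" when nothing matches is an assumption about the fallback.

```r
targets <- list("ham", "spam")
targets_name <- list("_ham_", "_spam_")

# Returns the target when its marker string occurs in the path, "" otherwise
check_target <- function(target, marker, path) {
  if (grepl(marker, path, fixed = TRUE)) target else ""
}

# Tries each marker in order and returns the first target found
get_target <- function(path) {
  for (i in seq_along(targets)) {
    found <- check_target(targets[[i]], targets_name[[i]], path)
    if (nzchar(found)) return(found)
  }
  ""  # assumed fallback when no marker is present in the path
}

spam_target <- get_target("corpus/_spam_/msg1.eml")
ham_target <- get_target("corpus/_ham_/msg2.tsms")
no_target <- get_target("corpus/other/msg3.eml")
```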
Method getTargets()
Gets the targets.
Usage
TargetAssigningPipe$getTargets()
Returns
The targets.
Method clone()
The objects of this class are cloneable with this method.
Usage
TargetAssigningPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TeeCSVPipe, ToLowerCasePipe
Class to handle a CSV with the properties field of the preprocessed Instance
Description
Completes a CSV with the properties of the preprocessed
Instance.
Details
The path to save the properties should be defined in the "teeCSVPipe.output.path" field of bdpar.Options variable.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> TeeCSVPipe
Methods
Public methods
Inherited methods
Method new()
Creates a TeeCSVPipe object.
Usage
TeeCSVPipe$new( propertyName = "", alwaysBeforeDeps = list(), notAfterDeps = list(), withData = TRUE, withSource = TRUE, outputPath = NULL )
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
- withData: A logical value. Indicates if the data is added to CSV.
- withSource: A logical value. Indicates if the source is added to CSV.
- outputPath: A character value. The path of CSV.
Method pipe()
Completes the CSV with the preprocessed
Instance.
Usage
TeeCSVPipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method clone()
The objects of this class are cloneable with this method.
Usage
TeeCSVPipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, bdpar.Options,
ContractionPipe, File2Pipe,
FindEmojiPipe, FindEmoticonPipe,
FindHashtagPipe, FindUrlPipe,
FindUserNamePipe, GuessDatePipe,
GuessLanguagePipe, Instance,
InterjectionPipe, MeasureLengthPipe,
GenericPipe, ResourceHandler,
SlangPipe, StopWordPipe,
StoreFileExtPipe, TargetAssigningPipe,
ToLowerCasePipe
Class to convert the data field of an Instance to lower case
Description
Class to convert the data field of an Instance
to lower case.
Inherit
This class inherits from GenericPipe and implements the
pipe abstract function.
Super class
bdpar::GenericPipe -> ToLowerCasePipe
Methods
Public methods
Inherited methods
Method new()
Creates a ToLowerCasePipe object.
Usage
ToLowerCasePipe$new( propertyName = "", alwaysBeforeDeps = list(), notAfterDeps = list() )
Arguments
- propertyName: A character value. Name of the property associated with the GenericPipe.
- alwaysBeforeDeps: A list value. The dependencies alwaysBefore (GenericPipes that must be executed before this one).
- notAfterDeps: A list value. The dependencies notAfter (GenericPipes that cannot be executed after this one).
Method pipe()
Preprocesses the Instance to convert the
data to lower case.
Usage
ToLowerCasePipe$pipe(instance)
Arguments
Returns
The Instance with the modifications that have
occurred in the pipe.
Method toLowerCase()
Converts the data to lower case.
Usage
ToLowerCasePipe$toLowerCase(data)
Arguments
- data: A character value. Text to preprocess.
Returns
The data in lower case.
Method clone()
The objects of this class are cloneable with this method.
Usage
ToLowerCasePipe$clone(deep = FALSE)
Arguments
- deep: Whether to make a deep clone.
See Also
AbbreviationPipe, ContractionPipe,
File2Pipe, FindEmojiPipe,
FindEmoticonPipe, FindHashtagPipe,
FindUrlPipe, FindUserNamePipe,
GuessDatePipe, GuessLanguagePipe,
Instance, InterjectionPipe,
MeasureLengthPipe, GenericPipe,
ResourceHandler, SlangPipe,
StopWordPipe, StoreFileExtPipe,
TargetAssigningPipe, TeeCSVPipe
Object to handle the keys/attributes/options common to the whole pipeline flow
Description
This class provides the necessary methods to manage a list of keys or options used along the pipe flow, both those provided by the default library and those implemented by the user.
Usage
bdpar.Options
Details
By default, the application initializes the object named bdpar.Options
of type BdparOptions which is in charge of initializing the
options used in the defined pipes.
The default fields of bdpar.Options are initialized, if needed,
as shown below:
[eml]
- bdpar.Options$set("extractorEML.mpaPartSelected", <<PartSelectedOnMPAlternative>>)
[resources]
- bdpar.Options$set("resources.abbreviations.path", <<abbreviation.path>>)
- bdpar.Options$set("resources.contractions.path", <<contractions.path>>)
- bdpar.Options$set("resources.interjections.path", <<interjections.path>>)
- bdpar.Options$set("resources.slangs.path", <<slangs.path>>)
- bdpar.Options$set("resources.stopwords.path", <<stopwords.path>>)
[teeCSVPipe]
- bdpar.Options$set("teeCSVPipe.output.path", <<output.path>>)
[youtube]
- bdpar.Options$set("youtube.app.id", <<app_id>>)
- bdpar.Options$set("youtube.app.password", <<app_password>>)
- bdpar.Options$set("cache.youtube.path", <<cache.path>>)
[cache]
- bdpar.Options$set("cache", <<status_cache>>)
- bdpar.Options$set("cache.folder", <<cache.path>>)
[parallel]
- bdpar.Options$set("numCores", <<num_cores>>)
[verbose]
- bdpar.Options$set("verbose", <<status_verbose>>)
Cache functionality
If the bdpar cache is configured through the "cache" and "cache.folder" options, the status of the instances will be stored after each pipe. This avoids re-executing previously completed tasks, provided that the order and configuration of the pipes and the pipeline match what is stored in the cache.
If you want to remove the cache, the cleanCache method does
this task.
Parallel functionality
Parallel processing of instances is configured through the "numCores" option, which sets the number of cores used for processing.
When running in parallel, only file logging works, so that the information produced by all the cores can be collected.
Log configuration
The bdpar log is configured through the configureLog function.
This system manages both the place to display the messages and the priority
level of each message showing only the messages with a higher level than
indicated in the threshold variable.
If you want to deactivate the bdpar log, the disableLog
method in bdpar.Options does this task.
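The threshold rule can be pictured as a comparison over an ordered set of levels. The sketch below is only an illustration: the helper passes_threshold is hypothetical, and it assumes that a message at exactly the threshold level is kept, which is consistent with the statement that messages with a lower priority level are discarded.

```r
# Priority levels from lowest to highest, as listed for bdpar.log
log_levels <- c("DEBUG", "INFO", "WARN", "ERROR", "FATAL")

# TRUE when a message at `level` survives the given threshold
# (assumption: a message at exactly the threshold level is kept)
passes_threshold <- function(level, threshold = "INFO") {
  match(level, log_levels) >= match(threshold, log_levels)
}

shown <- log_levels[passes_threshold(log_levels, threshold = "WARN")]
```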
Methods
- get: obtains a specific option.
  - Usage: get(key)
  - Value: the value of the specific option.
  - Arguments:
    - key: (character) the name of the option to obtain.
- add: adds an option to the list of options.
  - Usage: add(key, value)
  - Arguments:
    - key: (character) the name of the new option.
    - value: (Object) the value of the new option.
- set: modifies the value of an option.
  - Usage: set(key, value)
  - Arguments:
    - key: (character) the name of the option to modify.
    - value: (Object) the new value of the option.
- remove: removes a specific option.
  - Usage: remove(key)
  - Arguments:
    - key: (character) the name of the option to remove.
- getAll: gets the list of options.
  - Usage: getAll()
  - Value: the list of options.
- reset: resets the option list to the initial state.
  - Usage: reset()
- isSpecificOption: checks for the existence of a specific option.
  - Usage: isSpecificOption(key)
  - Value: a boolean indicating whether the specific option exists in the list of options.
  - Arguments:
    - key: (character) the key of the option to check.
- cleanCache: cleans the cache of executed pipelines. Deletes all files and directories that are in the path defined in the "cache.folder" option.
  - Usage: cleanCache()
- configureLog: configures the bdpar log. In the case of parallelization, only file logging works.
  - Usage: configureLog(console = TRUE, threshold = "INFO", file = NULL)
  - Arguments:
    - console: (boolean) whether the log is shown on the console.
    - threshold: (character) the logging threshold level. Messages with a lower priority level will be discarded.
    - file: (character) the file to write messages to. If it is NULL, file logging is not enabled.
- disableLog: deactivates the bdpar log.
  - Usage: disableLog()
- getLogConfiguration: prints the bdpar log configuration.
  - Usage: getLogConfiguration()
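To make the difference between add (new keys) and set (existing keys) concrete, here is a minimal base R sketch of a key/value options store with the same method names. It is a hypothetical stand-in, not the BdparOptions class; in particular, raising an error on a duplicate or unknown key is an assumption.

```r
# Hypothetical stand-in for an options object; not the BdparOptions class itself
make_options <- function(defaults = list()) {
  store <- defaults
  list(
    get = function(key) store[[key]],
    add = function(key, value) {
      # Assumed behaviour: adding an existing key is an error
      if (key %in% names(store)) stop("option already exists: ", key)
      store[[key]] <<- value
      invisible(NULL)
    },
    set = function(key, value) {
      # Assumed behaviour: setting an unknown key is an error
      if (!key %in% names(store)) stop("unknown option: ", key)
      store[[key]] <<- value
      invisible(NULL)
    },
    remove = function(key) {
      store[[key]] <<- NULL
      invisible(NULL)
    },
    getAll = function() store,
    reset = function() {
      store <<- defaults
      invisible(NULL)
    },
    isSpecificOption = function(key) key %in% names(store)
  )
}

my_options <- make_options(list(verbose = FALSE, numCores = 1))
my_options$set("numCores", 2)              # existing key: use set
my_options$add("my.custom.key", "value")   # new key: use add
has_key <- my_options$isSpecificOption("my.custom.key")
my_options$remove("my.custom.key")
cores_before_reset <- my_options$get("numCores")
my_options$reset()
cores_after_reset <- my_options$get("numCores")
```

Note that in this sketch assigning NULL through set would delete the key, so NULL values are avoided in the example.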
See Also
AbbreviationPipe, bdpar.log,
Connections, ContractionPipe,
ExtractorEml, ExtractorYtbid,
GuessLanguagePipe, Instance,
SlangPipe, StopWordPipe,
TeeCSVPipe, %>|%
Write messages to the log at a given priority level using the custom bdpar log
Description
bdpar.log is responsible for managing the messages to
show on the log.
Usage
bdpar.log(message, level = "INFO", className = NULL, methodName = NULL)
Arguments
message |
A string to be printed to the log with the corresponding priority level. |
level |
The desired priority level (DEBUG, INFO, WARN, ERROR and FATAL). If the level is FATAL, the stop function is called. If the level is WARN, the message is issued as a warning. |
className |
A string indicating the class from which the log is called. If the value is NULL, this field is not shown in the log. |
methodName |
A string indicating the method from which the log is called. If the value is NULL, this field is not shown in the log. |
Details
The output format is as follows:
[currentTime][className][methodName][level] message
The type of message changes according to the level indicated:
- The DEBUG, INFO and ERROR levels emit the text
using the message function.
- The WARN level emits the text using the warning function.
- The FATAL level emits the text using the stop function.
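The format and the dispatch by level described above can be sketched in base R as follows. The functions format_log_line and emit are hypothetical and are not bdpar.log itself; the timestamp format is an assumption.

```r
# Hypothetical formatter following the [currentTime][className][methodName][level] layout
format_log_line <- function(message, level = "INFO", className = NULL,
                            methodName = NULL,
                            time = format(Sys.time(), "%Y-%m-%d %H:%M:%S")) {
  # NULL fields are dropped by c(), so they do not appear in the output
  fields <- c(time, className, methodName, level)
  paste0(paste0("[", fields, "]", collapse = ""), " ", message)
}

# Dispatch on the level: warning for WARN, stop for FATAL, message otherwise
emit <- function(line, level) {
  switch(level,
    WARN = warning(line, call. = FALSE),
    FATAL = stop(line, call. = FALSE),
    message(line)
  )
}

full_line <- format_log_line("Message example", "INFO", "MyClass", "myMethod",
                             time = "2023-12-11 10:00:00")
short_line <- format_log_line("Message example", "DEBUG",
                              time = "2023-12-11 10:00:00")
emit(full_line, "INFO")
```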
Note
In the case of multithreading, only file logging is available.
See Also
Examples
## Not run:
# First step, configure the behavior of log
bdpar.Options$configureLog(console = TRUE, threshold = "DEBUG", file = NULL)
message <- "Message example"
className <- "Class name example"
methodName <- "Method name example"
bdpar.log(message = message, level = "DEBUG", className = NULL, methodName = NULL)
bdpar.log(message = message, level = "INFO", className = className, methodName = methodName)
bdpar.log(message = message, level = "WARN", className = className, methodName = NULL)
bdpar.log(message = message, level = "ERROR", className = NULL, methodName = NULL)
bdpar.log(message = message, level = "FATAL", className = NULL, methodName = methodName)
## End(Not run)
Example of the content of the files to be preprocessed.
Description
A manually collected data set containing e-mails and SMS messages from the nutritional and health domain classified as spam and non-spam (with a ratio of 50%). In addition the dataset contains two variables: (i) path which indicates the location of the target file and, (ii) source which contains the raw text comprising each file.
Usage
data(bdparData)
Format
A data frame with 20 rows and 2 variables:
- path: File path.
- source: File content.
Emojis codes and descriptions data.
Description
This data comes from "Unicode.org", <http://unicode.org/emoji/charts/full-emoji-list.html>. The data are codes and descriptions of Emojis.
Usage
data(emojisData)
Format
A data frame with 2623 rows and 2 variables:
- code: Emoji code.
- description: Emoji description.
bdpar customized forward-pipe operator
Description
Defines a customized forward-pipe operator that extends the
features of the classical %>%. Specifically, %>|% stops the pipelining
process whenever an Instance has been invalidated. This
avoids executing the whole pipelining process for the invalidated
Instance and therefore reduces the time and resources needed to
complete the whole process.
Usage
lhs %>|% rhs
Arguments
lhs |
an Instance object. |
rhs |
a function call using the bdpar semantics. |
Value
The Instance modified by the methods it has traversed.
Details
This is a modified version of the %>% operator from the magrittr library that
(i) stops the flow when the Instance is invalid, (ii)
automatically calls the pipe function of the R6 objects passing
through it, (iii) checks the dependencies of the Instance and
(iv) manages the pipeline cache.
The usage structure would be as shown below:
instance %>|% pipeObject$new() %>|% pipeObject$new(<<argument1>>, <<argument2>>, ...) %>|% pipeObject$new()
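The short-circuit behaviour, point (i) above, can be illustrated with a much simpler operator written in base R. This is only a sketch under stated assumptions: the operator is named %then% (a hypothetical name, chosen to avoid masking the package's own %>|%), instances are plain lists with data and valid fields, and the right-hand side is an ordinary function rather than an R6 pipe object. Dependency checking and cache management are not modelled.

```r
# Hypothetical operator: run the next step only while the instance is valid
`%then%` <- function(lhs, rhs) {
  if (!isTRUE(lhs$valid)) return(lhs)
  rhs(lhs)
}

# Record which steps actually run
calls <- character(0)
step <- function(name, fn) {
  function(instance) {
    calls <<- c(calls, name)
    fn(instance)
  }
}

trim <- step("trim", function(i) { i$data <- trimws(i$data); i })
validate <- step("validate", function(i) { i$valid <- nzchar(i$data); i })
lower <- step("lower", function(i) { i$data <- tolower(i$data); i })

# The blank input is invalidated by `validate`, so `lower` is never called
result <- list(data = "   ", valid = TRUE) %then% trim %then% validate %then% lower
```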
Note
Pipelining process is automatically stopped if the Instance
is invalid.
See Also
bdpar.Options, Instance,
GenericPipe
Initiates the pipelining process
Description
runPipeline provides an easy way to start the pipelining preprocessing process.
Usage
runPipeline(path, extractors = ExtractorFactory$new(),
pipeline = DefaultPipeline$new(), cache = TRUE, verbose = FALSE, summary = FALSE)
Arguments
path |
(character) path where the files to be preprocessed are located. |
extractors |
(ExtractorFactory) object implementing
the method |
pipeline |
(GenericPipeline) subclass of |
cache |
(logical) flag indicating if the status of the instances will be stored after each pipe. This avoids re-executing previously completed tasks, provided that the order and configuration of the pipes and the pipeline match what is stored in the cache. |
verbose |
(logical) flag indicating whether messages, warnings and errors are printed. |
summary |
(logical) flag indicating if a summary of the pipeline execution is provided or not. |
Value
List of Instance that have been preprocessed.
Details
If a pipe defined in the workflow needs some type of configuration, it can be defined through the bdpar.Options variable, which has different methods to support the functionality of the different pipes.
See Also
Bdpar, bdpar.Options,
Connections, DefaultPipeline,
DynamicPipeline, GenericPipeline,
Instance, ExtractorFactory,
ResourceHandler
Examples
## Not run:
#If it is necessary to indicate any existing configuration key, do it through:
#bdpar.Options$set(key, value)
#If the key is not initialized, do it through:
#bdpar.Options$add(key, value)
#If it is necessary to parallelize, do it through:
#bdpar.Options$set("numCores", numCores)
#If it is necessary to change the behavior of the log, do it through:
#bdpar.Options$configureLog(console = TRUE, threshold = "INFO", file = NULL)
#Folder with the files to preprocess
path <- system.file("example",
package = "bdpar")
#Object which decides how the instances are created
extractors <- ExtractorFactory$new()
#Object which indicates the pipes' flow
pipeline <- DefaultPipeline$new()
#Starting file preprocessing...
runPipeline(path = path,
extractors = extractors,
pipeline = pipeline,
cache = FALSE,
verbose = FALSE,
summary = TRUE)
## End(Not run)