The handwriterRF package has a pre-trained random forest and a set of reference
similarity scores that are the defaults for compare_documents() and
compare_writer_profiles(). This tutorial shows you how to train your own
random forest and create your own set of reference scores to use with
these functions.
You need scanned handwriting samples saved as PNG images for training the random forest and making reference scores. The training set must include at least two samples from each writer so that the random forest can see examples of documents written by the same writer and examples of documents written by different writers.
The CSAFE Handwriting Database contains suitable handwriting samples that you may download for free if you don’t have your own samples.
Place the handwriting samples that you will use to train a random forest
in a folder. The first step is to estimate a writer profile from each
handwriting sample. We do this with handwriter::get_writer_profiles().
Behind the scenes, handwriter::get_writer_profiles() performs the
following steps for each sample:
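1. Split the handwriting into component shapes, called graphs, with
   handwriter::processDocument().
2. Sort the graphs into clusters of similar shapes using a cluster
   template of the kind created with
   handwriter::make_clustering_template(). By default,
   handwriter::get_writer_profiles() uses the cluster template
   templateK40 included with handwriter. You may create your own
   cluster template if you prefer.
3. Calculate the proportion of graphs in each cluster with
   handwriter::get_cluster_fill_rates(). The cluster fill rates serve as
   an estimate of a writer profile for the writer of the document.

Load handwriter and handwriterRF.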
Calculate writer profiles for the training samples with templateK40.
The output is a data frame.
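As a sketch of this step, the call below mirrors the get_writer_profiles() call that this tutorial uses later for the reference samples. The folder paths are placeholders that you must replace, and the guard around the call is an addition of this sketch so that nothing runs unless both packages and the training folder are available.

```r
# Placeholder paths (assumptions): point these at your own folders.
train_dir <- "path/to/training/samples/folder"
out_dir <- "path/to/output/folder"

# Only run when both packages are installed and the training folder exists.
can_run <- requireNamespace("handwriter", quietly = TRUE) &&
  requireNamespace("handwriterRF", quietly = TRUE) &&
  dir.exists(train_dir)

if (can_run) {
  library(handwriter)
  library(handwriterRF)

  # Estimate one writer profile (cluster fill rates) per PNG in train_dir.
  profiles <- handwriter::get_writer_profiles(
    input_dir = train_dir,
    measure = "rates",
    num_cores = 1,
    template = handwriter::templateK40,
    output_dir = out_dir
  )
  str(profiles)
} else {
  message("Install handwriter and handwriterRF and set train_dir to run this sketch.")
}
```

The resulting profiles data frame is the object passed as df when training the random forest.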
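Now that we have writer profiles, we can train a random forest.
train_rf() calculates distances between pairs of writer profiles with
the distance measures you select and trains a random forest on those
distances. See ?train_rf for more information about these measures.

When running train_rf() you have several choices to make:

- The number of trees. Setting ntrees = 200 produced good results.
- Whether to save the random forest. If you do not use the output_dir
  argument, the random forest will be returned but not saved to your
  computer.
- Whether to balance the training data with
  downsample_diff_pairs = TRUE. This randomly samples the different
  writers distances to equal the number of same writer distances.

rf <- train_rf(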
  df = profiles,
  ntrees = 200,
  distance_measures = c("abs", "man", "euc", "max", "cos"),
  output_dir = "path/to/output/folder",
  downsample_diff_pairs = TRUE
)
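To make the pairwise distances and the downsampling concrete, here is a toy illustration in base R. It is not the code inside train_rf(). The toy profiles are random, and reading "man", "euc", and "max" as the Manhattan, Euclidean, and maximum distances is an assumption of this sketch; check ?train_rf for the definitions the package actually uses.

```r
set.seed(3)

# Toy writer profiles: 3 writers x 2 documents each, 4 clusters.
# Each row holds cluster fill rates, so each row sums to 1.
writers <- rep(c("w1", "w2", "w3"), each = 2)
n_docs <- length(writers)
counts <- matrix(rpois(n_docs * 4, lambda = 20) + 1, nrow = n_docs)
rates <- counts / rowSums(counts)

# Every unordered pair of documents.
pairs <- t(combn(n_docs, 2))

# Distances between the two profiles in each pair (assumed definitions).
toy_distances <- data.frame(
  doc1 = pairs[, 1],
  doc2 = pairs[, 2],
  man = apply(pairs, 1, function(p) sum(abs(rates[p[1], ] - rates[p[2], ]))),
  euc = apply(pairs, 1, function(p) sqrt(sum((rates[p[1], ] - rates[p[2], ])^2))),
  max = apply(pairs, 1, function(p) max(abs(rates[p[1], ] - rates[p[2], ]))),
  match = ifelse(writers[pairs[, 1]] == writers[pairs[, 2]], "same", "different")
)

# Downsample the different writers pairs to the number of same writer pairs.
same_rows <- which(toy_distances$match == "same")
diff_rows <- which(toy_distances$match == "different")
keep_diff <- diff_rows[sample.int(length(diff_rows), length(same_rows))]
balanced <- toy_distances[c(same_rows, keep_diff), ]

table(toy_distances$match)
table(balanced$match)
```

With six documents there are 15 pairs, of which only 3 are same writer pairs. This imbalance is what the downsampling option is meant to address.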
If you would like to train a series of random forests with lapply or a
for loop, use the run number and output directory (output_dir)
arguments. The run number is added to the file name when the random
forest is saved, so that subsequent random forests are not saved over
the previous ones.
The functions compare_documents() and compare_writer_profiles() return
either a similarity score or a score-based likelihood ratio. Both
express how similar two handwriting samples are to each other.
The score-based likelihood ratio (SLR) builds upon the observed similarity score by comparing it to reference same writer and different writers similarity scores. The SLR is the ratio of the likelihood of observing the similarity score if the samples were written by the same writer to the likelihood of observing the similarity score if the samples were written by different writers.
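The ratio described above can be sketched in base R with simulated scores. This is an illustration only: the beta-distributed toy scores and the kernel density estimates are assumptions of this sketch, and handwriterRF may estimate the two likelihoods differently.

```r
set.seed(1)

# Simulated reference scores in [0, 1] (toy data, not real handwriting scores).
same_writer <- rbeta(200, 8, 2)    # tends to be high
diff_writer <- rbeta(2000, 2, 8)   # tends to be low

# Kernel density estimate of `scores`, evaluated at the point `x`.
density_at <- function(scores, x) {
  d <- density(scores, from = 0, to = 1, n = 512)
  approx(d$x, d$y, xout = x)$y
}

# Toy SLR: density under same writer scores divided by
# density under different writers scores, both at the observed score.
toy_slr <- function(obs, same, diff) {
  density_at(same, obs) / density_at(diff, obs)
}

toy_slr(0.6, same_writer, diff_writer)  # greater than 1: favors same writer
toy_slr(0.2, same_writer, diff_writer)  # less than 1: favors different writers
```

A value above one means the observed score is more typical of the same writer reference scores, and a value below one means it is more typical of the different writers reference scores.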
If compare_documents() and compare_writer_profiles() return only the
similarity score, reference scores are not used. But if these functions
calculate an SLR, they need reference scores. handwriterRF includes a
set of reference scores as ref_scores for use with these functions, but
you can also create your own set of reference scores.
Refer to the sections above to obtain suitable training samples and estimate writer profiles.
ref_profiles <- handwriter::get_writer_profiles(
  input_dir = "path/to/ref/samples/folder",
  measure = "rates",
  num_cores = 1,
  template = handwriter::templateK40,
  output_dir = "path/to/output/folder"
)

rscores <- get_ref_scores(rforest = rf, df = ref_profiles)
We can plot the built-in reference scores in a way similar to a
histogram. These scores range from 0 to 1, inclusive. The plot_scores()
function divides this range into bins and calculates the proportion of
scores that fall into each bin. Normally, a histogram would show the
count of scores in each bin. However, since there are many more
different writers scores than same writer scores, the histogram for
different writers scores dominates, making the same writer histogram
hard to see. To fix this, we plot the proportion (rate) of scores in
each bin instead of the raw frequency, which balances the two
histograms and makes both more visible.
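The difference between counts and rates can be seen with a small base R sketch. The scores are simulated and the bin width of 0.05 is arbitrary; this shows the idea behind the plot, not the internals of plot_scores().

```r
set.seed(2)

# Simulated scores: far more different writers scores than same writer scores.
same_writer <- rbeta(100, 8, 2)
diff_writer <- rbeta(5000, 2, 8)

# 20 equal-width bins covering [0, 1].
breaks <- seq(0, 1, by = 0.05)

# Proportion of scores that fall in each bin.
bin_rates <- function(scores, breaks) {
  counts <- hist(scores, breaks = breaks, plot = FALSE)$counts
  counts / sum(counts)
}

rates <- data.frame(
  bin_start = head(breaks, -1),
  same_writer = bin_rates(same_writer, breaks),
  diff_writer = bin_rates(diff_writer, breaks)
)

# Each column of rates sums to 1, so the two groups share a common scale
# even though one group has 50 times as many scores as the other.
colSums(rates[, c("same_writer", "diff_writer")])
```

Plotting the raw counts instead would put the same writer bars on a scale set by the 5000 different writers scores, which is the visibility problem described above.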
If we want to see how an observed score compares to the same writer and
different writers scores, we use the obs_score argument of
plot_scores(). For example, if the observed score is 0.2, we set
obs_score = 0.2.
You can also plot your own reference scores.
In this section, we will use the new random forest and reference scores to compare two handwritten documents. As before, the handwriting samples need to be scanned and saved as PNG files. Do not use samples or writers that were used to create the random forest or the reference scores, as this may bias the results.
First, compare the two documents with the default random forest and
reference scores. As an example, we use two handwriting samples included
in handwriterRF. The system.file() function finds the location of files
installed with the handwriterRF package on your computer. We use
score_only = FALSE to return an SLR.
sample1 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r02.png", package = "handwriterRF")
sample2 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r03.png", package = "handwriterRF")

df <- compare_documents(
  sample1,
  sample2,
  score_only = FALSE
)
df
#>             docname1  writer1           docname2  writer2 score      slr
#> 1 w0238_s01_pWOZ_r02 unknown1 w0238_s01_pWOZ_r03 unknown2  0.98 130.0626
The SLR is greater than one, which means the similarity score is more like the reference same writer scores than the different writers scores. We plot the observed score with the reference scores.
Next, compare the same documents with the new random forest and reference scores and plot the observed score.
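A sketch of this last step follows. It assumes that rf and rscores were created as shown earlier, and it assumes that compare_documents() accepts the custom objects through arguments named rforest and reference_scores and that plot_scores() takes the scores through an argument named scores. Those argument names are not confirmed by this tutorial, so check ?compare_documents and ?plot_scores before relying on them.

```r
# Run only if handwriterRF is installed and the objects from the earlier
# steps exist in the current environment. inherits = FALSE avoids matching
# unrelated objects of the same name, such as stats::rf.
can_run <- requireNamespace("handwriterRF", quietly = TRUE) &&
  exists("rf", inherits = FALSE) &&
  exists("rscores", inherits = FALSE)

if (can_run) {
  sample1 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r02.png", package = "handwriterRF")
  sample2 <- system.file("extdata", "docs", "w0238_s01_pWOZ_r03.png", package = "handwriterRF")

  df_new <- handwriterRF::compare_documents(
    sample1,
    sample2,
    score_only = FALSE,
    rforest = rf,                # assumed argument name
    reference_scores = rscores   # assumed argument name
  )
  print(df_new)

  # Plot the observed score against the new reference scores.
  print(handwriterRF::plot_scores(scores = rscores, obs_score = df_new$score))
} else {
  message("Create rf and rscores first, then rerun this sketch.")
}
```

Because the new random forest and reference scores come from your own samples, the score and SLR will generally differ from the values obtained with the defaults.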