library(arete)
#> Can't find a default virtual environment for arete. If this is your first time loading the package, please run arete_setup().
#>
#> Attaching package: 'arete'
#> The following object is masked from 'package:base':
#>
#> labels
Let’s say you want to extract data from a paper. Normally you’d run something that looks like this:
geotest = arete::get_geodata(
  path = file_path,
  user_key = list(key = "your key here!", premium = TRUE),
  model = "gpt-4o",
  outpath = "/your/path/here"
)
Because the extraction process depends on an internet connection and your own personal user key, this example won’t run here. Instead we will open a CSV with pre-run results, but feel free to try it yourself! get_geodata() generates one CSV file per PDF in its input. In our example data we have already collected all CSVs into a single table.
Species | Location | Coordinates | ID | Type |
---|---|---|---|---|
Araneus holzapfelae | Limpopo: Blouberg Nature Reserve | -22.99, 29.04 | 1 | Ground truth |
Araneus holzapfelae | Little Leigh Farm, Louis Trichardt | -22.949, 29.870 | 1 | Ground truth |
Araneus holzapfelae | Mpumalanga: Brondal | -25.35, 30.84 | 1 | Ground truth |
Araneus holzapfelae | Mpumalanga: Pretoriuskop | -25.123, 32.237 | 2 | Ground truth |
Araneus holzapfelae | Gauteng: Ezemvelo Nature Reserve | -25.80, 28.77 | 2 | Ground truth |
Anapistula ataecina | Gauteng: Faerie Glen Nature Reserve | -25.74, 28.19 | 2 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: Empangeni | -28.72, 31.88 | 3 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: Richards Bay | -28.78, 32.10 | 3 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: iSimangaliso Wetland Park, uMkhuze Game Reserve | -27.63, 32.25 | 3 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: uMkhuze Game Reserve | -27.62174, 32.24543 | 4 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: Isandlwane Nature Reserve | -28.359, 30.640 | 4 | Ground truth |
Araneus holzapfelae | KwaZulu-Natal: Wakefield Farm | -29.4987, 29.9106 | 4 | Ground truth |
Araneus holzapfelae | Limpopo: Blouberg | -22.99, 29.04 | 1 | Model |
A. holzapfelae | Little Leigh Farm, Louis Trichardt | -22.949, 29.870 | 1 | Model |
Araneus holzapfelae | Mpumalanga: Brondal | -25.35, 30.84 | 1 | Model |
Araneus holzapfelae | Mpumalanga: Pretoriuskop | -25.123, 32.237 | 2 | Model |
Araneus holzapfelae | Gauteng: Ezemvelo Nature Reserve | -25.85, 28.78 | 2 | Model |
Araneus holzapfelae | Faerie Glen | -25.74, 28.19 | 2 | Model |
Araneus holzapfelae | KwaZulu-Natal: Empangeni | -28.72, 31.88 | 3 | Model |
Araneus holzapfelae | KwaZulu-Natal: Richards Bay | -29.78, 32.10 | 3 | Model |
Araneus holzapfelae | Game Reserve | -27.62174, 32.24543 | 4 | Model |
Macrothele calpeiana | KwaZulu-Natal: Isandlwane Nature Reserve | -28.359, 30.640 | 4 | Model |
Araneus holzapfelae | KwaZulu-Natal: Wakefield Farm | -29.4987, 29.9106 | 4 | Model |
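If you run the extraction yourself, the per-paper CSVs have to be stacked into one table like the one above. The following is only a sketch in base R: the folder, the file names and the column layout are assumptions made for the demonstration (the files are written by the snippet itself), and the `ID` column is simply assigned from the file order.

```r
# Hypothetical output folder holding one CSV per paper (names are made up)
out_dir <- file.path(tempdir(), "arete_demo")
dir.create(out_dir, showWarnings = FALSE)

# Write two small demo files so the example is self-contained
write.csv(data.frame(Species = "Araneus holzapfelae",
                     Location = "Mpumalanga: Brondal",
                     Coordinates = "-25.35, 30.84"),
          file.path(out_dir, "paper_1.csv"), row.names = FALSE)
write.csv(data.frame(Species = "Araneus holzapfelae",
                     Location = "Mpumalanga: Pretoriuskop",
                     Coordinates = "-25.123, 32.237"),
          file.path(out_dir, "paper_2.csv"), row.names = FALSE)

# Read every CSV in the folder and tag its rows with a paper ID
files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(seq_along(files), function(i) {
  tab <- read.csv(files[i], stringsAsFactors = FALSE)
  tab$ID <- i
  tab
})

# Stack all papers into a single table
combined <- do.call(rbind, tables)
combined
```

Using `do.call(rbind, ...)` assumes every file has the same columns; files with differing layouts would need to be harmonized first.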
In this case we will be as careful as possible and go over outliers separately from get_geodata(). This is a good example of the limitations of the process: get_geodata() can automatically do the next step for you, but in situations where for some reason coordinates are written in text as latitude longitude instead of longitude latitude, some outlier detection methods (env, svm) will fail.
Let’s start by converting all of the coordinates from text to numeric values.
geocoords = string_to_coords(geotest$Coordinates)
#> 23 out of 23 (100%) succeded.
kableExtra::kable(geocoords)
Lat | Long |
---|---|
-22.99000 | 29.04000 |
-22.94900 | 29.87000 |
-25.35000 | 30.84000 |
-25.12300 | 32.23700 |
-25.80000 | 28.77000 |
-25.74000 | 28.19000 |
-28.72000 | 31.88000 |
-28.78000 | 32.10000 |
-27.63000 | 32.25000 |
-27.62174 | 32.24543 |
-28.35900 | 30.64000 |
-29.49870 | 29.91060 |
-22.99000 | 29.04000 |
-22.94900 | 29.87000 |
-25.35000 | 30.84000 |
-25.12300 | 32.23700 |
-25.85000 | 28.78000 |
-25.74000 | 28.19000 |
-28.72000 | 31.88000 |
-29.78000 | 32.10000 |
-27.62174 | 32.24543 |
-28.35900 | 30.64000 |
-29.49870 | 29.91060 |
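To show what this conversion amounts to without needing the package, here is a minimal base R sketch. It assumes every entry is written as decimal degrees in the form "lat, long" with a comma separator; `parse_coords()` is a hypothetical helper written for this illustration and is not how string_to_coords() is implemented.

```r
# Hypothetical helper: split "lat, long" strings into two numeric columns
parse_coords <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  # Each element becomes a length-2 numeric vector (NA where a part is missing)
  mat <- t(vapply(parts, function(p) as.numeric(trimws(p[1:2])), numeric(2)))
  data.frame(Lat = mat[, 1], Long = mat[, 2])
}

coords_text <- c("-22.99, 29.04", "-22.949, 29.870", "-25.35, 30.84")
coords <- parse_coords(coords_text)
coords
```

Entries in other notations (degrees, minutes and seconds, hemisphere letters) are outside what this sketch handles.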
Species names often fail to match between human-extracted and model-extracted data, for example because one source uses a species’ abbreviated name instead of its full name. Additionally, models will sometimes erratically add characters that might go undetected, especially if OCR-extracted text was used. To get a good idea of model performance it is therefore often important to standardize species names. Here is an example for paper 1 in our dataset:
geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
)
mismatch = which(geonames$human_names != geonames$model_names)
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "red")
geonames
human_names | model_names |
---|---|
Araneus holzapfelae | Araneus holzapfelae |
Araneus holzapfelae | A. holzapfelae |
Araneus holzapfelae | Araneus holzapfelae |
By using process_species_names() we standardize our species names and our data is correctly associated as referring to the same species.
geotest$Species = process_species_names(geotest$Species)
geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
)
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "green")
geonames
human_names | model_names |
---|---|
|
|
|
|
|
|
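To make the idea of standardization concrete, here is a base R sketch. It is not the implementation of process_species_names(): `standardize_name()` is a hypothetical helper, and the rule it applies (lower case, genus reduced to its initial) is an assumption chosen to resemble labels such as `a. holzapfelae` that appear in the performance report.

```r
# Hypothetical helper: lower-case the name and abbreviate the genus
standardize_name <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("\\s+", " ", x)
  # Replace the first word (with or without a trailing dot) by its initial
  sub("^([a-z])[a-z]*\\.? ", "\\1. ", x)
}

human <- c("Araneus holzapfelae", "Araneus holzapfelae", "Araneus holzapfelae")
model <- c("Araneus holzapfelae", "A. holzapfelae", "Araneus holzapfelae")

# Row 2 differs before standardization
which(human != model)

# No rows differ once both columns are standardized
which(standardize_name(human) != standardize_name(model))
```

Abbreviating the genus can merge distinct species that share an initial and an epithet, so a rule like this trades some precision for robustness.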
Often it pays off to be suspicious of data generated automatically through machine learning (one could argue this is true of human-generated data as well). For this we’ll use the utilities in package gecko, which arete calls. In order for it to work, gecko needs to be set up, which we recommend you do after reading the documentation of functions gecko::gecko.setDir() and gecko::gecko.worldclim(). Setup requires a one-time, potentially heavy download of an environmental dataset, WorldClim. Function gecko::outliers.detect() will use this data to determine which points are likely outliers through different methods, including calculating the environmental and geographic distance between points and training a support vector machine model on supplied data. The outcome of each method is collected in a separate column, and the total number of methods suggesting a given point as an outlier is shown in column possible.outliers. We then have:
geoout = gecko::outliers.detect(geocoords[2:1])
#> All dimensions are missing at least one value. Trying rows.
kableExtra::kable(geoout)
x_coords | y_coords | env | geo | possible.outliers |
---|---|---|---|---|
29.04000 | -22.99000 | FALSE | TRUE | 1 |
29.87000 | -22.94900 | FALSE | FALSE | 0 |
30.84000 | -25.35000 | FALSE | FALSE | 0 |
32.23700 | -25.12300 | FALSE | FALSE | 0 |
28.77000 | -25.80000 | FALSE | FALSE | 0 |
28.19000 | -25.74000 | FALSE | FALSE | 0 |
31.88000 | -28.72000 | TRUE | FALSE | 1 |
32.10000 | -28.78000 | TRUE | FALSE | 1 |
32.25000 | -27.63000 | FALSE | FALSE | 0 |
32.24543 | -27.62174 | FALSE | FALSE | 0 |
30.64000 | -28.35900 | FALSE | FALSE | 0 |
29.91060 | -29.49870 | FALSE | FALSE | 0 |
29.04000 | -22.99000 | FALSE | TRUE | 1 |
29.87000 | -22.94900 | FALSE | FALSE | 0 |
30.84000 | -25.35000 | FALSE | FALSE | 0 |
32.23700 | -25.12300 | FALSE | FALSE | 0 |
28.78000 | -25.85000 | FALSE | FALSE | 0 |
28.19000 | -25.74000 | FALSE | FALSE | 0 |
31.88000 | -28.72000 | TRUE | FALSE | 1 |
32.10000 | -29.78000 | NA | FALSE | NA |
32.24543 | -27.62174 | FALSE | FALSE | 0 |
30.64000 | -28.35900 | FALSE | FALSE | 0 |
29.91060 | -29.49870 | FALSE | FALSE | 0 |
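A quick way to act on this table is to pull out every point that at least one method flagged, plus any point a method could not evaluate. The sketch below rebuilds a few rows of the table by hand and recomputes the count as a plain row sum of the method columns; that this matches how gecko fills possible.outliers internally is an assumption, although it does reproduce the NA seen for the point at (32.10, -29.78).

```r
# A few rows copied from the outlier table above
flags <- data.frame(
  x_coords = c(29.04, 29.87, 31.88, 32.10),
  y_coords = c(-22.99, -22.949, -28.72, -29.78),
  env = c(FALSE, FALSE, TRUE, NA),
  geo = c(TRUE, FALSE, FALSE, FALSE)
)

# Count how many methods flag each point; NA propagates when a method has no result
flags$possible.outliers <- rowSums(as.matrix(flags[, c("env", "geo")]))

# Keep points flagged at least once, or with an undetermined count
to_review <- flags[is.na(flags$possible.outliers) | flags$possible.outliers > 0, ]
to_review
```

Treating NA as "needs review" rather than "not an outlier" is a deliberately cautious choice: a missing result says nothing about whether the point is plausible.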
Finally, we can determine how our model performed by processing all of our data through function performance_report(). This function takes two initial tables of equal formatting, one of human-extracted data and another of model-extracted data, and computes a series of metrics that are helpful to get a sense of where mistakes might be found.
geotest = cbind(geotest[, 1:2], geocoords, geotest[, 4:5])
geotest = list(
  GT = geotest[geotest$Type == "Ground truth", 1:5],
  MD = geotest[geotest$Type == "Model", 1:5]
)
geo_report = performance_report(
  geotest$GT, geotest$MD,
  full_locations = "both", verbose = FALSE, rmds = FALSE
)
For locations, the Levenshtein distance is calculated between terms. For coordinates, it creates one confusion matrix for every species in common between sets. These are composed of True Positives (TP, perfectly matching coordinates from both tables), False Positives (FP, coordinates showing up only in the model-extracted data) and False Negatives (FN, coordinates showing up only in the human-extracted data). True Negatives are assumed to not apply. Several metrics are then calculated using the confusion matrix, including accuracy, precision, recall and the F1 score, the details of which can be found in the documentation of performance_report(). An additional global confusion matrix is created which also includes errors (FP and FN) that are the result of species unique to each set. More metrics appear in the extended reports created through rmds = TRUE, including versions of the metrics already mentioned that are weighted by the degree of error being shown, i.e., if the model hallucinates a data point that is close to existing points, its weight as a False Positive is less than if it hallucinated a data point completely different from all other points.
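To give a feel for the location metric, the sketch below computes, for each human-extracted location of paper 1, the smallest Levenshtein distance to any model-extracted location, using `utils::adist()` from base R. This is only an illustration of the general idea: performance_report() has its own options (such as full_locations) that change which strings are compared, so these numbers are not expected to match the package's report.

```r
# Locations of paper 1, copied from the example table
human_locs <- c("Limpopo: Blouberg Nature Reserve",
                "Little Leigh Farm, Louis Trichardt",
                "Mpumalanga: Brondal")
model_locs <- c("Limpopo: Blouberg",
                "Little Leigh Farm, Louis Trichardt",
                "Mpumalanga: Brondal")

# Pairwise Levenshtein distances: rows are human terms, columns are model terms
d <- utils::adist(human_locs, model_locs)

# Closest model term for each human term, then the average over the paper
min_dist <- apply(d, 1, min)
min_dist
mean(min_dist)
```

Here only the first location contributes, because the model dropped " Nature Reserve" (15 characters) while the other two terms match exactly.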
geo_report
#> $levenshtein
#> nchar mean_minimum_levenshtein file
#> 1 38 20 1
#> 2 41 0 1
#> 3 38 0 1
#> 4 38 0 2
#> 5 41 0 2
#> 6 38 11 2
#> 7 38 0 3
#> 8 41 0 3
#> 9 38 38 3
#> 10 38 15 4
#> 11 41 0 4
#> 12 38 0 4
#>
#> $mean_minimum_levenshtein
#> 1 2 3 4
#> 6.666667 3.666667 12.666667 5.000000
#>
#> $adjusted_m_m_levenshtein
#> 1 2 3 4
#> 0.17543860 0.09649123 0.33333333 0.13157895
#>
#> $`1_a. holzapfelae`
#> TRUE FALSE
#> TRUE 3 0
#> FALSE 0 NA
#>
#> $exclusive_to_each_set
#> set file species count
#> [1,] "human" "2" "a. ataecina" "1"
#> [2,] "model" "4" "m. calpeiana" "1"
#>
#> $`2_a. holzapfelae`
#> TRUE FALSE
#> TRUE 1 2
#> FALSE 1 NA
#>
#> $`3_a. holzapfelae`
#> TRUE FALSE
#> TRUE 1 0
#> FALSE 1 NA
#>
#> $`4_a. holzapfelae`
#> TRUE FALSE
#> TRUE 2 0
#> FALSE 1 NA
#>
#> $global
#> TRUE FALSE
#> TRUE 7 3
#> FALSE 4 NA
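As a worked example, the textbook definitions of precision, recall and F1 can be applied by hand to the `$global` matrix printed above. Reading it as TP = 7, FP = 3 and FN = 4 assumes that rows refer to the model and columns to the human data, which is an inference from the per-paper matrices and the example table rather than something stated in the output. performance_report() may define or weight its own metrics differently, so treat this as a sanity check only.

```r
# Counts read from the $global confusion matrix (row/column roles are an assumption)
tp <- 7  # coordinates present in both tables
fp <- 3  # coordinates present only in the model table
fn <- 4  # coordinates present only in the human table

# Standard definitions, with no True Negatives involved
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

With these counts, precision is 0.7, recall is about 0.636 and F1 is about 0.667: the model's points are right more often than the human points are recovered.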