Help for package KOR.addrlink

Type:

Package

Title:

Matching Address Data to Reference Index

Version:

1.0.1

Date:

2024-03-02

Author:

Daniel Schürmann [aut, cre]

Maintainer:

Daniel Schürmann <d.schuermann@2718282.net>

Depends:

R (≥ 3.4)

Imports:

stringdist, stringi

LazyData:

true

Description:

Matches a data set with semi-structured address data, e.g., street and house number as a concatenated string, wrongly spelled street names or non-existing house numbers to a reference index. The methods are specifically designed for German municipalities ('KOR'-community) and German address schemes.

License:

GPL-3

Encoding:

UTF-8

URL:

https://git-kor.stadtdo.de

BugReports:

https://git-kor.stadtdo.de/stadt-dortmund/adressdaten/-/issues

NeedsCompilation:

Packaged:

2024-03-02 14:22:21 UTC; scd01

Repository:

CRAN

Date/Publication:

2024-03-05 11:10:13 UTC

KOR.addrlink

Description

Geocode address data from German municipalities

Details

split_address Splits strings into street, house number and additional letter
split_number Splits strings into house number and additional letter
addrlink Matches split address data to a reference table

Matching is based on street name, house number and additional letter.

Author(s)

Daniel Schürmann

Address data from the city of Dortmund

Description

This data set gives all the addresses in the city of Dortmund.

Usage

Adressen

Format

A data.frame

STRNAME	character	street name
STRSL	numeric	street number
HNR	numeric	house number
HNRZ	character	additional letter
RW	numeric	longitude
HW	numeric	latitude
UBZ	numeric	subdistrict number

Source

https://open-data.dortmund.de

Merge Data To Reference Index

Description

Takes two data.frames with address data and merges them together.

Usage

addrlink(df_ref, df_match, 
col_ref = c("Strasse", "Hausnummer", "Hausnummernzusatz"), 
col_match = c("Strasse", "Hausnummer", "Hausnummernzusatz"), 
fuzzy_threshold = 0.9, seed = 1234)

Arguments

df_ref

data.frame with address references

df_match

data.frame with addresses to be matched

col_ref

character vector of length three, naming the df_ref columns which contain the street names, house numbers and additional letters (in that order)

col_match

character vector of length three, naming the df_match columns which contain the street names, house numbers and additional letters (in that order)

fuzzy_threshold

The threshold used for fuzzy matching street names

seed

Seed for random numbers

Details

The matching is done in four stages.

Stage 1 (qAdress = 1). This is an exact match (highest quality, qscore = 1)

Stage 2 (qAdress = 2). Exact match on street name, but no valid house number could be found. Be aware that random house numbers might be used. Consider setting your own seed. qscore indicates the match quality. See match_number for details.

Stage 3 (qAdress = 3). No exact match on street name could be found. Street names are fuzzy matched. The method "jw" (Jaro-Winkler distance) from package stringdist is used (see stringdist-metrics). If 1 - [Jaro-Winkler distance] is greater than fuzzy_threshold, a match is assumed. The highest score is taken and house number matching is done as outlined in Stage 2. qscore is fuzzy_score*[house number score].

Stage 4 (qAdress = 4). No match (qscore = 0)
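
The four stages can be summarised as a small decision rule. The following is a minimal sketch of how qAdress and qscore relate, not the package's implementation. The function classify_match and its arguments are hypothetical names introduced here for illustration. It assumes that the street and house number comparisons have already been carried out elsewhere.

```r
# Illustrative sketch only; classify_match is a hypothetical helper,
# not a function exported by KOR.addrlink.
classify_match <- function(street_exact, number_exact,
                           fuzzy_score, number_score,
                           fuzzy_threshold = 0.9) {
  # Stage 1: exact match on street, house number and additional letter
  if (street_exact && number_exact) {
    return(list(qAdress = 1L, qscore = 1))
  }
  # Stage 2: street matches exactly, house number had to be approximated
  if (street_exact) {
    return(list(qAdress = 2L, qscore = number_score))
  }
  # Stage 3: street matched fuzzily, i.e. 1 - distance exceeds the threshold
  if (fuzzy_score > fuzzy_threshold) {
    return(list(qAdress = 3L, qscore = fuzzy_score * number_score))
  }
  # Stage 4: no match
  list(qAdress = 4L, qscore = 0)
}

# A fuzzy street match (0.95) combined with a house number score of 0.8
classify_match(FALSE, FALSE, fuzzy_score = 0.95, number_score = 0.8)
```

In this sketch a fuzzy street score at or below fuzzy_threshold always falls through to Stage 4, regardless of the house number score.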

Value

A list

ret

The merged dataset

QA

The quality markers (qAdress and qscore)

Author(s)

Daniel Schürmann

Example dataset 1

Description

This dataset contains separate street and house number information.

Usage

df1

Format

A data.frame

gross_strasse	character	street names
hausnr	character	house number and additional letter
Var1	numeric	Variable 1
Var2	character	Variable 2

Source

Dortmunder Statistik

Example dataset 2

Description

This dataset contains concatenated street and house number information.

Usage

df2

Format

A data.frame

Adresse	character	street name, house number and additional letter
Var1	numeric	Variable 1
Var2	character	Variable 2

Source

Dortmunder Statistik

Splits A Single Address Into Street, House Number And Additional Letter

Description

This is an internal function. Please use split_address instead.

Usage

helper_split_address(x, debug = FALSE)

Arguments

x

A character vector of length 1

debug

If true, print(x)

Value

A list with three elements

strasse

Extracted street name

hnr

Extracted house number

hnrz

Extracted extra letter

Author(s)

Daniel Schürmann

Splits A Single House Number Into House Number And Additional Letter

Description

This is an internal function. Please use split_number instead.

Usage

helper_split_number(x, debug = FALSE)

Arguments

x

A character vector of length 1

debug

If true, print(x)

Value

A data.frame with two elements

Hausnummer

Extracted house number

Zusatz

Extracted extra letter

Author(s)

Daniel Schürmann

Calculate L1-Distance Based Scores

Description

Reversed normalized absolute distance from zero.

Usage

l1score(x)

Arguments

x

A numeric vector

Details

1 - \frac{|x|}{\text{max}\{1, |x|\}}

Value

A numeric vector of the same length as x
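
The formula in Details can be written out in base R. This is a sketch under one assumption: that the maximum in the denominator is taken over 1 and all absolute values in x, so that the largest distance in the vector scores 0. The name l1score_sketch is hypothetical and chosen to avoid confusion with the package function.

```r
# Illustrative sketch only; not the package's source code.
# Assumption: max() runs over 1 and every element of abs(x).
l1score_sketch <- function(x) {
  ax <- abs(x)
  1 - ax / max(1, ax)
}

# Distances 0, 1, 2 and 4 to some target value
l1score_sketch(c(0, 1, 2, 4))
# 1.00 0.75 0.50 0.00
```

Because the denominator is never smaller than 1, a vector of zeros yields a score of 1 for every element instead of a division by zero.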

Author(s)

Daniel Schürmann

Find Best House Number Match Within Given Street

Description

This is an internal function. Please use addrlink instead.

Usage

match_number(record, Adressen, weights = c(0.9, 0.1))

Arguments

record

data.frame with one row and three columns (Strasse, Hausnummer, Hausnummernzusatz)

Adressen

data.frame of all valid addresses (same columns as record data.frame)

weights

The weighting factors for house number and additional letter (in that order)

Details

If no house number and no additional letter is provided, a random address in the given street is selected (qscore = 0).

If only an additional letter but no house number is given and the letter is unique, returns the corresponding record (qscore = 0.05). Otherwise returns a random one as mentioned above (qscore = 0).

If no additional letter, but a house number is provided and the maximum distance to a valid house number is 4, returns the closest match as calculated by l1score (qscore is the result of l1score). Otherwise a random record is returned (qscore = 0).

If additional letter and house number are available and the house number distance is smaller than 4, calculates the l1scores of the house number distance and the additional letter distance and selects the best match (qscore is the sum of both weighted l1scores). Otherwise a random record is selected (qscore = 0).
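
The last case, in which both a house number and an additional letter are given, can be illustrated with a small base R sketch. It is not the package's internal code. It assumes that letters are compared by their position in the alphabet (with an empty letter counted as 0) and that the l1 scores are normalised across all candidates. The candidate data and the helpers l1 and letter_pos are hypothetical.

```r
# Illustrative sketch only; not the package's internal code.
# Assumed scoring helper, following the l1score formula in this manual.
l1 <- function(x) 1 - abs(x) / max(1, abs(x))

# Hypothetical candidate addresses within one street
# (all within a house number distance below 4).
candidates <- data.frame(
  Hausnummer = c(10, 12, 12, 14),
  Hausnummernzusatz = c("", "a", "d", ""),
  stringsAsFactors = FALSE
)
wanted_nr <- 12
wanted_letter <- "b"
weights <- c(0.9, 0.1)  # default weights from the Usage section

# Assumption: letters are compared by alphabet position, "" counts as 0.
letter_pos <- function(z) ifelse(z == "", 0L, match(tolower(z), letters))

d_nr <- candidates$Hausnummer - wanted_nr
d_letter <- letter_pos(candidates$Hausnummernzusatz) - letter_pos(wanted_letter)

# Weighted sum of both l1 scores; the best candidate has the highest value.
qscore <- weights[1] * l1(d_nr) + weights[2] * l1(d_letter)
best <- which.max(qscore)
candidates[best, ]
```

Under these assumptions the record 12 a scores highest (0.95), ahead of 12 d (0.9), because the house number dominates through its weight of 0.9 and the letter only breaks the tie.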

Value

A data.frame

qscore

The quality score of the match

Strasse

matched street

Hausnummer

matched house number

Hausnummernzusatz

matched additional letter

Author(s)

Daniel Schürmann

Clean Street Names And Make Them Mergeable

Description

This function replaces Umlauts, expands "str" to "strasse", transliterates all non-ascii characters, removes punctuation and converts to lower case.

Usage

sanitize_street(x)

Arguments

x

A character vector containing the street names

Details

This is an internal function used in addrlink. Make sure house numbers have already been extracted. Use split_number or split_address for that. Only street names can go into sanitize_street.

Value

A character vector of the same length as x containing the sanitized street names.
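
The cleaning steps named in the Description can be approximated in base R. This is a simplified sketch, not the package's implementation. It handles only the German umlauts and "ß", skips the general transliteration of other non-ASCII characters, and assumes that punctuation is removed without a replacement character. The name sanitize_street_sketch is hypothetical.

```r
# Illustrative sketch only; the real sanitize_street may behave differently.
sanitize_street_sketch <- function(x) {
  # Replace umlauts and sharp s (upper and lower case) by ASCII digraphs
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae", "oe", "ue", "ae", "oe", "ue", "ss")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x <- tolower(x)
  # Expand the abbreviation "str." and a trailing "str" to "strasse"
  x <- gsub("str\\.", "strasse", x)
  x <- gsub("str$", "strasse", x)
  # Assumption: punctuation is dropped without inserting a space
  x <- gsub("[[:punct:]]", "", x)
  trimws(x)
}

sanitize_street_sketch(c("Teststr.", "M\u00fcller-Stra\u00dfe", "K\u00f6lner Str"))
# "teststrasse" "muellerstrasse" "koelner strasse"
```

Normalising both the reference index and the data to be matched in the same way is what makes an exact comparison of street names meaningful before any fuzzy matching is attempted.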

Author(s)

Daniel Schürmann

Split Addresses Into Street, House Number And Additional Letter

Description

This function takes a character vector where each element is made up from a concatenation of street name, house number and possibly an additional letter and splits it into its parts.

Usage

split_address(x, debug = FALSE)

Arguments

x

A character vector

debug

If true, all records will be printed to the console

Details

If the function fails, consider using debug = TRUE. This will print the record that caused the error. Consider filing an issue on the linked git project (see DESCRIPTION).

Value

A data.frame with three columns

Strasse

A character column containing the extracted street names

Hausnummer

House number

Hausnummernzusatz

Additional letter

Note

For a more advanced, general purpose solution see libpostal.

Author(s)

Daniel Schürmann

Examples

split_address(c("Teststr. 8-9 a", "Erster Weg 1-2", "Ahornallee 100a-102c"))

Split house number into house number and additional letter

Description

This function takes a character vector where each element is made up from a concatenation of house number and possibly an additional letter and splits it into its parts.

Usage

split_number(x, debug = FALSE)

Arguments

x

A character vector

debug

If true, all records will be printed to the console

Details

If the function fails, consider using debug = TRUE. This will print the record that caused the error. Consider filing an issue on the linked git project (see DESCRIPTION).

Value

A data.frame with two columns

Hausnummer

House number

Hausnummernzusatz

Additional letter
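
For the simplest inputs, the split can be sketched with a regular expression in base R. This is a deliberately reduced illustration and not the package's parser. It only recognises a single house number followed by at most one letter and returns NA for anything else, including ranges such as "8-9 a", which split_number itself accepts. The name split_number_sketch is hypothetical.

```r
# Illustrative sketch only; handles "12", "12a" and "7 B", but not ranges.
split_number_sketch <- function(x) {
  x <- trimws(x)
  # Group 1: digits, group 2: optional single letter
  m <- regmatches(x, regexec("^([0-9]+)[[:space:]]*([A-Za-z]?)$", x))
  data.frame(
    Hausnummer = vapply(
      m, function(p) if (length(p) == 3) as.numeric(p[2]) else NA_real_,
      numeric(1)
    ),
    Hausnummernzusatz = vapply(
      m, function(p) if (length(p) == 3) p[3] else NA_character_,
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

split_number_sketch(c("12", "12a", "7 B"))
```

The column names follow the Value section above, so the result of such a sketch could be passed on in the same way as the output of split_number.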

Note

For a more advanced, general purpose solution see libpostal.

Author(s)

Daniel Schürmann

Examples

split_number(c("8-9 a", "1-2", "100a-102c"))
