Help for package lightsf

Type: Package
Title: A Curated Collection of Georeferenced and Spatial Datasets
Version: 0.1.0
Maintainer: Ingrid Romero Pinilla <ingridpinilla11@gmail.com>
Description: Provides a diverse collection of georeferenced and spatial datasets from different domains including urban studies, housing markets, environmental monitoring, transportation, and socio-economic indicators. The package consolidates datasets from multiple open sources such as Kaggle, chopin, spData, adespatial, and bivariateLeaflet. It is designed for researchers, analysts, and educators interested in spatial analysis, geostatistics, and geographic data visualization. The datasets include point patterns, polygons, socio-economic data frames, and network-like structures, allowing flexible exploration of geospatial phenomena.
License: GPL-3
URL: https://github.com/roming20/lightsf, https://roming20.github.io/lightsf/
BugReports: https://github.com/roming20/lightsf/issues
Encoding: UTF-8
LazyData: true
Suggests: ggplot2, dplyr, testthat (≥ 3.0.0), knitr, rmarkdown
RoxygenNote: 7.3.2
Config/testthat/edition:
VignetteBuilder: knitr
NeedsCompilation:
Packaged: 2025-10-14 03:40:58 UTC; ingri
Author: Ingrid Romero Pinilla [aut, cre]
Depends: R (≥ 3.5.0)
Repository: CRAN
Date/Publication: 2025-10-19 13:10:02 UTC

lightsf: Collection of georeferenced and spatial datasets from different domains

Description

Details

lightsf - Collection of georeferenced and spatial datasets from different domains.
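The dataset entries in this manual use name suffixes such as pts (point data) and poly (data that can be linked to polygons). As a rough sketch, the names documented here can be grouped by suffix with base R. The vector below is typed in by hand from this manual and is not read from the installed package.

```r
# Dataset names as documented in this manual (typed by hand).
dataset_names <- c(
  "afcon_poly", "atropellados_pts", "auckland_poly", "bacprodxy_pts",
  "baltimore_pts", "boston_pts", "coffee_poly", "columbus_poly",
  "conafchile_pts", "countries_pts", "cyclehire_pts", "dc_poly",
  "housing_pts", "mastigouche_poly", "nc_points", "worldbank_poly"
)

# Treat everything after the last underscore as the suffix.
name_suffix <- sub("^.*_", "", dataset_names)

# Group the dataset names by suffix and show how many fall in each group.
by_suffix <- split(dataset_names, name_suffix)
print(lengths(by_suffix))
```

Note that nc_points uses the longer suffix points rather than pts, so it forms its own group.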

Author(s)

Maintainer: Ingrid Romero Pinilla <ingridpinilla11@gmail.com>

Spatial Patterns of Conflict in Africa (1966–1978)

Description

This dataset, 'afcon_poly', is a data frame summarizing spatial patterns of conflict across 42 African countries between 1966 and 1978. The dataset was originally used in Anselin (1995) to study spatial autocorrelation in political conflict. It excludes South West Africa, Spanish Equatorial Africa, and Spanish Sahara. The dataset includes centroid coordinates, country names, and the total number of recorded conflicts during this period.

Usage

data(afcon_poly)

Format

A data frame with 42 observations and 5 variables:

x: Longitude coordinate of the country centroid (numeric)
y: Latitude coordinate of the country centroid (numeric)
totcon: Total number of conflicts recorded, 1966–1978 (numeric)
name: Name of the country (factor with 42 levels)
id: Numeric country identifier (numeric)

Details

The dataset consists of 42 observations (countries) and 5 variables.

The dataset name has been kept as 'afcon_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4.

References

Anselin, L. (1995). Local Indicators of Spatial Association—LISA. *Geographical Analysis*, 27(2), 93–115.

Georeferenced Pedestrian Car Collisions (2015, Santiago de Chile)

Description

This dataset, atropellados_pts, is a data frame containing information on pedestrian car collisions that occurred in Santiago de Chile in 2015. Each record includes the geographical coordinates of the accident, location description, and the number of victims categorized by severity (fatal, serious, less serious, and minor).

Usage

data(atropellados_pts)

Format

A data frame with 1,841 observations and 8 variables:

X: Longitude coordinate of the accident (numeric)
Y: Latitude coordinate of the accident (numeric)
Ubicacion: Location description of the accident (character)
Fallecidos: Number of fatalities (integer)
Graves: Number of serious injuries (integer)
MenosGrave: Number of less serious injuries (integer)
Leve: Number of minor injuries (integer)
Accidentes: Total number of accidents at the location (integer)
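Because victims are split across four severity columns, a per-record total may be convenient. The sketch below assumes the total is simply the row sum of those columns. The helper total_victims() is hypothetical, and the toy rows are made up to mimic the documented layout; they are not taken from atropellados_pts.

```r
# Toy rows that only mimic the documented column layout; values are made up.
toy <- data.frame(
  Fallecidos = c(0L, 1L, 0L),
  Graves     = c(1L, 0L, 0L),
  MenosGrave = c(0L, 0L, 2L),
  Leve       = c(2L, 0L, 1L)
)

# Hypothetical helper: add up the four severity columns row by row.
total_victims <- function(df) {
  severity_cols <- c("Fallecidos", "Graves", "MenosGrave", "Leve")
  rowSums(df[, severity_cols])
}

toy$victims <- total_victims(toy)
print(toy)
```

With the package installed, the same helper could presumably be applied as total_victims(atropellados_pts).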

Details

The dataset name has been kept as 'atropellados_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from Kaggle: https://www.kaggle.com/datasets/sandorabad/georeferenced-car-accidents-santiago-de-chile?select=AtropellosGS2015.csv

Infant Mortality in Auckland, New Zealand (1977–1985)

Description

This dataset, 'auckland_poly', is a data frame containing information on infant mortality in census area units (CAUs) of Auckland, New Zealand. The dataset has 167 rows, each corresponding to a CAU, and 4 columns with geographic coordinates and mortality-related statistics. It is often used in spatial epidemiology studies and in demonstrations of spatial analysis methods.

Usage

data(auckland_poly)

Format

A data frame with 167 observations and 4 variables:

Easting: Easting coordinate (numeric)
Northing: Northing coordinate (numeric)
Deaths.1977.85: Number of infant deaths between 1977 and 1985 (numeric)
Under.5.1981: Population under age 5 in 1981 (numeric)
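The two count columns can be combined into a crude rate. The sketch below assumes the 1981 under-5 population stands in for the population at risk in each of the nine years from 1977 to 1985; that is an assumption for illustration, not a method stated in this entry. The helper crude_rate() is hypothetical and the toy values are made up.

```r
# Toy rows mimicking the documented columns; values are made up.
toy <- data.frame(
  Deaths.1977.85 = c(2, 0, 5),
  Under.5.1981   = c(400, 250, 1000)
)

# Hypothetical helper: crude deaths per 1,000 children under 5 per year.
# Assumes the 1981 count approximates the population at risk in every year.
crude_rate <- function(deaths, under5, years = 9) {
  1000 * deaths / (years * under5)
}

toy$rate <- crude_rate(toy$Deaths.1977.85, toy$Under.5.1981)
print(toy)
```

Areas with small populations give unstable crude rates, so a rate like this is best read alongside the underlying counts.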

Details

In addition to the 'auckland_poly' data frame, the original source also provides two related spatial objects: 'auckland.nb', a neighbour list of CAUs based on contiguity, and 'auckpolys', a polylist object representing polygon boundaries. These are not included here, but can be generated from the original dataset using spatial analysis workflows.

The dataset name has been kept as 'auckland_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4.

Bacterial Production Sampling Points in Lake St. Pierre (2005)

Description

This dataset, bacprodxy_pts, is a data frame containing the geographical coordinates (longitude and latitude) of 25 sampling locations where bacterial production was measured in Lake St. Pierre (Québec, Canada). The samples were collected on August 18, 2005.

Usage

data(bacprodxy_pts)

Format

A data frame with 25 observations and 2 variables:

Longitude: Longitude coordinate of the sampling point (numeric)
Latitude: Latitude coordinate of the sampling point (numeric)

Details

The dataset name has been kept as 'bacprodxy_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the adespatial package version 0.3-28

Housing Sales in Baltimore, Maryland (1978)

Description

This dataset, 'baltimore_pts', is a data frame containing housing sales data and property characteristics for Baltimore, Maryland, in 1978. It has been widely used in spatial econometrics and hedonic regression studies. Each row corresponds to a house, including sale price, structural attributes, lot size, and geographic coordinates (X, Y) on the Maryland grid (projection type unknown).

Usage

data(baltimore_pts)

Format

A data frame with 211 observations and 17 variables:

STATION: Census tract station identifier (integer)
PRICE: House sale price (numeric)
NROOM: Number of rooms (numeric)
DWELL: Dwelling type indicator (numeric)
NBATH: Number of bathrooms (numeric)
PATIO: Presence of patio (numeric indicator)
FIREPL: Presence of fireplace (numeric indicator)
AC: Presence of air conditioning (numeric indicator)
BMENT: Presence of basement (numeric indicator)
NSTOR: Number of stories (numeric)
GAR: Presence of garage (numeric indicator)
AGE: Age of the dwelling (numeric)
CITCOU: City/county indicator (numeric)
LOTSZ: Lot size (numeric)
SQFT: Interior square footage (numeric)
X: X coordinate (numeric)
Y: Y coordinate (numeric)

Details

The dataset consists of 211 observations (houses) and 17 variables.

The dataset name has been kept as 'baltimore_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4.

Boston Housing Data with Geographic Coordinates

Description

This dataset, boston_pts, is a data frame containing information on housing values and neighborhood characteristics in the Boston area. It is based on the classic dataset by Harrison and Rubinfeld (1978), which Gilley and Pace (1996) corrected for minor errors and augmented with the latitude and longitude of the observations. Gilley and Pace also note that the MEDV variable is censored, with values at or over USD 50,000 set to USD 50,000.

Usage

data(boston_pts)

Format

A data frame with 506 observations and 20 variables:

TOWN: Town name (factor with 92 levels)
TOWNNO: Town number (integer)
TRACT: Census tract number (integer)
LON: Longitude (numeric)
LAT: Latitude (numeric)
MEDV: Median value of owner-occupied homes in USD 1,000s (numeric, censored at 50)
CMEDV: Corrected median value of owner-occupied homes (numeric)
CRIM: Per capita crime rate by town (numeric)
ZN: Proportion of residential land zoned for lots over 25,000 sq.ft. (numeric)
INDUS: Proportion of non-retail business acres per town (numeric)
CHAS: Charles River dummy variable (factor: "0" = not bounded, "1" = bounded)
NOX: Nitric oxides concentration (parts per 10 million, numeric)
RM: Average number of rooms per dwelling (numeric)
AGE: Proportion of owner-occupied units built prior to 1940 (numeric)
DIS: Weighted distances to five Boston employment centers (numeric)
RAD: Index of accessibility to radial highways (integer)
TAX: Full-value property-tax rate per $10,000 (integer)
PTRATIO: Pupil-teacher ratio by town (numeric)
B: Transformed proportion of Black residents by town, computed as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents (numeric)
LSTAT: Percentage of the population of lower status (numeric)
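Since MEDV is recorded in USD 1,000s and capped at 50, censored tracts can be flagged before modelling. The following is a minimal sketch; the helper is_censored() is hypothetical and the toy values are made up rather than drawn from boston_pts.

```r
# Toy MEDV values in USD 1,000s; made up for illustration.
toy <- data.frame(MEDV = c(24.0, 50.0, 21.6, 50.0))

# Hypothetical helper: TRUE where the value sits at or above the cap.
is_censored <- function(medv, cap = 50) {
  medv >= cap
}

toy$censored <- is_censored(toy$MEDV)

# Keep only the rows whose value is not at the cap.
uncensored <- toy[!toy$censored, , drop = FALSE]
print(uncensored)
```

Dropping censored rows is only one option; whether to drop them or to use a model that handles censoring depends on the analysis.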

Details

The dataset consists of 506 observations and 20 variables, including socio-economic, environmental, and housing characteristics. Geographic coordinates (longitude and latitude) are provided for spatial analysis. Related data objects include boston.utm, a matrix of tract point coordinates projected to UTM zone 19, and boston.soi, a sphere of influence neighbors list.

The dataset name has been kept as boston_pts to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix pts indicates that the dataset includes spatial point information. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4

World Coffee Production Data

Description

This dataset, coffee_poly, is a tibble containing estimates of global coffee production by country. The data represent thousands of 60 kg bags of coffee produced in 2016 and 2017. It is intended for teaching purposes only and not for research use.

Usage

data(coffee_poly)

Format

A tibble with 47 observations and 3 variables:

name_long: Country name (character)
coffee_production_2016: Coffee production in 2016, in thousands of 60 kg bags (integer)
coffee_production_2017: Coffee production in 2017, in thousands of 60 kg bags (integer)

Details

The dataset consists of 47 observations (countries) and 3 variables, including the country name and production values for two years. The data provide a simple example of tabular international production figures that can be used in spatial and non-spatial analyses.

The dataset name has been kept as coffee_poly to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix poly indicates that the dataset can be linked to polygon boundaries for mapping. The original content has not been modified in any way.
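Linking the table to boundaries amounts to a join on the country name. The sketch below uses base merge() on two toy tables. The boundaries table, its region column, and all values are hypothetical stand-ins for a polygon layer that carries a matching name_long column.

```r
# Toy production table mimicking part of the documented layout; values made up.
production <- data.frame(
  name_long = c("Country A", "Country B"),
  coffee_production_2017 = c(10L, 20L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a polygon layer keyed by name_long.
boundaries <- data.frame(
  name_long = c("Country A", "Country B", "Country C"),
  region    = c("R1", "R1", "R2"),
  stringsAsFactors = FALSE
)

# Left join: keep every boundary row; countries without production get NA.
joined <- merge(boundaries, production, by = "name_long", all.x = TRUE)
print(joined)
```

Keeping all boundary rows (all.x = TRUE) means countries with no recorded production still appear on a map, shown as missing rather than dropped.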

Source

Data taken from the spData package version 2.3.4

Columbus Neighborhood Data (1980)

Description

This dataset, columbus_poly, is a data frame containing socioeconomic and housing characteristics for 49 neighborhoods in Columbus, Ohio, based on 1980 data. The dataset is widely used in spatial econometrics and geographic analysis.

Usage

data(columbus_poly)

Format

A data frame with 49 observations and 22 variables:

AREA: Area of the neighborhood (numeric)
PERIMETER: Perimeter of the neighborhood (numeric)
COLUMBUS.: Identifier variable (integer)
COLUMBUS.I: Identifier variable (integer)
POLYID: Polygon ID (integer)
NEIG: Neighborhood ID (integer)
HOVAL: Housing value (numeric)
INC: Household income (numeric)
CRIME: Crime rate (numeric)
OPEN: Open space (numeric)
PLUMB: Percentage of housing units without plumbing (numeric)
DISCBD: Distance to central business district (numeric)
X: X coordinate of centroid (numeric)
Y: Y coordinate of centroid (numeric)
AREA: Area variable (numeric, duplicated)
NSA: North-south dummy variable (numeric)
NSB: North-south dummy variable (numeric)
EW: East-west dummy variable (numeric)
CP: Core-periphery dummy variable (numeric)
THOUS: Constant equal to 1,000 (numeric)
NEIGNO: Alternative neighborhood identifier, NEIG + 1,000 (numeric)
PERIM: Perimeter variable (numeric, duplicated)

Details

In addition to the attributes, the original dataset also included a polygon list of neighborhood boundaries, a centroid matrix, and a neighbor list object, although these are not part of columbus_poly. The matrix bbs is deprecated but retained in other packages for compatibility.

The dataset name has been kept as columbus_poly to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix poly indicates that the dataset can be linked to polygon boundaries. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4

Georeferenced Forest Fires in Chile (2016–2017 Season)

Description

This dataset, 'conafchile_pts', is a data frame containing georeferenced forest fire records and associated characteristics between July 1, 2016, and June 30, 2017. The dataset includes detailed information such as location, administrative codes, fire causes, vegetation affected, and surface area impacted. The data were compiled by CONAF and correspond to forest fires recorded in Chile.

Usage

data(conafchile_pts)

Format

A data frame with 5,234 observations and 30 variables:

X: Index of the fire record (integer)
temporada: Fire season (character, e.g., "2016-2017")
codreg: Region code (integer)
codprov: Province code (integer)
codcom: Commune code (integer)
ambito: Institutional scope (character, e.g., "Conaf")
numero: Fire identification number (numeric)
nombre_inc: Name of the fire incident (character)
utm_este: UTM Easting coordinate (numeric)
utm_norte: UTM Northing coordinate (numeric)
inicio_c: Location of ignition (character)
combus_i: Initial fuel type (character)
causa_gene: General cause code (numeric)
causa_espe: Specific cause code (character)
pino_0010: Surface with pine (0–10 years old) affected (numeric)
pino_11_17: Surface with pine (11–17 years old) affected (numeric)
pino_18: Surface with pine (18+ years old) affected (numeric)
eucalipto: Surface with eucalyptus affected (numeric)
otras_plan: Surface with other plantations affected (numeric)
total_plan: Total surface of plantations affected (numeric)
arbolado: Surface of woodland affected (numeric)
matorral: Surface of shrubland affected (numeric)
pastizal: Surface of grassland affected (numeric)
total_veg: Total surface of vegetation affected (numeric)
agricola: Surface of agricultural land affected (numeric)
desechos: Surface of waste material affected (numeric)
total_otra: Total surface of other land use affected (numeric)
sup_t_a: Total affected surface area (numeric)
long: Longitude or projected coordinate (numeric)
lat: Latitude or projected coordinate (numeric)

Details

The dataset name has been kept as 'conafchile_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix 'pts' indicates that the dataset contains georeferenced point data. The original content has not been modified in any way.

Source

Data taken from Kaggle: https://www.kaggle.com/datasets/sandorabad/georeferenced-forestfires-2017-chile

Countries Latitude-Longitude Dataset

Description

This dataset, countries_pts, is a data frame containing information on 245 countries, including their names and geographical coordinates (latitude and longitude). It provides a simple reference for mapping and spatial analysis.

Usage

data(countries_pts)

Format

A data frame with 245 observations and 4 variables:

country: Country code or identifier (character)
latitude: Latitude of the country (numeric)
longitude: Longitude of the country (numeric)
name: Country name (character)

Details

The dataset name has been kept as 'countries_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from Kaggle: https://www.kaggle.com/datasets/arviinndn/countries

Cycle Hire Stations in London

Description

This dataset, cyclehire_pts, is an sf object containing point locations of cycle hire stations across London. Each observation represents a hire point with information about its name, area, number of available bikes, and number of empty docking slots at the time of data collection.

Usage

data(cyclehire_pts)

Format

An sf object with 742 observations and 6 variables:

id: Station identifier (integer)
name: Name of the station (factor)
area: Area of London where the station is located (factor with 121 levels)
nbikes: Number of bikes available (integer)
nempty: Number of empty docking slots (integer)
geometry: Point geometry in XY coordinates (sfc_POINT)
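The nbikes and nempty columns can be combined into a simple occupancy share per station. This sketch assumes the total number of docks equals nbikes + nempty, which ignores any docks that were out of service. The helper occupancy() is hypothetical and the toy values are made up.

```r
# Toy rows mimicking the two documented count columns; values are made up.
toy <- data.frame(
  nbikes = c(4L, 0L, 10L),
  nempty = c(12L, 20L, 10L)
)

# Hypothetical helper: share of docks holding a bike.
# Returns NA for a station that reports no docks at all.
occupancy <- function(nbikes, nempty) {
  docks <- nbikes + nempty
  ifelse(docks > 0, nbikes / docks, NA_real_)
}

toy$occupancy <- occupancy(toy$nbikes, toy$nempty)
print(toy)
```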

Details

The dataset name has been kept as cyclehire_pts to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix pts indicates that the dataset contains point geometries. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4

Washington, D.C. Census Tract Data (ACS 2020)

Description

This dataset, 'dc_poly', is an 'sf' object containing population and median household income information for census tracts in Washington, D.C., based on the 2020 American Community Survey (ACS). It also includes spatial polygon geometries, allowing the data to be used directly for mapping and spatial analysis, such as creating choropleth maps of demographic and socioeconomic indicators.

Usage

data(dc_poly)

Format

An 'sf' data frame with 206 observations and 5 variables:

GEOID: Unique identifier for the census tract (character)
NAME: Census tract name and jurisdiction (character)
geometry: Polygon geometry representing the tract boundaries ('sfc_POLYGON')
B01003_001: Total population of the tract (numeric)
B19013_001: Median household income of the tract (numeric, in USD)

Details

The dataset consists of 206 observations (census tracts) and 5 variables. The geometry column contains polygon boundaries for each tract.

The dataset name has been kept as 'dc_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the bivariateLeaflet package version 0.1.0

California Housing Prices (1990 Census)

Description

This dataset, 'housing_pts', is a data frame containing information on median house prices for California districts, derived from the 1990 census. It includes geographic coordinates, demographic and housing characteristics, and district-level income and housing attributes. The dataset consists of 20,640 observations and 10 variables. Missing values may be present in some variables.

Usage

data(housing_pts)

Format

A data frame with 20,640 observations and 10 variables:

longitude: Longitude coordinate of the district (numeric)
latitude: Latitude coordinate of the district (numeric)
housing_median_age: Median age of houses in the district (numeric)
total_rooms: Total number of rooms in the district (numeric)
total_bedrooms: Total number of bedrooms in the district (numeric)
population: Population of the district (numeric)
households: Number of households in the district (numeric)
median_income: Median income in the district (numeric)
median_house_value: Median house value in the district (numeric, in US dollars)
ocean_proximity: Proximity of the district to the ocean (character string categories)
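Because some variables may contain missing values, it can help to count them per column before analysis. The sketch below uses only base R on a toy table; the values are made up and only the column names follow the documented layout.

```r
# Toy rows using three of the documented column names; values are made up.
toy <- data.frame(
  total_rooms    = c(1000, 2500, 1800),
  total_bedrooms = c(200, NA, 350),
  median_income  = c(3.5, 4.1, NA)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing values in any column.
complete <- toy[complete.cases(toy), , drop = FALSE]
print(nrow(complete))
```

The same two calls should work unchanged on housing_pts once it is loaded.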

Details

The dataset name has been kept as 'housing_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The suffix 'pts' indicates that the dataset contains georeferenced point data. The original content has not been modified in any way.

Source

Data taken from Kaggle: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Mastigouche Lake Network Data Set

Description

This dataset, mastigouche_poly, is a list containing spatial and network information for 42 lakes in the Mastigouche region. The dataset includes the XY geographical coordinates of the lakes and a site-by-edge matrix describing how the lakes influence each other. The network is defined by 66 directional edges of influence between the lakes.

Usage

data(mastigouche_poly)

Format

A list with 2 elements:

xy: A data frame with 42 observations and 2 variables: X (numeric), Y (numeric) coordinates of the lakes
siteEdge: An integer site-by-edge matrix describing 66 edges of influence among lakes
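A site-by-edge matrix has one row per site and one column per edge. The sketch below assumes that a nonzero entry marks an edge linked to a site; the exact encoding is not stated in this entry, so treat that as an assumption. The toy matrix is made up and much smaller than the documented 42 lakes and 66 edges.

```r
# Toy site-by-edge matrix: 3 sites (rows) by 2 edges (columns); made up.
site_edge <- matrix(
  c(1L, 0L,
    1L, 1L,
    0L, 1L),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("site", 1:3), paste0("edge", 1:2))
)

# Number of edges linked to each site (assuming nonzero means linked).
edges_per_site <- rowSums(site_edge != 0)

# Number of sites linked to each edge.
sites_per_edge <- colSums(site_edge != 0)

print(edges_per_site)
print(sites_per_edge)
```

On the real object, dim(mastigouche_poly$siteEdge) would be a quick check that the rows and columns match the documented counts.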

Details

The dataset name has been kept as 'mastigouche_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the adespatial package version 0.3-28

Mildly Clustered Points in North Carolina, United States

Description

This dataset, 'nc_points', is a data frame containing a set of spatial point coordinates representing mildly clustered points in North Carolina, United States. The dataset consists of 2,304 observations and 2 variables, corresponding to the X and Y coordinates of the points. The data can be used for examples of point pattern analysis, clustering, or spatial statistics.

Usage

data(nc_points)

Format

A data frame with 2,304 observations and 2 variables:

X: X coordinate (numeric)
Y: Y coordinate (numeric)

Details

The dataset name has been kept as 'nc_points' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The suffix does not include '_df' because the dataset primarily represents a spatial point pattern rather than general tabular survey data. The original content has not been modified in any way.

Source

Data taken from the chopin package version 0.9.4

World Bank Socioeconomic Indicators by Country

Description

This dataset, worldbank_poly, is a data frame containing selected socioeconomic indicators compiled from the World Bank. The dataset includes 177 observations (countries) and 7 variables such as Human Development Index (HDI), urban population percentage, unemployment rate, population growth, and literacy rate. Some values may be missing.

Usage

data(worldbank_poly)

Format

A data frame (tibble) with 177 observations and 7 variables:

name: Country name (character)
iso_a2: ISO 2-letter country code (character)
HDI: Human Development Index (numeric)
urban_pop: Urban population percentage (numeric)
unemployment: Unemployment rate (numeric)
pop_growth: Population growth rate (numeric)
literacy: Literacy rate (numeric)

Details

The dataset name has been kept as 'worldbank_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.

Source

Data taken from the spData package version 2.3.4
