Type: | Package |
Title: | A Curated Collection of Georeferenced and Spatial Datasets |
Version: | 0.1.0 |
Maintainer: | Ingrid Romero Pinilla <ingridpinilla11@gmail.com> |
Description: | Provides a diverse collection of georeferenced and spatial datasets from different domains including urban studies, housing markets, environmental monitoring, transportation, and socio-economic indicators. The package consolidates datasets from multiple open sources such as Kaggle, chopin, spData, adespatial, and bivariateLeaflet. It is designed for researchers, analysts, and educators interested in spatial analysis, geostatistics, and geographic data visualization. The datasets include point patterns, polygons, socio-economic data frames, and network-like structures, allowing flexible exploration of geospatial phenomena. |
License: | GPL-3 |
URL: | https://github.com/roming20/lightsf, https://roming20.github.io/lightsf/ |
BugReports: | https://github.com/roming20/lightsf/issues |
Encoding: | UTF-8 |
LazyData: | true |
Suggests: | ggplot2, dplyr, testthat (≥ 3.0.0), knitr, rmarkdown |
RoxygenNote: | 7.3.2 |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-10-14 03:40:58 UTC; ingri |
Author: | Ingrid Romero Pinilla [aut, cre] |
Depends: | R (≥ 3.5.0) |
Repository: | CRAN |
Date/Publication: | 2025-10-19 13:10:02 UTC |
lightsf: Collection of georeferenced and spatial datasets from different domains
Description
Provides a diverse collection of georeferenced and spatial datasets from different domains including urban studies, housing markets, environmental monitoring, transportation, and socio-economic indicators. The package consolidates datasets from multiple open sources such as Kaggle, chopin, spData, adespatial, and bivariateLeaflet. It is designed for researchers, analysts, and educators interested in spatial analysis, geostatistics, and geographic data visualization. The datasets include point patterns, polygons, socio-economic data frames, and network-like structures, allowing flexible exploration of geospatial phenomena.
Details
lightsf - Collection of georeferenced and spatial datasets from different domains.
Collection of georeferenced and spatial datasets from different domains.
Author(s)
Maintainer: Ingrid Romero Pinilla ingridpinilla11@gmail.com
See Also
Useful links:
Spatial Patterns of Conflict in Africa (1966–1978)
Description
This dataset, 'afcon_poly', is a data frame summarizing spatial patterns of conflict across 42 African countries between 1966 and 1978. The dataset was originally used in Anselin (1995) to study spatial autocorrelation in political conflict. It excludes South West Africa, Spanish Equatorial Africa, and Spanish Sahara. The dataset includes centroid coordinates, country names, and the total number of recorded conflicts during this period.
Usage
data(afcon_poly)
Format
A data frame with 42 observations and 5 variables:
- x
Longitude coordinate of the country centroid (numeric)
- y
Latitude coordinate of the country centroid (numeric)
- totcon
Total number of conflicts recorded, 1966–1978 (numeric)
- name
Name of the country (factor with 42 levels)
- id
Numeric country identifier (numeric)
Details
The dataset consists of 42 observations (countries) and 5 variables.
The dataset name has been kept as 'afcon_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4.
References
Anselin, L. (1995). Local Indicators of Spatial Association—LISA. *Geographical Analysis*, 27(2), 93–115.
Georeferenced Pedestrian Car Collisions (2015, Santiago de Chile)
Description
This dataset, atropellados_pts, is a data frame containing information on pedestrian car collisions that occurred in Santiago de Chile in 2015. Each record includes the geographical coordinates of the accident, location description, and the number of victims categorized by severity (fatal, serious, less serious, and minor).
Usage
data(atropellados_pts)
Format
A data frame with 1,841 observations and 8 variables:
- X
Longitude coordinate of the accident (numeric)
- Y
Latitude coordinate of the accident (numeric)
- Ubicacion
Location description of the accident (character)
- Fallecidos
Number of fatalities (integer)
- Graves
Number of serious injuries (integer)
- MenosGrave
Number of less serious injuries (integer)
- Leve
Number of minor injuries (integer)
- Accidentes
Total number of accidents at the location (integer)
Details
The dataset name has been kept as 'atropellados_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from Kaggle: https://www.kaggle.com/datasets/sandorabad/georeferenced-car-accidents-santiago-de-chile?select=AtropellosGS2015.csv
Infant Mortality in Auckland, New Zealand (1977–1985)
Description
This dataset, 'auckland_poly', is a data frame containing information on infant mortality in census area units (CAUs) of Auckland, New Zealand. The dataset has 167 rows, each corresponding to a CAU, and 4 columns with geographic coordinates and mortality-related statistics. It is often used in spatial epidemiology studies and in demonstrations of spatial analysis methods.
Usage
data(auckland_poly)
Format
A data frame with 167 observations and 4 variables:
- Easting
Easting coordinate (numeric)
- Northing
Northing coordinate (numeric)
- Deaths.1977.85
Number of infant deaths between 1977 and 1985 (numeric)
- Under.5.1981
Population under age 5 in 1981 (numeric)
Details
In addition to the 'auckland_poly' data frame, the original source also provides two related spatial objects: 'auckland.nb', a neighbour list of CAUs based on contiguity, and 'auckpolys', a polylist object representing polygon boundaries. These are not included here, but can be generated from the original dataset using spatial analysis workflows.
The dataset name has been kept as 'auckland_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4.
Bacterial Production Sampling Points in Lake St. Pierre (2005)
Description
This dataset, bacprodxy_pts, is a data frame containing the geographical coordinates (longitude and latitude) of 25 sampling locations where bacterial production was measured in Lake St. Pierre (Québec, Canada). The samples were collected on August 18, 2005.
Usage
data(bacprodxy_pts)
Format
A data frame with 25 observations and 2 variables:
- Longitude
Longitude coordinate of the sampling point (numeric)
- Latitude
Latitude coordinate of the sampling point (numeric)
Details
The dataset name has been kept as 'bacprodxy_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the adespatial package version 0.3-28
Housing Sales in Baltimore, Maryland (1978)
Description
This dataset, 'baltimore_pts', is a data frame containing housing sales data and property characteristics for Baltimore, Maryland, in 1978. It has been widely used in spatial econometrics and hedonic regression studies. Each row corresponds to a house, including sale price, structural attributes, lot size, and geographic coordinates (X, Y) on the Maryland grid (projection type unknown).
Usage
data(baltimore_pts)
Format
A data frame with 211 observations and 17 variables:
- STATION
Census tract station identifier (integer)
- PRICE
House sale price (numeric)
- NROOM
Number of rooms (numeric)
- DWELL
Dwelling type indicator (numeric)
- NBATH
Number of bathrooms (numeric)
- PATIO
Presence of patio (numeric indicator)
- FIREPL
Presence of fireplace (numeric indicator)
- AC
Presence of air conditioning (numeric indicator)
- BMENT
Presence of basement (numeric indicator)
- NSTOR
Number of stories (numeric)
- GAR
Presence of garage (numeric indicator)
- AGE
Age of the dwelling (numeric)
- CITCOU
City/county indicator (numeric)
- LOTSZ
Lot size (numeric)
- SQFT
Interior square footage (numeric)
- X
X coordinate (numeric)
- Y
Y coordinate (numeric)
Details
The dataset consists of 211 observations (houses) and 17 variables.
The dataset name has been kept as 'baltimore_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4.
Boston Housing Data with Geographic Coordinates
Description
This dataset, boston_pts
, is a data frame containing information on housing values
and neighborhood characteristics in the Boston area. It is based on the classic dataset
by Harrison and Rubinfeld (1978), corrected for minor errors and augmented with the latitude
and longitude of the observations. Gilley and Pace also note that the MEDV
variable
is censored, with values at or over USD 50,000 set to USD 50,000.
Usage
data(boston_pts)
Format
A data frame with 506 observations and 20 variables:
- TOWN
Town name (factor with 92 levels)
- TOWNNO
Town number (integer)
- TRACT
Census tract number (integer)
- LON
Longitude (numeric)
- LAT
Latitude (numeric)
- MEDV
Median value of owner-occupied homes in USD 1,000s (numeric, censored at 50)
- CMEDV
Corrected median value of owner-occupied homes (numeric)
- CRIM
Per capita crime rate by town (numeric)
- ZN
Proportion of residential land zoned for lots over 25,000 sq.ft. (numeric)
- INDUS
Proportion of non-retail business acres per town (numeric)
- CHAS
Charles River dummy variable (factor: "0" = not bounded, "1" = bounded)
- NOX
Nitric oxides concentration (parts per 10 million, numeric)
- RM
Average number of rooms per dwelling (numeric)
- AGE
Proportion of owner-occupied units built prior to 1940 (numeric)
- DIS
Weighted distances to five Boston employment centers (numeric)
- RAD
Index of accessibility to radial highways (integer)
- TAX
Full-value property-tax rate per
$10,000
(integer)- PTRATIO
Pupil-teacher ratio by town (numeric)
- B
Proportion of Black residents, defined as 1000(Bk - 0.63)^2 (numeric)
- LSTAT
Percentage of lower status of the population (numeric)
Details
The dataset consists of 506 observations and 20 variables, including socio-economic,
environmental, and housing characteristics. Geographic coordinates (longitude and latitude)
are provided for spatial analysis. Related data objects include boston.utm
, a matrix
of tract point coordinates projected to UTM zone 19, and boston.soi
, a sphere of
influence neighbors list.
The dataset name has been kept as boston_pts
to avoid confusion with other datasets
in the R ecosystem. This naming convention helps distinguish this dataset as part of the
lightsf
package and assists users in identifying its specific characteristics.
The suffix pts
indicates that the dataset includes spatial point information.
The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4
World Coffee Production Data
Description
This dataset, coffee_poly
, is a tibble containing estimates of global coffee production
by country. The data represent thousands of 60 kg bags of coffee produced in 2016 and 2017.
It is intended for teaching purposes only and not for research use.
Usage
data(coffee_poly)
Format
A tibble with 47 observations and 3 variables:
- name_long
Country name (character)
- coffee_production_2016
Coffee production in 2016, in thousands of 60 kg bags (integer)
- coffee_production_2017
Coffee production in 2017, in thousands of 60 kg bags (integer)
Details
The dataset consists of 47 observations (countries) and 3 variables, including the country name and production values for two years. The data provide a simple example of tabular international production figures that can be used in spatial and non-spatial analyses.
The dataset name has been kept as coffee_poly
to avoid confusion with other datasets
in the R ecosystem. This naming convention helps distinguish this dataset as part of the
lightsf
package and assists users in identifying its specific characteristics.
The suffix poly
indicates that the dataset can be linked to polygon boundaries for mapping.
The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4
Columbus Neighborhood Data (1980)
Description
This dataset, columbus_poly
, is a data frame containing socioeconomic and housing
characteristics for 49 neighborhoods in Columbus, Ohio, based on 1980 data.
The dataset is widely used in spatial econometrics and geographic analysis.
Usage
data(columbus_poly)
Format
A data frame with 49 observations and 22 variables:
- AREA
Area of the neighborhood (numeric)
- PERIMETER
Perimeter of the neighborhood (numeric)
- COLUMBUS.
Identifier variable (integer)
- COLUMBUS.I
Identifier variable (integer)
- POLYID
Polygon ID (integer)
- NEIG
Neighborhood ID (integer)
- HOVAL
Housing value (numeric)
- INC
Household income (numeric)
- CRIME
Crime rate (numeric)
- OPEN
Open space (numeric)
- PLUMB
Plumbing quality (numeric)
- DISCBD
Distance to central business district (numeric)
- X
X coordinate of centroid (numeric)
- Y
Y coordinate of centroid (numeric)
- AREA
Area variable (numeric, duplicated)
- NSA
Neighborhood spatial attribute A (numeric)
- NSB
Neighborhood spatial attribute B (numeric)
- EW
East/West indicator (numeric)
- CP
Central place indicator (numeric)
- THOUS
Thousands of dollars (numeric)
- NEIGNO
Neighborhood number (numeric)
- PERIM
Perimeter variable (numeric, duplicated)
Details
In addition to the attributes, the original dataset also included a polygon list
of neighborhood boundaries, a centroid matrix, and a neighbor list object, although
these are not part of columbus_poly
. The matrix bbs
is deprecated but
retained in other packages for compatibility.
The dataset name has been kept as columbus_poly
to avoid confusion with other datasets
in the R ecosystem. This naming convention helps distinguish this dataset as part of the
lightsf
package and assists users in identifying its specific characteristics.
The suffix poly
indicates that the dataset can be linked to polygon boundaries.
The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4
Georeferenced Forest Fires in Chile (2016–2017 Season)
Description
This dataset, 'conafchile_pts', is a data frame containing georeferenced forest fire records and associated characteristics between July 1, 2016, and June 30, 2017. The dataset includes detailed information such as location, administrative codes, fire causes, vegetation affected, and surface area impacted. The data were compiled by CONAF and correspond to forest fires recorded in Chile.
Usage
data(conafchile_pts)
Format
A data frame with 5,234 observations and 30 variables:
- X
Index of the fire record (integer)
- temporada
Fire season (character, e.g., "2016-2017")
- codreg
Region code (integer)
- codprov
Province code (integer)
- codcom
Commune code (integer)
- ambito
Institutional scope (character, e.g., "Conaf")
- numero
Fire identification number (numeric)
- nombre_inc
Name of the fire incident (character)
- utm_este
UTM Easting coordinate (numeric)
- utm_norte
UTM Northing coordinate (numeric)
- inicio_c
Location of ignition (character)
- combus_i
Initial fuel type (character)
- causa_gene
General cause code (numeric)
- causa_espe
Specific cause code (character)
- pino_0010
Surface with pine (0–10 years old) affected (numeric)
- pino_11_17
Surface with pine (11–17 years old) affected (numeric)
- pino_18
Surface with pine (18+ years old) affected (numeric)
- eucalipto
Surface with eucalyptus affected (numeric)
- otras_plan
Surface with other plantations affected (numeric)
- total_plan
Total surface of plantations affected (numeric)
- arbolado
Surface of woodland affected (numeric)
- matorral
Surface of shrubland affected (numeric)
- pastizal
Surface of grassland affected (numeric)
- total_veg
Total surface of vegetation affected (numeric)
- agricola
Surface of agricultural land affected (numeric)
- desechos
Surface of waste material affected (numeric)
- total_otra
Total surface of other land use affected (numeric)
- sup_t_a
Total affected surface area (numeric)
- long
Longitude or projected coordinate (numeric)
- lat
Latitude or projected coordinate (numeric)
Details
The dataset name has been kept as 'conafchile_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The suffix 'pts' indicates that the dataset contains georeferenced point data. The original content has not been modified in any way.
Source
Data taken from Kaggle: https://www.kaggle.com/datasets/sandorabad/georeferenced-forestfires-2017-chile
Countries Latitude-Longitude Dataset
Description
This dataset, countries_pts, is a data frame containing information on 245 countries, including their names and geographical coordinates (latitude and longitude). It provides a simple reference for mapping and spatial analysis.
Usage
data(countries_pts)
Format
A data frame with 245 observations and 4 variables:
- country
Country code or identifier (character)
- latitude
Latitude of the country (numeric)
- longitude
Longitude of the country (numeric)
- name
Country name (character)
Details
The dataset name has been kept as 'countries_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from Kaggle: https://www.kaggle.com/datasets/arviinndn/countries
Cycle Hire Stations in London
Description
This dataset, cyclehire_pts
, is an sf
object containing point locations
of cycle hire stations across London. Each observation represents a hire point with
information about its name, area, number of available bikes, and number of empty
docking slots at the time of data collection.
Usage
data(cyclehire_pts)
Format
An sf
object with 742 observations and 6 variables:
- id
Station identifier (integer)
- name
Name of the station (factor)
- area
Area of London where the station is located (factor with 121 levels)
- nbikes
Number of bikes available (integer)
- nempty
Number of empty docking slots (integer)
- geometry
Point geometry in XY coordinates (
sfc_POINT
)
Details
The dataset name has been kept as cyclehire_pts
to avoid confusion with other datasets
in the R ecosystem. This naming convention helps distinguish this dataset as part of the
lightsf
package and assists users in identifying its specific characteristics.
The suffix pts
indicates that the dataset contains point geometries.
The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4
Washington, D.C. Census Tract Data (ACS 2020)
Description
This dataset, 'dc_poly', is an 'sf' object containing population and median household income information for census tracts in Washington, D.C., based on the 2020 American Community Survey (ACS). It also includes spatial polygon geometries, allowing the data to be used directly for mapping and spatial analysis, such as creating choropleth maps of demographic and socioeconomic indicators.
Usage
data(dc_poly)
Format
An 'sf' data frame with 206 observations and 5 variables:
- GEOID
Unique identifier for the census tract (character)
- NAME
Census tract name and jurisdiction (character)
- geometry
Polygon geometry representing the tract boundaries ('
sfc_POLYGON
')- B01003_001
Total population of the tract (numeric)
- B19013_001
Median household income of the tract (numeric, in USD)
Details
The dataset consists of 206 observations (census tracts) and 5 variables. The geometry column contains polygon boundaries for each tract.
The dataset name has been kept as 'dc_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the bivariateLeaflet package version 0.1.0
California Housing Prices (1990 Census)
Description
This dataset, 'housing_pts', is a data frame containing information on median house prices for California districts, derived from the 1990 census. It includes geographic coordinates, demographic and housing characteristics, and district-level income and housing attributes. The dataset consists of 20,640 observations and 10 variables. Missing values may be present in some variables.
Usage
data(housing_pts)
Format
A data frame with 20,640 observations and 10 variables:
- longitude
Longitude coordinate of the district (numeric)
- latitude
Latitude coordinate of the district (numeric)
- housing_median_age
Median age of houses in the district (numeric)
- total_rooms
Total number of rooms in the district (numeric)
- total_bedrooms
Total number of bedrooms in the district (numeric)
- population
Population of the district (numeric)
- households
Number of households in the district (numeric)
- median_income
Median income in the district (numeric)
- median_house_value
Median house value in the district (numeric, in US dollars)
- ocean_proximity
Proximity of the district to the ocean (character string categories)
Details
The dataset name has been kept as 'housing_pts' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of your package and assists users in identifying its specific characteristics. The suffix 'pts' indicates that the dataset contains georeferenced point data. The original content has not been modified in any way.
Source
Data taken from Kaggle: https://www.kaggle.com/datasets/camnugent/california-housing-prices
Mastigouche Lake Network Data Set
Description
This dataset, mastigouche_poly, is a list containing spatial and network information for 42 lakes in the Mastigouche region. The dataset includes the XY geographical coordinates of the lakes and a site-by-edge matrix describing how the lakes influence each other. The network is defined by 66 directional edges of influence between the lakes.
Usage
data(mastigouche_poly)
Format
A list with 2 elements:
- xy
A data frame with 42 observations and 2 variables: X (numeric), Y (numeric) coordinates of the lakes
- siteEdge
An integer site-by-edge matrix describing 66 edges of influence among lakes
Details
The dataset name has been kept as 'mastigouche_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the adespatial package version 0.3-28
Mildly Clustered Points in North Carolina, United States
Description
This dataset, 'nc_points', is a data frame containing a set of spatial point coordinates representing mildly clustered points in North Carolina, United States. The dataset consists of 2,304 observations and 2 variables, corresponding to the X and Y coordinates of the points. The data can be used for examples of point pattern analysis, clustering, or spatial statistics.
Usage
data(nc_points)
Format
A data frame with 2,304 observations and 2 variables:
- X
X coordinate (numeric)
- Y
Y coordinate (numeric)
Details
The dataset name has been kept as 'nc_points' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the 'lightsf' package and assists users in identifying its specific characteristics. The suffix does not include '_df' because the dataset primarily represents a spatial point pattern rather than general tabular survey data. The original content has not been modified in any way.
Source
Data taken from the chopin package version 0.9.4
World Bank Socioeconomic Indicators by Country
Description
This dataset, worldbank_poly, is a data frame containing selected socioeconomic indicators compiled from the World Bank. The dataset includes 177 observations (countries) and 7 variables such as Human Development Index (HDI), urban population percentage, unemployment rate, population growth, and literacy rate. Some values may be missing.
Usage
data(worldbank_poly)
Format
A data frame (tibble) with 177 observations and 7 variables:
- name
Country name (character)
- iso_a2
ISO 2-letter country code (character)
- HDI
Human Development Index (numeric)
- urban_pop
Urban population percentage (numeric)
- unemployment
Unemployment rate (numeric)
- pop_growth
Population growth rate (numeric)
- literacy
Literacy rate (numeric)
Details
The dataset name has been kept as 'worldbank_poly' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the lightsf package and assists users in identifying its specific characteristics. The original content has not been modified in any way.
Source
Data taken from the spData package version 2.3.4