Recently, I had the fun experience of converting a CSV file containing the names of a bunch of cities in Australia. The CSV file had three columns:
- City name
- Latitude (in DMS)
- Longitude (in DMS)
There were more than 5,000 rows in the CSV file. I wanted to pull a list of all the cities and determine which state each one was in.
Instead of using the Google Maps API (and getting charged ~$35), I decided to hunt down a way to use open data provided by the Australian Bureau of Statistics (ABS) to add the state name to the list of cities.
The ABS has a downloadable ESRI Shapefile, which is essentially a vector map of all the state borders in Australia. This, combined with a little data cleaning using the pandas-based DataPrep library and a quick-and-dirty R script, would let me add the state name to the CSV.
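Before diving in, it can help to confirm the shapefile actually carries the attribute columns we need. Here is a quick sketch in Python (assuming geopandas is installed; the real spatial join happens in R further down, and the column names come from the R script there):

```python
import geopandas as gpd

# Load the ABS shapefile; each row is a boundary polygon
states = gpd.read_file("1270055004_sos_2016_aust_shape/SOS_2016_AUST.shp")

# STE_NAME16 holds the state name, SOS_NAME16 the
# section-of-state classification used later in the R script
print(states[["STE_NAME16", "SOS_NAME16"]].head())
```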
So, first we start with the raw CSV file that I was given:
| | Suburb Name | lat | long |
|---|---|---|---|
| 0 | Aberfoyle | 30° 21’30″S | 152° 2’30″E |
| 1 | Adaminaby | 35° 58’30″S | 148° 51’0″E |
| 2 | Adelaide | 34° 56’0″S | 138° 35’30″E |
| 3 | Adelong | 35° 18’35″S | 148° 1’0″E |
| 4 | Agnes Water | 24° 17’0″S | 151° 49’0″E |
| … | … | … | … |
| 3095 | Yumali | 35° 28’0″S | 139° 51’30″E |
| 3096 | Yuna | 28° 23’0″S | 114° 54’30″E |
| 3097 | Yuna East | 28° 25’30″S | 115° 10’0″E |
| 3098 | Yunta | 32° 35’0″S | 139° 28’0″E |
| 3099 | Zamia Creek | 24° 32’30″S | 149° 36’0″E |
```python
import pandas as pd
from dataprep.clean import clean_lat_long

# Load the raw CSV with DMS-formatted coordinates
df = pd.read_csv("city_locations.csv")

# Convert the lat/long columns from DMS to decimal degrees,
# producing the new columns lat_clean and long_clean
cleaned = clean_lat_long(df, lat_col="lat", long_col="long", split=True)

# Every Australian location is south of the equator (negative latitude)
# and east of the prime meridian (positive longitude), so these filters
# also drop any rows DataPrep could not parse (NaN fails both tests)
cleaned = cleaned[cleaned.lat_clean.lt(0)]
cleaned = cleaned[cleaned.long_clean.gt(0)]

cleaned.to_csv("cleaned.csv")
```
Running this script gave me an output file with the latitude and longitude converted from DMS (degrees, minutes and seconds) to decimal values, skipping any rows that could not be parsed, ready for the next processing step.
| | Suburb Name | lat | long | lat_clean | long_clean |
|---|---|---|---|---|---|
| 0 | Aberfoyle | 30° 21’30″S | 152° 2’30″E | -30.3583 | 152.0417 |
| 1 | Adaminaby | 35° 58’30″S | 148° 51’0″E | -35.9750 | 148.8500 |
| 2 | Adelaide | 34° 56’0″S | 138° 35’30″E | -34.9333 | 138.5917 |
| 3 | Adelong | 35° 18’35″S | 148° 1’0″E | -35.3097 | 148.0167 |
| 4 | Agnes Water | 24° 17’0″S | 151° 49’0″E | -24.2833 | 151.8167 |
| … | … | … | … | … | … |
| 3092 | Yumali | 35° 28’0″S | 139° 51’30″E | -35.4667 | 139.8583 |
| 3093 | Yuna | 28° 23’0″S | 114° 54’30″E | -28.3833 | 114.9083 |
| 3094 | Yuna East | 28° 25’30″S | 115° 10’0″E | -28.4250 | 115.1667 |
| 3095 | Yunta | 32° 35’0″S | 139° 28’0″E | -32.5833 | 139.4667 |
| 3096 | Zamia Creek | 24° 32’30″S | 149° 36’0″E | -24.5417 | 149.6000 |
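As a sanity check on the conversion: a decimal coordinate is degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal hand-rolled sketch (the `dms_to_decimal` helper is hypothetical, not part of DataPrep):

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert DMS to decimal degrees; S and W hemispheres become negative."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

# Aberfoyle's latitude, 30° 21'30″S, works out to roughly -30.3583,
# matching the lat_clean value DataPrep produced in the table above
print(dms_to_decimal(30, 21, 30, "S"))  # -30.358333...
```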
Using R, we import the Shapefile from the ABS, import the cleaned CSV file with decimal lat/long points, and use st_intersects from the sf package to determine which state a given city is in, based on its lat/long location.
```r
library(sf)
library(dplyr)

# Load the ABS shapefile of Australian state borders
map <- read_sf("1270055004_sos_2016_aust_shape/SOS_2016_AUST.shp")

# Load the cleaned coordinates and convert them to sf points,
# reusing the shapefile's coordinate reference system
latLong <- read.csv(file = "cleaned.csv")
crsFormat <- st_crs(map)
pnts_sf <- st_as_sf(latLong, coords = c("long_clean", "lat_clean"), crs = crsFormat)

# For each point, find the index of the polygon it falls inside, then
# look up that polygon's state name (STE_NAME16) and section-of-state
# classification (SOS_NAME16); points outside every polygon get ""
pnts <- pnts_sf %>% mutate(
  intersection = as.integer(st_intersects(geometry, map)),
  area = if_else(is.na(intersection), "", map$STE_NAME16[intersection]),
  size = if_else(is.na(intersection), "", map$SOS_NAME16[intersection])
)

write.csv(pnts, "with-intersections.csv")
```
This should leave you with a CSV file that contains, for each city, the name of the state its point falls in.
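As a quick way to eyeball the result, you could load the joined CSV back into pandas and count cities per state. A minimal sketch (the `area` column name comes from the R script above):

```python
import pandas as pd

# Load the spatially joined output and tally cities per state;
# rows whose point missed every polygon have an empty 'area'
result = pd.read_csv("with-intersections.csv")
print(result["area"].value_counts())
```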