To support decision-making in an office development project, web scraping was used to collect up-to-date data on London office rentals from HubbleHQ, filling gaps left by traditional datasets. Python’s Selenium library automated interactions to handle dynamic content loading, enabling comprehensive data collection. Key metrics included rental prices, desk availability, and spatial configurations. Challenges like dynamic page loading, missing data, and incomplete geolocation were addressed through automated scrolling, data cleaning, and location approximation methods.
By leveraging web scraping for real-time data acquisition, this project created a detailed dataset on London office rentals, supporting competitive analysis and pricing strategies. The automated approach enabled the capture of dynamic, accurate market data, demonstrating the value of up-to-date information in commercial real estate decision-making.
Figure 1: HubbleHQ interface showing London office rental listings with location markers and filters for office type, team size, and price.
A structured approach using BeautifulSoup and Requests allowed for targeted data extraction. The target site was HubbleHQ, and scraping was carried out in compliance with the site’s robots.txt guidelines.
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Send request and parse HTML
url = 'https://hubblehq.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract key data fields
offices = soup.find_all('div', class_='office-listing')
data = []
for office in offices:
    name = office.find('h2').text
    price = office.find('span', class_='price').text
    location = office.find('span', class_='location').text
    data.append([name, price, location])
# Save data to CSV
df = pd.DataFrame(data, columns=['OfficeName', 'Price', 'Location'])
df.to_csv('office_data.csv', index=False)
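Compliance with robots.txt can also be checked programmatically before scraping. The sketch below uses Python's standard urllib.robotparser; the rules shown are illustrative placeholders, not HubbleHQ's actual robots.txt, which should be fetched from the live site in practice.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt offline; in practice, point the parser at
# https://hubblehq.com/robots.txt and call rp.read() instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://hubblehq.com/"))           # allowed
print(rp.can_fetch("*", "https://hubblehq.com/private/x"))  # disallowed
```

Running this check once at startup and aborting when `can_fetch` returns False keeps the scraper within the site's stated access policy.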
To ensure consistency across the dataset, I normalized monetary values and standardized spatial metrics:
Figure 2: Distribution of office spaces in London based on annual rent per square meter and average desk sizes.
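The normalization step might look like the following sketch, which parses price strings into numeric GBP values and converts areas from square feet to square metres. The column names and price formats here are illustrative assumptions, not the exact schema of the scraped CSV.

```python
import re
import pandas as pd

SQFT_TO_SQM = 0.092903  # 1 sq ft in sq m

def parse_price(price_str):
    """Pull the numeric part out of strings like '£1,200 /mo'; returns None for 'N/A'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", str(price_str))
    return float(match.group().replace(",", "")) if match else None

# Toy rows; real column names come from the scraped dataset.
df = pd.DataFrame({
    "Price": ["£1,200 /mo", "£950 /mo", "N/A"],
    "AreaSqFt": [450.0, 300.0, 520.0],
})
df["PriceGBP"] = df["Price"].apply(parse_price)
df["AreaSqM"] = df["AreaSqFt"] * SQFT_TO_SQM
```

Keeping prices as floats in a single currency and areas in a single unit makes the later per-square-metre and per-person comparisons straightforward.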
Selenium was used to automate scrolling and simulate user interactions so that all dynamically loaded listings could be scraped.
# Import necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
# Initialize Selenium WebDriver (e.g., for Chrome)
driver = webdriver.Chrome()
# Navigate to HubbleHQ's listings page
driver.get("https://hubblehq.com/")
# Challenge 1: Dynamic Content - Scroll to load all listings
# Loop to scroll to the bottom of the page multiple times to load all dynamic content
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait to allow listings to load
    # Check if more content loaded
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Challenge 2: Extracting data while handling missing values
# Wait until listings are present before extracting
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "office-listing"))
)
listings = driver.find_elements(By.CLASS_NAME, "office-listing")
data = []
for listing in listings:
    try:
        # Extract information from each listing
        name = listing.find_element(By.TAG_NAME, 'h2').text
        price = listing.find_element(By.CLASS_NAME, 'price').text
        location = listing.find_element(By.CLASS_NAME, 'location').text
        # Handle missing prices or locations by setting default values
        if not price:
            price = "N/A"
        if not location:
            location = "Unknown"
        data.append([name, price, location])
    except Exception as e:
        print(f"Error extracting data from listing: {e}")
# Convert data to a DataFrame for analysis
df = pd.DataFrame(data, columns=['OfficeName', 'Price', 'Location'])
# Challenge 3: Geolocation Issues
# For listings with incomplete geolocation data, fill in with nearest known data point
def approximate_location(location):
    # Placeholder function to estimate missing geolocation data
    return "Approximate coordinates based on nearby location"

df['Coordinates'] = df['Location'].apply(lambda x: approximate_location(x) if x == "Unknown" else "Exact coordinates")
# Save data to CSV for further analysis
df.to_csv('office_data_cleaned.csv', index=False)
# Close the browser
driver.quit()
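The `approximate_location` placeholder above could, for instance, fall back to borough centroids when exact coordinates are missing. The coordinates below are rough approximations included only for illustration.

```python
# Rough borough centroid coordinates (lat, lon) -- approximate values for illustration.
BOROUGH_CENTROIDS = {
    "Camden": (51.529, -0.126),
    "Westminster": (51.497, -0.136),
    "Wandsworth": (51.457, -0.182),
}

def approximate_location(location):
    # Fall back to the borough centroid; returns None when the borough is unknown too.
    return BOROUGH_CENTROIDS.get(location)
```

A centroid fallback keeps every listing mappable at borough granularity, at the cost of losing street-level precision for those rows.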
With the cleaned dataset, I conducted analyses to reveal trends in office pricing:
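An aggregation along these lines can be sketched with pandas; the rows and column names below are hypothetical stand-ins for the cleaned dataset's schema.

```python
import pandas as pd

# Hypothetical cleaned rows; column names are illustrative, not the exact CSV schema.
df = pd.DataFrame({
    "Borough": ["Camden", "Camden", "Westminster", "Wandsworth"],
    "PricePerPerson": [550.0, 610.0, 720.0, 430.0],
    "Desks": [12, 30, 45, 8],
})

# Average price per person and total desk capacity, per borough
summary = (
    df.groupby("Borough")
      .agg(avg_price=("PricePerPerson", "mean"), total_desks=("Desks", "sum"))
      .sort_values("avg_price", ascending=False)
)
```

Sorting by average price immediately surfaces the premium boroughs, which is the comparison the figures below visualize.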
This figure presents a spatial and quantitative analysis of office rental values in the Camden area. The chart is divided into two scatter plots on the left, showing rental prices by unit and office, respectively, and a map on the right that indicates the spatial distribution of office listings in Camden. The size and colour of the dots represent the number of desks or people the office space can accommodate, with larger and darker dots indicating higher capacities.
Figure 3: Analysis of office rental value in Camden, showing spatial density and rental prices by unit and office
This scatter plot illustrates the correlation between the average annual rent per square meter and the average monthly price per person across various office spaces in London. Each dot represents a different office, with dot sizes corresponding to the average size per desk. The color gradient indicates rent levels, with warmer tones (e.g., orange) representing higher rent values.
Figure 4: Correlation between average rent per square meter and price per person in various London office spaces.
This figure shows office rental data for Westminster, focusing on average rental prices per square meter, average monthly price per person, and desk size per office. Each point on the scatter plot represents an office location, with dot size reflecting the average size per desk and color coding by specific areas within Westminster.
Figure 5: Analysis of office rentals in Westminster with insights into pricing and average desk size by location
This figure provides a detailed view of office rental values in Wandsworth, using two scatter plots and a spatial map. The scatter plots illustrate rental values by unit and by office, with dot sizes representing desk or people capacity. The map segment highlights the spatial density of office rentals in Wandsworth, with larger and colored circles indicating higher desk capacities in specific subregions.
Figure 6: Detailed office rental value analysis in Wandsworth, illustrating the distribution of prices and desk sizes across various offices
This project demonstrated the utility of web scraping in capturing dynamic market data for informed decision-making in office space rentals. By structuring and analyzing the data, I could assess competitive pricing strategies and identify market opportunities, particularly in high-demand areas like Westminster. Web scraping offers a practical solution for obtaining real-time market insights, though it requires adherence to ethical and legal standards.
While this project focused on London, the approach can be scaled to other cities where office rental markets are rapidly changing. Future analysis could include longitudinal studies to track market changes or compare London’s office rental trends with those in other major business hubs.