terracommunis

Scraping Certified BREEAM Assessments Data

Sustainability has become a paramount concern in contemporary architecture and construction, driving the adoption of various frameworks to evaluate and enhance the environmental performance of buildings. Among these, the Building Research Establishment Environmental Assessment Method (BREEAM) stands out as a leading international certification system, particularly influential in the UK and Europe. This study leverages advanced data science techniques to scrape, process, and analyze approximately 40,000 BREEAM-certified assessments from the GreenBookLive database. The objective is to uncover temporal and spatial trends, sectoral distributions, and regional concentrations of certified buildings. By doing so, the research provides actionable insights for architects, developers, and policymakers, supporting informed decision-making aligned with the UK’s ambitious net-zero targets.

Executive Summary

Key findings indicate a significant increase in high-level certifications (“Excellent” and “Outstanding”) over the past five years, with the UK and the Netherlands leading certification efforts. The geographic distribution of BREEAM certifications reveals these countries’ pivotal role in driving sustainability standards within Europe, while adoption is increasing in regions such as Asia (e.g., China) and the Middle East (Figure 1). Emerging clusters in the Netherlands reflect its strong commitment to sustainability, while the commercial and residential sectors dominate certifications across other regions. These insights underscore the importance of targeted policies and incentives to promote sustainable practices in underrepresented areas, contributing to the broader goal of environmental sustainability in the built environment.

Figure 1: Global map showing BREEAM-certified project locations, colour-coded by use class (e.g., Residential, Commercial).

Methodology

The study employs a comprehensive methodology encompassing data extraction, cleaning, analysis, and visualization to achieve its objectives. The following sections outline the detailed processes involved:

1. Data Collection

Web-Scraping Process

The data extraction process involved sophisticated web scraping techniques with Python to retrieve BREEAM certification data from the GreenBookLive database. Given the website’s dynamic nature, a combination of HTTP requests and browser automation via Selenium was employed to effectively interact with and extract data from dynamically loaded content. Selenium is a web automation tool that allows programmatic interaction with web browsers, simulating user actions such as clicking and scrolling, which is essential for scraping dynamic websites.

Data Description

The dataset comprises information on BREEAM-certified buildings, detailing certifications, ratings, locations, and other attributes. Below is the detailed schema of the dataset:

| Field Name | Data Type | Description |
| --- | --- | --- |
| Building/Asset Name | String | Name of the certified building or asset |
| Client/Developer | String | Name of the client or developer |
| Scheme | String | Certification scheme applied to the building (e.g., In-Use International) |
| Rating/Score | String | Descriptive rating and score for the building (e.g., Very good 58.7%) |
| Rating | String | Final rating achieved (e.g., Very good, Good, Pass) |
| Score | Float | Numerical score achieved in the certification (e.g., 58.70%) |
| Stage/Valid Until | Date | Date indicating until when the certification is valid |
| Certification Number | String | Unique certification number assigned to the building |
| Assessor/Auditor | String | Name of the assessor or auditor responsible for the certification |
| Town/Postcode/Zipcode | String | Town and postcode/zipcode where the building is located |
| Country | String | Country where the building is located |
| NSO | String | National Scheme Operator overseeing the certification |
| Other Information | String | Additional information regarding the asset or project |
| Project Type | String | Type of project (e.g., Offices, Retail, Industrial) |
| Rating (%) | Float | Certification rating expressed as a percentage |
| Latitude | Float | Geographical latitude of the building |
| Longitude | Float | Geographical longitude of the building |

Table 1: Data Schema of BREEAM-Certified Buildings Dataset.
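The schema suggests that the `Rating` and `Score` fields are derived from the combined `Rating/Score` string. The exact parsing used in the study is not shown, so the following is only a minimal sketch of one way to do it with Pandas; the sample rows and the regular expression are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows mimicking the "Rating/Score" field in Table 1.
raw = pd.DataFrame(
    {"Rating/Score": ["Very good 58.7%", "Excellent 72%", "Pass", None]}
)

# Capture the descriptive rating and, if present, a trailing percentage.
pattern = r"^(?P<Rating>.*?)(?:\s+(?P<Score>\d+(?:\.\d+)?)\s*%)?$"
parts = raw["Rating/Score"].str.extract(pattern)

raw["Rating"] = parts["Rating"].str.strip()
raw["Score"] = pd.to_numeric(parts["Score"])  # rows without a score become NaN
```

Rows that carry only a rating, such as "Pass", keep the rating and receive a missing score rather than being dropped.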

Python Libraries:

- `requests` for handling HTTP requests and fetching static content.
- `lxml` for parsing HTML and extracting data fields.
- `Selenium` with the Chrome WebDriver for automating browser interactions and handling JavaScript-rendered content.
- `Pandas` for data cleaning, manipulation, and analysis.
- `Folium` and `GeoPandas` for geospatial analysis and visualization.

Web Scraping Implementation:

Implementation Steps:

  1. Selenium Configuration: Selenium WebDriver was configured with headless Chrome options to automate browser interactions without a graphical interface.
  2. Page Navigation: Automated scripts navigated through multiple pages of the GreenBookLive database, handling pagination controls to ensure comprehensive data collection.
  3. Data Extraction: Relevant data fields such as building name, location, certification date, certification level, and assessor information were identified and extracted from the HTML content using lxml.
  4. Data Storage: The extracted data was initially stored in CSV files for intermediate storage and later imported into Pandas DataFrames for further processing.
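The steps above can be sketched roughly as follows. This is not the study's actual scraper: the function names, the table layout, and the sample HTML are hypothetical, and the real GreenBookLive pages may be structured differently. The sketch only shows a headless Chrome configuration for Selenium 4 and a table parser built on `lxml`.

```python
from lxml import html


def make_headless_driver():
    """Create a headless Chrome WebDriver (needs Selenium 4 and Chrome installed)."""
    # Imported lazily so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def parse_results_table(page_source):
    """Turn the table rows found in the page into a list of row dicts."""
    tree = html.fromstring(page_source)
    rows = tree.xpath("//table//tr")
    if not rows:
        return []
    headers = [th.text_content().strip() for th in rows[0].xpath("./th")]
    records = []
    for row in rows[1:]:
        cells = [td.text_content().strip() for td in row.xpath("./td")]
        if len(cells) == len(headers):  # skip malformed or spacer rows
            records.append(dict(zip(headers, cells)))
    return records


# Hypothetical page content standing in for driver.page_source.
SAMPLE_PAGE = """
<html><body><table>
  <tr><th>Building/Asset Name</th><th>Rating</th><th>Country</th></tr>
  <tr><td>Example House</td><td>Excellent</td><td>United Kingdom</td></tr>
  <tr><td>Sample Tower</td><td>Very good</td><td>Netherlands</td></tr>
</table></body></html>
"""
records = parse_results_table(SAMPLE_PAGE)
```

In a live run, the driver would load each results page, `driver.page_source` would be passed to `parse_results_table`, and the accumulated records could be written to CSV with `pandas.DataFrame(records).to_csv(...)` before moving to the next page.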

2. Data Cleaning & Processing

Once the raw data was collected, extensive preprocessing was necessary to ensure accuracy and consistency, enabling reliable analysis.

Data Cleaning:

Categorization:

3. Data Analysis and Insights

Exploratory Data Analysis was conducted to uncover patterns, trends, and insights within the BREEAM certifications dataset.

Figure 2: Pie chart depicting the number of office BREEAM certifications in the UK for each year from 2013 to 2022, colour-coded by rating.

Sectoral and Regional Distribution:

The bar charts illustrate the number of “Outstanding” BREEAM certifications in the Office and Industrial sectors over time. The charts break down certifications by country, providing clear insights into the trends across different regions and building types.
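The counts behind charts of this kind can be produced with a simple group-by. The snippet below is a minimal sketch on made-up records; the `Year` column is an assumption (for instance, a year derived from a certificate date) and is not part of the schema in Table 1.

```python
import pandas as pd

# Hypothetical records; "Year" is an assumed column, not a field from Table 1.
df = pd.DataFrame(
    {
        "Project Type": ["Offices", "Offices", "Industrial", "Offices", "Offices"],
        "Rating": ["Outstanding", "Excellent", "Outstanding", "Outstanding", "Outstanding"],
        "Country": ["United Kingdom", "United Kingdom", "Netherlands", "Netherlands", "United Kingdom"],
        "Year": [2021, 2021, 2021, 2022, 2021],
    }
)


def outstanding_by_year_and_country(frame, project_type):
    """Count "Outstanding" certifications per year (rows) and country (columns)."""
    mask = (frame["Rating"] == "Outstanding") & (frame["Project Type"] == project_type)
    return frame[mask].groupby(["Year", "Country"]).size().unstack(fill_value=0)


offices = outstanding_by_year_and_country(df, "Offices")
```

The resulting table has one row per year and one column per country, which maps directly onto a grouped or stacked bar chart.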

Offices: Outstanding BREEAM Certifications by Year and Country

Figure 3: Bar chart showing the number of “Outstanding” certifications in the Office sector.

Industrial: Outstanding BREEAM Certifications by Year and Country

Figure 4: Bar chart showing the number of “Outstanding” certifications in the Industrial sector.

Sectoral and Geospatial Analysis

The data was segmented by building type and region to uncover sector-specific trends and geospatial patterns. The analysis focused on commercial, residential, industrial, and mixed-use buildings across different regions, allowing for detailed insights into how sustainability certifications vary by sector and geography.

Interactive maps were generated using Folium and GeoPandas to display the density and spatial distribution of BREEAM-certified buildings. Commercial buildings in London and the South East demonstrated particularly high certification rates.

Figure 5: Detailed map of London showing BREEAM-certified project locations, colour-coded by use class.
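A map of this kind could be built along the following lines. This is a minimal sketch rather than the study's code: the colour palette and function names are assumptions, and the column names are taken from Table 1. The default centre is central London.

```python
# Assumed colour per project type; the palette used in the study is not given.
USE_CLASS_COLOURS = {"Offices": "blue", "Retail": "green", "Industrial": "orange"}


def colour_for(project_type):
    """Return the marker colour for a project type, grey if it is unknown."""
    return USE_CLASS_COLOURS.get(project_type, "gray")


def build_map(frame, centre=(51.5074, -0.1278), zoom_start=10):
    """Plot one circle marker per geocoded building on a Folium map."""
    import folium  # imported here so colour_for has no Folium dependency

    fmap = folium.Map(location=list(centre), zoom_start=zoom_start)
    geocoded = frame.dropna(subset=["Latitude", "Longitude"])
    for record in geocoded.to_dict("records"):
        folium.CircleMarker(
            location=[record["Latitude"], record["Longitude"]],
            radius=4,
            color=colour_for(record["Project Type"]),
            fill=True,
            popup=str(record["Building/Asset Name"]),
        ).add_to(fmap)
    return fmap
```

Calling `build_map(df).save("breeam_map.html")` on a Pandas DataFrame with those columns would write an interactive HTML map; the output file name is only an example.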

The distribution of certifications across the UK and Europe was further explored, revealing sector-specific trends. The maps showed that while London and the South East lead in commercial certifications (Figure 6), Scotland, the North West, and regions across Europe have seen increasing certifications in residential, mixed-use, and industrial projects.

Figure 6: Detailed map of Europe and the UK showing BREEAM-certified project locations, colour-coded by use class.

4. Cluster Analysis with Machine Learning

To gain deeper insights into the relationships between building type, certification levels, and geographic distribution, unsupervised machine learning techniques such as K-Means Clustering and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) were applied. These methods helped reveal patterns and form distinct groups within the dataset based on features such as certification level, location, and building type.

Figure 7: Clustering of BREEAM-Certified Projects Across Europe and the UK Using K-Means and DBSCAN, Colour-Coded by Use Class

K-Means Clustering:

Objective:
The goal of K-Means clustering was to group buildings based on their certification rating, type, and geographic location. By identifying distinct clusters, the analysis aimed to reveal patterns in certification performance and geographic concentration.

Feature Engineering:
Key features used in the clustering model included:

These features were standardized to ensure uniformity across the dataset, ensuring that no single feature disproportionately influenced the clustering results.
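Standardization followed by K-Means can be sketched as below. The feature list is not reproduced in this write-up, so the columns used here (coordinates, score, and a one-hot encoded project type) are assumptions, as are the sample values and the choice of two clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: two geographic groups with different scores and types.
df = pd.DataFrame(
    {
        "Latitude": [51.50, 51.52, 51.48, 52.37, 52.36, 52.38],
        "Longitude": [-0.12, -0.10, -0.14, 4.90, 4.88, 4.91],
        "Score": [58.7, 61.0, 60.2, 86.5, 88.0, 85.1],
        "Project Type": ["Offices", "Offices", "Retail",
                         "Industrial", "Industrial", "Industrial"],
    }
)

# One-hot encode the categorical column so every feature is numeric.
features = pd.get_dummies(df, columns=["Project Type"]).astype(float)

# Rescale each column to zero mean and unit variance so that no single
# feature (for example, a score on a 0-100 scale) dominates the distances.
X = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(X)
```

Without the scaling step, the `Score` column would dwarf the coordinate differences in the Euclidean distance that K-Means minimizes.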

Cluster Determination:
The optimal number of clusters (k) for K-Means was identified using several techniques:

  1. Elbow Method: This method plots the total within-cluster sum of squares (inertia) against different values of k. The optimal number of clusters is determined by identifying the “elbow point,” where the graph starts to flatten, indicating diminishing returns with additional clusters.

Figure 8: Elbow Method for Determining Optimal Number of Clusters (k).

  2. Silhouette Scores: This metric measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, with higher values indicating better-defined clusters. It was used to validate the optimal number of clusters by measuring cluster cohesion and separation.

Figure 9: Silhouette Score indicating the cohesion and separation of clusters.

  3. Calinski-Harabasz Index (Variance Ratio Criterion): This index is the ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters.

Figure 10: Calinski-Harabasz Index representing cluster compactness.

  4. Davies-Bouldin Index: This metric evaluates cluster separation by comparing the scatter within each cluster with the distance between cluster centroids. Lower values of the Davies-Bouldin index indicate better separation between clusters.

These methods helped ensure meaningful and distinct groupings of the buildings based on their certification, type, and location.

Figure 11: Davies-Bouldin Index measuring cluster separation.
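The four criteria above can be computed side by side for a range of k values with scikit-learn. The sketch below uses synthetic, well-separated blobs as a stand-in for the standardized feature matrix, so the numbers it produces say nothing about the BREEAM data; the range of k is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the feature matrix: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=0.5,
    random_state=0,
)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": model.inertia_,  # plotted against k for the elbow method
        "silhouette": silhouette_score(X, model.labels_),  # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, model.labels_),  # lower is better
    }

best_k_silhouette = max(results, key=lambda k: results[k]["silhouette"])
best_k_calinski_harabasz = max(results, key=lambda k: results[k]["calinski_harabasz"])
best_k_davies_bouldin = min(results, key=lambda k: results[k]["davies_bouldin"])
```

On real data the criteria do not always agree, which is one reason to inspect all of them together rather than rely on a single score.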

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Objective:
While K-Means is useful for partitioning data into a fixed number of clusters, DBSCAN was applied to identify clusters of varying density, particularly useful for spatial data. It allowed for the discovery of clusters without needing to predefine the number of clusters, making it well-suited for detecting patterns in geographic data.

Parameter Tuning:

Figure 12: Noise Points vs eps for Different min_samples in DBSCAN.

Figure 13: Number of Clusters vs eps for Different min_samples in DBSCAN.
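Curves like those in Figures 12 and 13 come from running DBSCAN over a grid of `eps` and `min_samples` values and recording the number of clusters and noise points each time. The following is a minimal sketch on synthetic points; the parameter grid and the data are assumptions, not the values used in the study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in: two dense groups plus three isolated outliers.
dense_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
dense_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
outliers = np.array([[2.5, 2.5], [-3.0, 4.0], [8.0, -1.0]])
X = np.vstack([dense_a, dense_b, outliers])


def sweep_dbscan(X, eps_values, min_samples_values):
    """Record cluster and noise counts for each (eps, min_samples) pair."""
    rows = []
    for min_samples in min_samples_values:
        for eps in eps_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # DBSCAN labels noise points as -1, which is not a cluster.
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = int(np.sum(labels == -1))
            rows.append(
                {"eps": eps, "min_samples": min_samples,
                 "n_clusters": n_clusters, "n_noise": n_noise}
            )
    return rows


summary = sweep_dbscan(X, eps_values=[0.05, 0.5, 1.0], min_samples_values=[5, 10])
```

This sketch uses Euclidean distance on the raw points. For latitude and longitude, scikit-learn's DBSCAN also accepts `metric="haversine"` with coordinates in radians, which may be a better fit for geographic distances.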

Evaluation of Clusters:

Inertia:
For K-Means, inertia (the sum of squared distances of samples to their closest cluster center) measures how tightly each cluster is packed. Lower inertia indicates tighter clusters, but because inertia tends to keep decreasing as k grows, it was evaluated in combination with the other metrics to avoid selecting too many clusters.

Insights:
The clustering analysis provided the following key insights:

In summary, the combination of K-Means and DBSCAN, alongside the use of multiple evaluation metrics such as the Elbow Method, Silhouette Scores, Calinski-Harabasz Index, Davies-Bouldin Index, and inertia, provided robust and insightful clustering results. These insights help identify regions and sectors leading in sustainability and reveal geographic trends in building certification that can inform future urban planning and policy development.

5. Challenges and Solutions

The project involved scraping structured data from a relatively simple website, but there were a few notable challenges related to data collection and inconsistencies, particularly for non-UK/European projects.

Conclusion

This study demonstrates the efficacy of data-driven approaches in understanding the dynamics of sustainable construction through the lens of BREEAM certifications. The significant growth in high-level certifications underscores the industry’s commitment to environmental sustainability, while regional and sectoral analyses highlight areas of strength and opportunities for targeted policy interventions. As the UK progresses toward its net-zero goals, these insights provide valuable guidance for architects, developers, and policymakers in fostering sustainable building practices. Future research directions include expanding the analysis to other global sustainability frameworks, assessing the economic and human-centric impacts of certifications, and integrating Geographic Information Systems (GIS) for more nuanced spatial analyses.