Methodology

Data Sources

This page lists every dataset ingested by Water Utility Report, with its source URL, update cadence, data format, and license status. It is updated whenever the ingestion pipelines change.

Data Sources: 8

Public Domain: 7/8

Refresh Cadences: 4

Prohibited Sources: 5

Core Utility Data (Quarterly)

EPA SDWIS — Safe Drinking Water Information System

Source

Update Cadence

Quarterly

Format

CSV / API

License

Public Domain — federal government data

Fields Ingested

Public Water System ID (PWSID), Utility name, System type, Population served, Violation records, Enforcement actions

Primary source for utility identity and compliance data. All utility pages use PWSID as the canonical system identifier.
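Because every utility page keys on the PWSID, it helps to normalize the identifier before using it for lookups. The sketch below is illustrative only: the function name and the assumed shape (a two-character prefix followed by seven digits) are assumptions, not taken from the ingestion pipeline, and should be checked against the SDWIS documentation.

```python
import re

# Assumed shape: two-character prefix (state postal code, or digits for
# tribal systems) followed by seven digits. This is an assumption.
PWSID_PATTERN = re.compile(r"[A-Z0-9]{2}[0-9]{7}")


def normalize_pwsid(raw: str) -> str:
    """Return an upper-cased, whitespace-stripped PWSID or raise ValueError."""
    candidate = raw.strip().upper()
    if PWSID_PATTERN.fullmatch(candidate) is None:
        raise ValueError(f"not a valid PWSID: {raw!r}")
    return candidate
```

Normalizing once at ingestion means that later joins between datasets can compare identifiers with plain string equality.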

Compliance & Enforcement (Frequent)

EPA ECHO — Enforcement and Compliance History Online

Source

Update Cadence

Weekly (ECHO refreshes frequently)

Format

CSV / API

License

Public Domain — federal government data

Fields Ingested

Violation details, Enforcement actions, Compliance schedules, Health-based violation flags, Resolution dates

Used to classify violation severity and to distinguish health-based violations from monitoring violations.
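The health-based versus monitoring distinction can be sketched as a lookup on a violation category code. The code sets and the function name below are assumptions made for illustration; the actual ECHO field names and code values should be confirmed against the ECHO data dictionary.

```python
# Assumed category codes; confirm against the ECHO/SDWIS data dictionary.
HEALTH_BASED_CATEGORIES = {"MCL", "MRDL", "TT"}
MONITORING_CATEGORIES = {"MR", "MON", "RPT"}


def classify_violation(category_code: str) -> str:
    """Map a violation category code to a coarse class."""
    code = category_code.strip().upper()
    if code in HEALTH_BASED_CATEGORIES:
        return "health_based"
    if code in MONITORING_CATEGORIES:
        return "monitoring"
    # Anything unrecognized is kept separate rather than guessed at.
    return "other"
```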

Contaminant Detection Data (Annual)

Consumer Confidence Reports (CCRs)

Source

Update Cadence

Annual (utilities publish by July 1 each year)

Format

PDF / HTML (utility-published)

License

Public record — annual reports required by SDWA

Fields Ingested

Detected contaminant levels, MCL comparisons, Source water information, Treatment summaries

Primary source for contaminant level data shown on utility pages. Ingestion requires PDF parsing; confidence score reflects parse quality.
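The page does not say how the confidence score is computed. One simple possibility, shown here purely as an assumption, is the fraction of expected table cells that the PDF parser managed to fill. The field names and the function are hypothetical.

```python
# Hypothetical set of fields a parsed CCR table row should contain.
EXPECTED_FIELDS = ("contaminant", "detected_level", "mcl", "units")


def parse_confidence(rows: list[dict]) -> float:
    """Fraction of expected cells that are non-empty across parsed rows."""
    if not rows:
        return 0.0
    filled = sum(
        1
        for row in rows
        for field in EXPECTED_FIELDS
        if row.get(field) not in (None, "")
    )
    return filled / (len(rows) * len(EXPECTED_FIELDS))
```

A report whose rows are all complete scores 1.0 under this scheme, and a report that yields no rows scores 0.0.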

Sampling Data (Frequent)

EPA Water Quality Portal

Source

Update Cadence

Continuous (portal aggregates from federal + state agencies)

Format

API / CSV

License

Public Domain — federal/state government data

Fields Ingested

Sampling event records, Lab result values, Sample location coordinates, Analytical methods used

Used as a supplemental source for contaminant data where CCR parsing yields low confidence. Results are cross-referenced against CCR data.
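The fallback from CCR data to Water Quality Portal data can be sketched as a threshold check. The threshold value, the `Reading` type, and the function name are all hypothetical; the page does not state what counts as low confidence.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff below which a CCR parse is treated as low confidence.
CCR_CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Reading:
    contaminant: str
    value: float
    source: str        # e.g. "CCR" or "WQP"
    confidence: float  # 0.0 to 1.0


def pick_reading(ccr: Optional[Reading], wqp: Optional[Reading]) -> Optional[Reading]:
    """Prefer the CCR reading unless its confidence is below the threshold."""
    if ccr is not None and ccr.confidence >= CCR_CONFIDENCE_THRESHOLD:
        return ccr
    if wqp is not None:
        return wqp
    # No supplemental data: fall back to whatever the CCR gave us, if anything.
    return ccr
```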

State-Level Data (Manual)

State Drinking Water Program Datasets

Source

Update Cadence

Varies by state — annual to real-time

Format

Varies by state (CSV, API, GIS)

License

Public records — verified per state terms before ingestion

Fields Ingested

State-specific utility details, Service area boundaries (where available), State MCLs where stricter than federal

Not all states publish granular datasets. Each state's terms are verified before ingestion. California, Texas, and Florida have robust open datasets.
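Applying a state MCL only where it is stricter than the federal one amounts to taking the lower of the two limits. The sketch below assumes both values are already in the same units; the function name is hypothetical.

```python
from typing import Optional


def effective_limit(federal_mcl: float, state_mcl: Optional[float]) -> float:
    """Return the stricter (lower) limit; assumes both use the same units."""
    if state_mcl is None:
        return federal_mcl
    return min(federal_mcl, state_mcl)
```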

Health Reference Data (Manual)

EPA and CDC Health Guidance Documents

Source

Update Cadence

As published (regulatory updates)

Format

HTML / PDF

License

Public Domain — federal government publications

Fields Ingested

MCL values, MCLG values, Health effect descriptions, Treatment technique requirements

Source for all regulatory limits, MCLG values, and health-effect language on contaminant pages. Changes tracked with versioning.
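The page does not describe how versioning is implemented. A minimal sketch, assuming an append-only history of limits with effective dates, is shown here; the `LimitVersion` type and `limit_as_of` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LimitVersion:
    contaminant: str
    mcl: float
    units: str
    effective: date  # date this value took effect


def limit_as_of(history: list[LimitVersion], on: date) -> Optional[LimitVersion]:
    """Return the most recent version in effect on the given date, if any."""
    applicable = [v for v in history if v.effective <= on]
    if not applicable:
        return None
    return max(applicable, key=lambda v: v.effective)
```

Keeping old versions rather than overwriting them lets a page show which limit applied when a historical violation occurred.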

Geography & Population (Annual)

U.S. Census Bureau — TIGER/Line Shapefiles

Source

Update Cadence

Annual (TIGER updates; ACS population annual)

Format

Shapefile / GeoJSON

License

Public Domain — federal government data

Fields Ingested

ZIP code tabulation areas (ZCTAs), City/place boundaries, Population estimates, County boundaries

Used for ZIP→utility matching (spatial join of ZCTA polygons with utility service area boundaries). Match confidence reflects overlap percentage.
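The overlap-based match confidence can be illustrated with a deliberately simplified sketch. It uses axis-aligned rectangles as a stand-in for real ZCTA and service-area polygons, so it shows only the overlap arithmetic and is not a real spatial join. The `Box` type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def area(b: Box) -> float:
    """Area of a box; zero if the box is empty or inverted."""
    return max(0.0, b.max_x - b.min_x) * max(0.0, b.max_y - b.min_y)


def overlap_fraction(zcta: Box, service_area: Box) -> float:
    """Share of the ZCTA's area that falls inside the service area."""
    intersection = Box(
        max(zcta.min_x, service_area.min_x),
        max(zcta.min_y, service_area.min_y),
        min(zcta.max_x, service_area.max_x),
        min(zcta.max_y, service_area.max_y),
    )
    zcta_area = area(zcta)
    return area(intersection) / zcta_area if zcta_area else 0.0


def best_match(zcta: Box, service_areas: dict[str, Box]) -> Optional[tuple[str, float]]:
    """Return (PWSID, overlap fraction) for the best-overlapping utility."""
    scored = {pwsid: overlap_fraction(zcta, box) for pwsid, box in service_areas.items()}
    if not scored:
        return None
    pwsid = max(scored, key=scored.get)
    return (pwsid, scored[pwsid]) if scored[pwsid] > 0 else None
```

With real boundary data, the same fraction would be computed from polygon intersections in a projected coordinate system rather than from bounding boxes.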

Lab Directory (Manual)

NELAP / State Lab Certification Databases

Source

Update Cadence

Manually verified — labs are not in a single machine-readable federal dataset

Format

HTML (state program pages)

License

Public record — state-published certification lists

Fields Ingested

Lab name, NELAP accreditation status, State certifications, Analyte scopes

Lab entries are manually verified from state certification pages and flagged for re-verification annually. The directory is not comprehensive.
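The annual re-verification flag can be sketched as a date comparison. The 365-day interval and the function name are assumptions; the page says only that labs are flagged annually.

```python
from datetime import date, timedelta

# Assumed interval for "annually"; the exact rule is not stated in the source.
REVERIFY_AFTER = timedelta(days=365)


def needs_reverification(last_verified: date, today: date) -> bool:
    """True when at least a year has passed since the last manual check."""
    return today - last_verified >= REVERIFY_AFTER
```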