Dataset Search and Resources §
This is a collection of resources. Some of these could also be useful for OSINT research, but I mainly researched and aggregated them for data science.
Mental framework before picking any dataset:
- What’s the end goal? Visualization, prediction, or just processing practice?
- How messy can I tolerate? Cleaning is a skill, but not every project needs to be a cleaning marathon.
- Is the data interesting enough to sustain curiosity? The best projects answer a question you actually care about.
- What’s the license? Especially relevant if you’re publishing or building something commercial.
Centralized Aggregation and Major Data Hubs §
| Name | Direct URL | Data Type | Specialty |
|---|---|---|---|
| Google Dataset Search | datasetsearch.research.google.com | Meta-Index | Universal discovery across all academic and governmental domains |
| Kaggle Datasets | datasets | Tabular, Image, Text | Curated datasets for AI training and community-vetted social stats |
| Hugging Face Datasets | datasets | Text, Audio, Multimodal | Primary hub for Natural Language Processing (NLP) and LLM training |
| GitHub Data | dataset | Varied | Real-time, developer-maintained repositories and version-controlled data |
| UCI Machine Learning Repository | ml | Varied | Historical benchmarks for algorithm testing and validation |
| DataHub | datahub.io | Tabular, JSON | High-quality core datasets including GDP, climate, and finance |
| Zenodo | zenodo.org | Scientific, Multimodal | CERN-developed platform for open-access research data and large files |
| Figshare | figshare.com | Academic, Multimedia | Visualizations, figures, and supplemental research outputs |
| Dryad | datadryad.org | Biological, Medical | Curated repository for evolutionary, genetic, and ecological data |
| AWS Open Data | registry.opendata.aws | Cloud-Native, Geospatial | Petabyte-scale datasets optimized for cloud computation |
For niche datasets that are often unindexed or require community verification, you essentially have to ask in the communities listed below.
Awesome Lists §
Reddit and Lemmy §
| Community Name | Platform/URL | Specialty |
|---|---|---|
| r/datasets | datasets | Community-driven sourcing for specific research needs |
| r/DataHoarder | DataHoarder | Large-scale archival, P2P links, and preservation tips |
| r/opendirectories | opendirectories | Discovery of exposed servers containing diverse datasets |
| mander.xyz (Lemmy) | mander.xyz | Science-focused Lemmy community |
| r/OSINT | OSINT | Tools and techniques for finding hidden public data |
Torrents §
qBittorrent is the suggested software for downloading these datasets.
Internet Archive §
The Internet Archive contains petabytes of public domain material accessible via qBittorrent. For any item, a torrent file is automatically generated and accessible by appending ?format=Archive+BitTorrent to the search query.
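As a rough illustration of the filter described above, the following sketch builds a search URL with `format=Archive+BitTorrent` appended. The base address `https://archive.org/search`, the `query` parameter name, and the helper name `torrent_search_url` are assumptions made for this example; whether the site honors the filter exactly as described is taken from the text above rather than verified here.

```python
from urllib.parse import urlencode

# Assumption: the public search page lives at this address and reads the
# search terms from a `query` parameter.
SEARCH_URL = "https://archive.org/search"


def torrent_search_url(terms: str) -> str:
    """Return a search URL with the BitTorrent format filter appended."""
    params = {"query": terms, "format": "Archive BitTorrent"}
    # urlencode() encodes the space as "+", giving format=Archive+BitTorrent.
    return f"{SEARCH_URL}?{urlencode(params)}"


print(torrent_search_url("census microdata"))
```

The resulting link can be opened in a browser; any torrent file found that way can then be handed to qBittorrent.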
Academic Torrents §
| Research Torrent Indices | URL/DOI | Data Type | Size/Scope |
|---|---|---|---|
| Crossref 2026 Metadata | nggf-vt1j | JSON-Lines | 180M records (208 GB compressed) |
| GHTorrent | ghtorrent.org | SQL/Mirror | Offline mirror of historical GitHub activity |
| Photogrammetry Trench Models | | 3D/OBJ | |
| Experimental Archaeology | | CSV | |
Open Directories and grey-legal data §
Censys and Shodan §
Common techniques in Censys for discovering data are:
- The “open-dir” Label: Searching labels:open-dir returns over 450,000 active open directories.
- Suspicious-Open-Dir: The labels:suspicious-open-dir query filters these results to roughly 1% of the total, focusing on directories containing executables, C2 logs, or leaked credentials.
- Response Body Pivoting: By searching services.http.response.body:".csv", researchers can find servers currently serving raw data files.
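The queries above are plain strings, so they are easy to generate for several file extensions at once. The sketch below only composes query text and does not contact Censys. The helper names (`label_query`, `body_query`, `all_of`) are hypothetical, and joining filters with `and` is an assumption about the search language that should be checked against the Censys documentation.

```python
def label_query(label: str) -> str:
    """Build a label filter such as labels:open-dir."""
    return f"labels:{label}"


def body_query(fragment: str) -> str:
    """Build a response-body filter, quoting the fragment."""
    escaped = fragment.replace('"', '\\"')
    return f'services.http.response.body:"{escaped}"'


def all_of(*queries: str) -> str:
    """Join filters with `and` (assumed boolean operator)."""
    return " and ".join(queries)


# One combined query per file extension of interest.
extensions = [".csv", ".json", ".sql"]
for ext in extensions:
    print(all_of(label_query("open-dir"), body_query(ext)))
```

Each printed line can be pasted into the search box as-is.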
Google Dorking for hidden files §
| Dork Query | Target Information | Ease of Use |
|---|---|---|
| intitle:"index of" "parent directory".csv | Finds open directories serving CSV files | 1 |
| filetype:sql "backup" "internal" | Locates exposed SQL database backups | 1 |
| intext:"aws_access_key_id" filetype:env | Finds sensitive AWS configuration keys | 2 |
| site:gov filetype:pdf "2026 census" | Finds unlinked government reports | 1 |
| inurl:api inurl:schema | Discovers exposed API documentation and schemas | 3 |
| intitle:"index of /" site:edu | Finds open research directories at universities | 1 |
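Dorks like the ones in the table are just operator/value pairs and quoted phrases joined by spaces. A minimal sketch of composing one and turning it into a search link follows; the helper names (`operator`, `phrase`, `search_url`) are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlencode


def operator(name: str, value: str) -> str:
    """Format one operator, quoting the value when it contains spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{name}:{value}"


def phrase(text: str) -> str:
    """Wrap a phrase in double quotes for exact matching."""
    return f'"{text}"'


def search_url(query: str) -> str:
    """Return a Google search URL with the query percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Rebuild the government-report dork from the table above.
query = " ".join([
    operator("site", "gov"),
    operator("filetype", "pdf"),
    phrase("2026 census"),
])
print(query)  # site:gov filetype:pdf "2026 census"
print(search_url(query))
```

Keeping the parts separate makes it easy to swap the domain or file type without retyping the whole query.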
FTP Servers §
Universities and government agencies use these to store astronomical, genomic, and meteorological data that is too large or too legacy for modern web interfaces.
| Institution/Organization | Direct URL | Data Type | Ease of Use | Specialty |
|---|---|---|---|---|
| NCBI (Health/Genomics) | ftp.ncbi.nlm.nih.gov | Genomic/Protein | 4 | Primary global repository for sequence data |
| NOAA (Climate/Weather) | ftp.ncep.noaa.gov | Meteorological | 4 | Real-time weather models and climate archives |
| CDC (Public Health) | ftp.cdc.gov | Tabular/CSV | 3 | US public health stats and vital records |
| SUNET (Academic Mirror) | ftp.sunet.se | Varied | 2 | Swedish University Network mirror of 53TB of data |
| Harvard Dataverse | dataverse.harvard.edu | Multi-topic | 1 | Peer-reviewed social and economic datasets |
| Stanford Genome | yeastgenome.org (FTP) | Genetic | 3 | Candida and yeast genomic databases |
| NASA Earthdata | ncei.noaa.gov/pub | Geospatial | 4 | Satellite imagery and ocean measurements |
| EMBL-EBI (Bioinformatics) | ftp.ebi.ac.uk | Biological | 4 | 53TB of molecular biology data from Europe |
| Smithsonian Institution | globalvolcano.si.edu | Geological | 2 | Global volcano and eruption database |
| EPA (Environment) | gaftp.epa.gov | Environmental | 4 | 4.8TB of air quality and landfill data |
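Browsing one of these servers does not need a dedicated client. Below is a minimal sketch using Python's standard `ftplib`, assuming the server still accepts anonymous logins and plain FTP (some institutions have moved to HTTPS-only access). The function names `list_directory` and `list_anonymous` are hypothetical.

```python
from ftplib import FTP


def list_directory(ftp, path: str = "/") -> list:
    """Return the sorted entry names in `path` on an open FTP connection."""
    ftp.cwd(path)
    return sorted(ftp.nlst())


def list_anonymous(host: str, path: str = "/", timeout: int = 30) -> list:
    """Log in anonymously to `host` and list `path`."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments means an anonymous login
        return list_directory(ftp, path)
```

With network access, `list_anonymous("ftp.ncbi.nlm.nih.gov")` would return the top-level entries of the NCBI server from the table. Separating `list_directory` from the login step keeps the listing logic usable with any already-open connection.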
Institutional Repositories §
European Union & National Data Portals §
| Name & Direct URL | Data Type | Ease of Use | Specialty |
|---|---|---|---|
| data.europa.eu | Meta-Index | 1 | Central hub harvesting data from all 27 EU Member States and EFTA countries |
| Eurostat Database | Tabular (TSV, CSV, SDMX) | 3 | Official EU socio-economic statistics; allows bulk downloads and API access |
| Copernicus Data Space Ecosystem | Geospatial, Image | 4 | Real-time and historical Sentinel satellite missions (1, 2, 3, 5P) for Earth observation |
| European Parliament Data | Legal, Text | 2 | Verbatim reports, plenary speeches, and meeting agendas (updated for 2026) |
| European Social Survey | Survey, Social | 2 | Academic-grade data on European attitudes, trust in politicians, and loneliness by age group |
| dane.gov.pl (Poland) | Multi-sector | 1 | Top-rated for usability (100% score in 2024); includes deep 2026 health and science data |
| data.gov.be (Belgium) | XML, CSV, Geo | 2 | Federal Belgian data; strong for cadastral parcels, road accidents, and address (BeSt) data |
| data.overheid.nl (Netherlands) | Tabular, Metadata | 2 | Daily harvested registry of 150+ Dutch government organizations |
| data.gv.at (Austria) | Tabular, Statistics | 2 | Central access point for machine-readable Austrian official statistical data |
| GovData (Germany) | Metadata | 2 | Central metadata portal for German federal and state governments |
International Economic and Development Portals §
| Institution | Direct URL | Data Type | Ease of Use | Specialty |
|---|---|---|---|---|
| World Bank Open Data | data.worldbank.org | Development | 1 | Global health, poverty, and education stats |
| IMF Data Portal | data.imf.org | Financial | 2 | Macroeconomic stats, WEO, and COFER surveys |
| WTO Stats Portal | data.wto.org | Trade | 2 | Global merchandise and services trade flows |
| WTO BaTiS | wto_sts | Balanced Trade | 3 | Reconciled picture of bilateral services trade |
| HDX (Humanitarian) | data.humdata.org | Humanitarian | 1 | Crisis-response data (e.g., Zika, earthquakes) |
| OECD Data | data.oecd.org | Economic | 2 | Standardized indicators for developed nations |
Global Health and Demographics §
| Institution | Direct URL | Data Type | Ease of Use | Specialty |
|---|---|---|---|---|
| WHO GHO | gho | Health Stats | 2 | Health indicators for 194 Member States |
| GHDx (IHME) | ghdx.healthdata.org | Survey/Census | 3 | Comprehensive catalog of global health data |
| UN Population Portal | dataportal | Demographic | 4 | API-accessible demographic projections (1950-2030) |
| UNdata | data.un.org | Multi-topic | 2 | Unified search for UN system statistics |
| UN CEB Secretariat | data-download | Finance/HR | 3 | CSV downloads for UN system operations data |
Agricultural and Scientific Niche Repositories §
Specialized science portals often require the highest degree of technical knowledge.
| Domain | Name & URL | Ease of Use | Specialty |
|---|---|---|---|
| Agriculture | faostat) | 2 | Global food, crop, and livestock statistics |
| Food Security | en) | 3 | Food and Agriculture Microdata Catalogue |
| Climate | Copernicus CDS (cds.climate.copernicus.eu) | 3 | Sentinel satellite data and climate monitoring |
| Weather | data) | 2 | High-resolution climate records for the Pacific |
| Energy | US Energy Grid (Search via Bytewax) | 5 | Real-time US energy grid and flood alerts |
| Physics/Gen | CERN Zenodo (zenodo.org) | 2 | Digital repository for all scientific fields |