Dataset Search and Resources

This is a collection of resources. Some of them could also be useful for OSINT research, but I mainly researched and aggregated them for data science.

Mental framework before picking any dataset [1]:

  • What’s the end goal? Visualization, prediction, or just processing practice?
  • How messy can I tolerate? Cleaning is a skill, but not every project needs to be a cleaning marathon
  • Is the data interesting enough to sustain curiosity? The best projects answer a question you actually care about
  • What’s the license? Especially relevant if you’re publishing or building something commercial

Centralized Aggregation and Major Data Hubs

Name | Direct URL | Data Type | Specialty
Google Dataset Search | datasetsearch.research.google.com | Meta-Index | Universal discovery across all academic and governmental domains
Kaggle Datasets | datasets | Tabular, Image, Text | Curated datasets for AI training and community-vetted social stats
Hugging Face Datasets | datasets | Text, Audio, Multimodal | Primary hub for Natural Language Processing (NLP) and LLM training
GitHub Data | dataset | Varied | Real-time, developer-maintained repositories and version-controlled data
UCI Machine Learning Repository | ml | Varied | Historical benchmarks for algorithm testing and validation
DataHub | datahub.io | Tabular, JSON | High-quality core datasets including GDP, climate, and finance
Zenodo | zenodo.org | Scientific, Multimodal | CERN-developed platform for open-access research data and large files
Figshare | figshare.com | Academic, Multimedia | Visualizations, figures, and supplemental research outputs
Dryad | datadryad.org | Biological, Medical | Curated repository for evolutionary, genetic, and ecological data
AWS Open Data | registry.opendata.aws | Cloud-Native, Geospatial | Petabyte-scale datasets optimized for cloud computation

Community-led data discovery

Essentially, these are the places to ask for niche datasets that are often unindexed or require community verification.

Awesome Lists

List Name | URL | Domain | Specialty
Awesome Open Data | awesome-open-data | General | High-quality portals across all domains
Awesome Public Datasets | awesome-public-datasets | Multi-topic | Topic-centric data from agriculture to networks
Awesome Real-Time Datasets | awesome-public-real-time-datasets | Live | Finance, Transportation, and IoT streams
Awesome Legal NLP | awesome-legal-nlp | Law | Swiss, EU, and US court judgments
Awesome Single Cell | awesome-single-cell | Biology | Gene regulatory networks and transcriptomics
Awesome Sports Analytics | Awesome-Sports-Analytics | Sports | Tracking data for soccer, basketball, and more
Awesome Computational Social Science | awesome-computational-social-science | Sociology | Social influence and mobilization experiments
Awesome OSS Research Data | awesome-oss-research-data | Software | Historical activity from GitHub and PyPI
Awesome Computer Vision | awesome-computer-vision | AI/CV | Stereo vision, optical flow, and image deblurring
Awesome DL4NLP | awesome-dl4nlp | NLP | Question answering and word embedding sets

Reddit and Lemmy

Community Name | Platform/URL | Specialty
r/datasets | datasets | Community-driven sourcing for specific research needs
r/DataHoarder | DataHoarder | Large-scale archival, P2P links, and preservation tips
r/opendirectories | opendirectories | Discovery of exposed servers containing diverse datasets
mander.xyz (Lemmy) | mander.xyz | Science-focused Lemmy community
r/OSINT | OSINT | Tools and techniques for finding hidden public data

Torrents

qBittorrent is the suggested software for downloading these datasets.

Internet Archive

The Internet Archive contains petabytes of public domain material accessible via qBittorrent. For any item, a torrent file is automatically generated and accessible by appending ?format=Archive+BitTorrent to the search query.
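As a rough sketch of how such torrent links can be put together, the snippet below builds two URLs: a search restricted to items that ship a torrent, and the torrent file location for a single item. The `advancedsearch.php` endpoint, the `format:"Archive BitTorrent"` filter, and the `<identifier>_archive.torrent` naming pattern are assumptions about how archive.org usually exposes items, and the identifier `example-item` is hypothetical.

```python
from urllib.parse import quote, urlencode

ARCHIVE_BASE = "https://archive.org"  # assumption: the public archive.org host


def torrent_search_url(keywords: str, rows: int = 50) -> str:
    """Build a search URL limited to items that have a torrent file."""
    params = [
        # assumption: torrent-enabled items carry the "Archive BitTorrent" format
        ("q", f'{keywords} AND format:"Archive BitTorrent"'),
        ("fl[]", "identifier"),   # only return item identifiers
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return f"{ARCHIVE_BASE}/advancedsearch.php?{urlencode(params)}"


def torrent_file_url(identifier: str) -> str:
    """Build the assumed location of the auto-generated torrent for one item."""
    safe_id = quote(identifier, safe="")
    return f"{ARCHIVE_BASE}/download/{safe_id}/{safe_id}_archive.torrent"


# "example-item" is a hypothetical identifier used only for illustration.
print(torrent_search_url("census"))
print(torrent_file_url("example-item"))
```

The resulting `.torrent` URL can be opened directly in qBittorrent; verify the pattern against a real item before scripting bulk downloads.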

Academic Torrents

Research Torrent Indices | URL/DOI | Data Type | Size/Scope
Crossref 2026 Metadata | nggf-vt1j | JSON-Lines | 180M records (208 GB compressed)
GHTorrent | ghtorrent.org | SQL/Mirror | Offline mirror of historical GitHub activity
Photogrammetry Trench Models |  | 3D/OBJ | 
Experimental Archaeology |  | CSV | 

Censys and Shodan

Common techniques in Censys for discovering data are:

  • The “open-dir” Label: Searching labels:open-dir returns over 450,000 active open directories.
  • Suspicious-Open-Dir: The labels:suspicious-open-dir query filters these results to roughly 1% of the total, focusing on directories containing executables, C2 logs, or leaked credentials.
  • Response Body Pivoting: By searching services.http.response.body:".csv", researchers can find servers currently serving raw data files.
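The queries above can be combined into a single search string. The helper below is only a sketch: it builds text to paste into the Censys search box and makes no API call. Joining the two filters with `and` is an assumption about the Censys query syntax, and the function name is hypothetical.

```python
def censys_open_dir_query(extension: str = ".csv", suspicious: bool = False) -> str:
    """Combine an open-directory label with a response-body filter."""
    label = "suspicious-open-dir" if suspicious else "open-dir"
    # Escape backslashes and double quotes so the needle stays one quoted phrase.
    needle = extension.replace("\\", "\\\\").replace('"', '\\"')
    # assumption: filters are joined with the boolean operator "and"
    return f'labels:{label} and services.http.response.body:"{needle}"'


print(censys_open_dir_query())          # open directories mentioning .csv
print(censys_open_dir_query(".jsonl"))  # same idea for another extension
```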

Google Dorking for hidden files

Dork Query | Target Information | Ease of Use
intitle:"index of" "parent directory".csv | Finds open directories serving CSV files | 1
filetype:sql "backup" "internal" | Locates exposed SQL database backups | 1
intext:"aws_access_key_id" filetype:env | Finds sensitive AWS configuration keys | 2
site:gov filetype:pdf "2026 census" | Finds unlinked government reports | 1
inurl:api inurl:schema | Discovers exposed API documentation and schemas | 3
intitle:"index of /" site:edu | Finds open research directories at universities | 1
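Dork queries contain quotes, colons, and slashes, so they need URL encoding before they can be shared as links. A minimal sketch, assuming the standard `https://www.google.com/search?q=` entry point and using two of the queries from the table above:

```python
from urllib.parse import urlencode


def google_search_url(dork: str) -> str:
    """Return a shareable search link for a dork query."""
    # urlencode percent-encodes quotes and colons and turns spaces into "+"
    return "https://www.google.com/search?" + urlencode({"q": dork})


DORKS = [
    'site:gov filetype:pdf "2026 census"',
    'intitle:"index of /" site:edu',
]

for dork in DORKS:
    print(google_search_url(dork))
```

Opening the links by hand is the safer workflow here; automated scraping of search results tends to be rate-limited quickly.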

FTP Servers

Universities and government agencies use FTP servers to store astronomical, genomic, and meteorological data that is too large or too legacy for modern web interfaces.

Institution/Organization | Direct URL | Data Type | Ease of Use | Specialty
NCBI (Health/Genomics) | ftp.ncbi.nlm.nih.gov | Genomic/Protein | 4 | Primary global repository for sequence data
NOAA (Climate/Weather) | ftp.ncep.noaa.gov | Meteorological | 4 | Real-time weather models and climate archives
CDC (Public Health) | ftp.cdc.gov | Tabular/CSV | 3 | US public health stats and vital records
SUNET (Academic Mirror) | ftp.sunet.se | Varied | 2 | Swedish University Network mirror of 53TB of data
Harvard Dataverse | dataverse.harvard.edu | Multi-topic | 1 | Peer-reviewed social and economic datasets
Stanford Genome | yeastgenome.org (FTP) | Genetic | 3 | Candida and yeast genomic databases
NASA Earthdata | ncei.noaa.gov/pub | Geospatial | 4 | Satellite imagery and ocean measurements
EMBL-EBI (Bioinformatics) | ftp.ebi.ac.uk | Biological | 4 | 53TB of molecular biology data from Europe
Smithsonian Institution | globalvolcano.si.edu | Geological | 2 | Global volcano and eruption database
EPA (Environment) | gaftp.epa.gov | Environmental | 4 | 4.8TB of air quality and landfill data
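Browsing these servers does not require a dedicated client. The sketch below lists a directory using Python's standard `ftplib`, assuming the server accepts anonymous logins; whether each host in the table still does is not verified here. The `ftp_factory` parameter is a hypothetical hook added so the function can be exercised without network access.

```python
from ftplib import FTP
from typing import Callable, List


def list_ftp_directory(
    host: str,
    path: str = "/",
    timeout: float = 30.0,
    ftp_factory: Callable[..., FTP] = FTP,
) -> List[str]:
    """Return the sorted entry names found in `path` on an anonymous FTP server."""
    with ftp_factory(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments means an anonymous login
        return sorted(ftp.nlst(path))


# Example call (needs network access, so it is left commented out):
# for name in list_ftp_directory("ftp.ncbi.nlm.nih.gov", "/"):
#     print(name)
```

Listing first and downloading selectively matters on these hosts, since a single directory can hold terabytes.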

Institutional Repositories

European Union & National Data Portals

Name & Direct URL | Data Type | Ease of Use | Specialty
data.europa.eu | Meta-Index | 1 | Central hub harvesting data from all 27 EU Member States and EFTA countries
Eurostat Database | Tabular (TSV, CSV, SDMX) | 3 | Official EU socio-economic statistics; allows bulk downloads and API access
Copernicus Data Space Ecosystem | Geospatial, Image | 4 | Real-time and historical Sentinel satellite missions (1, 2, 3, 5P) for Earth observation
European Parliament Data | Legal, Text | 2 | Verbatim reports, plenary speeches, and meeting agendas (updated for 2026)
European Social Survey | Survey, Social | 2 | Academic-grade data on European attitudes, trust in politicians, and loneliness by age group
dane.gov.pl (Poland) | Multi-sector | 1 | Top-rated for usability (100% score in 2024); includes deep 2026 health and science data
data.gov.be (Belgium) | XML, CSV, Geo | 2 | Federal Belgian data; strong for cadastral parcels, road accidents, and address (BeSt) data
data.overheid.nl (Netherlands) | Tabular, Metadata | 2 | Daily harvested registry of 150+ Dutch government organizations
data.gv.at (Austria) | Tabular, Statistics | 2 | Central access point for machine-readable Austrian official statistical data
GovData (Germany) | Metadata | 2 | Central metadata portal for German federal and state governments

International Economic and Development Portals

Institution | Direct URL | Data Type | Ease of Use | Specialty
World Bank Open Data | data.worldbank.org | Development | 1 | Global health, poverty, and education stats
IMF Data Portal | data.imf.org | Financial | 2 | Macroeconomic stats, WEO, and COFER surveys
WTO Stats Portal | data.wto.org | Trade | 2 | Global merchandise and services trade flows
WTO BaTiS | wto_sts | Balanced Trade | 3 | Reconciled picture of bilateral services trade
HDX (Humanitarian) | data.humdata.org | Humanitarian | 1 | Crisis-response data (e.g., Zika, earthquakes)
OECD Data | data.oecd.org | Economic | 2 | Standardized indicators for developed nations
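Several of these portals can also be queried programmatically. As an illustration only, the sketch below targets the World Bank's public v2 API: it builds a request URL for one indicator and parses a response shaped as `[metadata, [records]]`. The endpoint layout, the indicator code `NY.GDP.MKTP.CD` (GDP in current US$), and the sample payload values are assumptions made for the example and do not come from the table above.

```python
import json
from typing import Dict, Optional
from urllib.parse import urlencode

API_BASE = "https://api.worldbank.org/v2"  # assumption: public World Bank API root


def indicator_url(country: str, indicator: str, per_page: int = 100) -> str:
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{API_BASE}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator(payload: str) -> Dict[str, Optional[float]]:
    """Map year -> value from a response shaped as [metadata, [records...]]."""
    body = json.loads(payload)
    # Error responses carry no second element, so fall back to an empty list.
    records = body[1] if len(body) > 1 and body[1] else []
    return {rec["date"]: rec["value"] for rec in records}


# Made-up payload that mimics the assumed response shape (values are not real).
sample = '[{"page": 1}, [{"date": "2020", "value": 1.5}, {"date": "2019", "value": null}]]'
print(indicator_url("PL", "NY.GDP.MKTP.CD"))
print(parse_indicator(sample))
```

Missing observations come back as `null`, so downstream code should expect `None` values rather than assume a complete series.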

Global Health and Demographics

Institution | Direct URL | Data Type | Ease of Use | Specialty
WHO GHO | gho | Health Stats | 2 | Health indicators for 194 Member States
GHDx (IHME) | ghdx.healthdata.org | Survey/Census | 3 | Comprehensive catalog of global health data
UN Population Portal | dataportal | Demographic | 4 | API-accessible demographic projections (1950-2030)
UNdata | data.un.org | Multi-topic | 2 | Unified search for UN system statistics
UN CEB Secretariat | data-download | Finance/HR | 3 | CSV downloads for UN system operations data

Agricultural and Scientific Niche Repositories

Specialized science portals often require the highest degree of technical knowledge.

Domain | Name & URL | Ease of Use | Specialty
Agriculture | faostat) | 2 | Global food, crop, and livestock statistics
Food Security | en) | 3 | Food and Agriculture Microdata Catalogue
Climate | Copernicus CDS (cds.climate.copernicus.eu) | 3 | Sentinel satellite data and climate monitoring
Weather | data) | 2 | High-resolution climate records for the Pacific
Energy | US Energy Grid (Search via Bytewax) | 5 | Real-time US energy grid and flood alerts
Physics/Gen | CERN Zenodo (zenodo.org) | 2 | Digital repository for all scientific fields

Footnotes

  1. where-to-find-datasets-for-your-data-science-projects-the-complete-2026-field-guide-351aaaeadb76