A CLOUD-BASED ARCHITECTURE FOR AUTOMATED DATA EXTRACTION AND INTEGRATION FROM MULTIPLE GOVERNMENT SOURCES

Kristina Dineva, Tatiana Atanasova

First published: 2025-12-27
DOI: https://doi.org/10.5593/sgem2025v/4.2/s21.96.1

Abstract

This article presents a technological approach for extracting and combining data from government websites, with the aim of easing access to disparate information and supporting analysis of it in response to user queries. The study addresses the challenges of extracting information from complex government sources, which often contain regulatory documents in various formats. The lack of open access complicates data consolidation and leads to fragmented search and analysis processes. Data scraping is proposed as a solution, although it must contend with complex site hierarchies, diverse formats, and inconsistent data structures. The paper outlines a method for creating a centralized data repository in the cloud that is regularly updated and used for training large language models (LLMs). The process begins with a Python scraper that collects various document types, which are then processed and stored in an S3 bucket. Users can then submit specific queries through an LLM interface, which improves access to public information and reduces bureaucratic burden. The resulting architecture enables automated, scalable storage of data from multiple government sources.
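The collection and storage steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the file types, key layout, or helper names, so `DOCUMENT_EXTENSIONS`, `extract_document_links`, `object_key`, and `store_document` are all hypothetical. The sketch assumes that documents are found as links on an HTML page and that the storage client exposes a `put_object(Bucket=..., Key=..., Body=...)` method, as the boto3 S3 client does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import hashlib
import posixpath

# Assumption: the abstract only says "various document types";
# this set of extensions is illustrative.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv"}


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links that point at document files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        url = urljoin(self.base_url, href)
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext in DOCUMENT_EXTENSIONS and url not in self.links:
            self.links.append(url)


def extract_document_links(html, base_url):
    """Returns document URLs found in the page, in order, without duplicates."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links


def object_key(url, content):
    """Builds a storage key of the form <host>/<sha256 prefix>/<file name>.

    Hypothetical layout: the content hash keeps changed versions of the
    same file under different keys when the source is scraped again.
    """
    parsed = urlparse(url)
    digest = hashlib.sha256(content).hexdigest()[:16]
    name = posixpath.basename(parsed.path)
    return f"{parsed.netloc}/{digest}/{name}"


def store_document(s3_client, bucket, url, content):
    """Uploads one document and returns the key it was stored under."""
    key = object_key(url, content)
    s3_client.put_object(Bucket=bucket, Key=key, Body=content)
    return key
```

In a deployment, `s3_client` could be created with `boto3.client("s3")`, and the page HTML and document bytes would come from an HTTP client. Both are left out here so that the sketch runs without network access or credentials.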

Dimensions ID: pub.1199508409

Publication details

Title: A CLOUD-BASED ARCHITECTURE FOR AUTOMATED DATA EXTRACTION AND INTEGRATION FROM MULTIPLE GOVERNMENT SOURCES
Authors: Kristina Dineva, Tatiana Atanasova
Proceedings: 25th International Multidisciplinary Scientific GeoConference Proceedings SGEM 2025, Energy and Clean Technologies
Publisher: STEF92 Technology
Year: 2025
Pages: 891-898
SWS Citekey: Dineva202522883890
ISSN: 1314-2704
ISBN: 9786197603934
Language: en
Publication type: Conference Paper