Scholarly record
A CLOUD-BASED ARCHITECTURE FOR AUTOMATED DATA EXTRACTION AND INTEGRATION FROM MULTIPLE GOVERNMENT SOURCES
Abstract
This article presents a technological approach for extracting and combining data from government websites, making disparate information easier to access and analyze in response to user queries. The study addresses the challenges of extracting information from complex government sources, which often publish regulatory documents in a variety of formats. The lack of open access complicates data consolidation and fragments search and analysis. Data scraping is proposed as a solution, although it must contend with complex site hierarchies, diverse formats, and inconsistent data structures. The paper outlines a method for building a centralized data repository in the cloud that is regularly updated and used for training large language models (LLMs). The process begins with a Python scraper that collects documents of various types, which are then processed and stored in an S3 bucket. Finally, users submit specific queries through an LLM interface, which improves access to public information and reduces bureaucratic burden. The resulting architecture enables automated, scalable storage of data from multiple government sources.
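The storage step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the key layout (`raw/<host>/<hash>/<file name>`), the `ObjectStore` interface, and the names `InMemoryStore`, `object_key`, and `store_document` are all assumptions introduced for illustration. The in-memory store stands in for the S3 bucket so that the sketch runs without cloud credentials.

```python
import hashlib
import mimetypes
from typing import Protocol
from urllib.parse import urlparse


class ObjectStore(Protocol):
    """Hypothetical storage interface; an S3-backed version is assumed to
    wrap a client call such as boto3's put_object(Bucket, Key, Body, ContentType)."""

    def put(self, key: str, body: bytes, content_type: str) -> None: ...


class InMemoryStore:
    """Local stand-in for the S3 bucket (assumption, for testing only)."""

    def __init__(self) -> None:
        self.objects: dict[str, tuple[bytes, str]] = {}

    def put(self, key: str, body: bytes, content_type: str) -> None:
        self.objects[key] = (body, content_type)


def object_key(url: str, body: bytes) -> str:
    """Build a deterministic key from the source host, a content hash, and the file name."""
    parsed = urlparse(url)
    # Fall back to a generic name when the URL path ends in a slash.
    name = parsed.path.rsplit("/", 1)[-1] or "index.html"
    # A content hash in the key means an unchanged document maps to the same object.
    digest = hashlib.sha256(body).hexdigest()[:12]
    return f"raw/{parsed.netloc}/{digest}/{name}"


def store_document(store: ObjectStore, url: str, body: bytes) -> str:
    """Store one scraped document and return the key it was written under."""
    key = object_key(url, body)
    # Guess the MIME type from the file extension; default to a binary type.
    content_type = mimetypes.guess_type(key)[0] or "application/octet-stream"
    store.put(key, body, content_type)
    return key
```

Including the content hash in the key is one possible design choice: re-scraping an unchanged document overwrites the same object rather than creating a duplicate, while a revised document is stored under a new key.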

