Scholarly record
DATA COLLECTION METHODOLOGY FOR ARTIFICIAL INTELLIGENCE MODELS IN SOCIALLY RELEVANT AREAS
Abstract
Data collection and storage are crucial to the development of artificial intelligence (AI) in socially relevant areas. Despite the increasing volume of information, it is often scattered across various sources, formats, and platforms, making effective analysis and application challenging. The lack of a unified framework for data integration further complicates efforts to leverage AI for informed decision-making, policy formulation, and public service optimisation. This article proposes a structured methodology for collecting data from publicly available information (PAI), aimed at providing reliable, accessible, and compatible data for AI models. The methodology encompasses the identification of key social sectors for data collection. It also explores automated and semi-automated data collection techniques, such as API integrations, web scraping, crowdsourced data acquisition, and survey methodologies. Additionally, the article highlights the importance of data validation, normalisation, and anonymisation to ensure accuracy, consistency, and compliance with regulatory requirements such as the GDPR and the AI Act.
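As an illustration of the validation, normalisation, and anonymisation steps named in the abstract, the following is a minimal sketch and not the authors' implementation. The record fields (`sector`, `date`, `text`), the `DD.MM.YYYY` input date format, the function names, and the salt value are all assumptions made for this example. Note that replacing e-mail addresses with a salted hash is pseudonymisation, which is weaker than full anonymisation in the GDPR sense.

```python
import hashlib
import re
from datetime import datetime

# Hypothetical pattern for e-mail addresses; real pipelines would cover more identifier types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def validate(record):
    """Accept a record only if sector and text are non-empty and the date parses."""
    if not record.get("sector") or not record.get("text"):
        return False
    try:
        datetime.strptime(record["date"], "%d.%m.%Y")  # assumed source format
    except (KeyError, TypeError, ValueError):
        return False
    return True


def normalise(record):
    """Return a copy with a lower-cased sector, ISO 8601 date, and collapsed whitespace."""
    out = dict(record)
    out["sector"] = record["sector"].strip().lower()
    out["date"] = datetime.strptime(record["date"], "%d.%m.%Y").date().isoformat()
    out["text"] = " ".join(record["text"].split())
    return out


def pseudonymise(record, salt):
    """Replace each e-mail address in the text with a short salted SHA-256 token."""
    def _mask(match):
        value = (salt + match.group(0).lower()).encode("utf-8")
        return "<email:" + hashlib.sha256(value).hexdigest()[:8] + ">"

    out = dict(record)
    out["text"] = EMAIL_RE.sub(_mask, record["text"])
    return out


def prepare(records, salt="demo-salt"):
    """Drop invalid records, then normalise and pseudonymise the rest."""
    return [pseudonymise(normalise(r), salt) for r in records if validate(r)]
```

In this sketch, records that fail validation are silently dropped; a production pipeline would more likely log or quarantine them for review.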

