Engineering LibreTexts

2: Collecting and Preparing Data

  • 2.0: Introduction
    This page emphasizes the importance of data collection and preparation in the data science cycle, highlighting their role in ensuring data quality and readiness for analysis. These steps involve systematic gathering and meticulous preparation of data to detect errors, which is essential given the increasing volume of data and complexities such as unstructured data. Mastery of these processes enables organizations to gain insights that promote business growth and efficiency.
  • 2.1: Overview of Data Collection Methods
    This page outlines the systematic process of data collection crucial for data science, emphasizing the need to understand project objectives. It discusses various data collection methods like surveys and experiments, and distinguishes between observational and transactional data, each providing different insights.
  • 2.2: Survey Design and Implementation
    This page outlines research objectives and emphasizes the importance of effective survey design, including clear objectives, structured questions, and bias avoidance. It discusses various sampling techniques (purposive, snowball, quota, and volunteer sampling) and their impact on research relevance and generalizability.
  • 2.3: Web Scraping and Social Media Data Collection
    This page provides an overview of web scraping and social media data collection, emphasizing Python techniques like web crawling, XPath, and APIs for data extraction. It introduces libraries such as Pandas, Beautiful Soup, and NLTK for data manipulation. The text also covers natural language processing with SpaCy and the use of regular expressions for text parsing.
  • 2.4: Data Cleaning and Preprocessing
    This page discusses the significance of data cleaning and preprocessing in data science, highlighting processes such as data integration, transformation, and validation. It emphasizes the need to handle missing data and outliers and outlines techniques like imputation and robust statistical methods to maintain data integrity.
  • 2.5: Handling Large Datasets
    This page discusses strategies to enhance patient outcomes and reduce costs through effective data management techniques such as data compression, indexing, and database systems. It highlights the importance of cloud computing for efficient data storage and collaboration, particularly for large insurance companies, offering solutions like cloud migration, data archiving, and hybrid cloud approaches.
  • 2.6: Key Terms
    This page offers detailed definitions and explanations of key concepts in data processing, including APIs, big data, cloud computing, and techniques such as data compression and normalization. It addresses different database types (relational and NoSQL), data quality challenges (like missing data and measurement errors), as well as data aggregation and transformation methods.
  • 2.7: Group Project
    This page outlines two main projects. Project A focuses on collecting and analyzing data on species that have gone extinct due to climate change, and involves various research and data management tasks. Project B aims to record and analyze temperature changes over a 24-hour period at a specific location, using standardized observation conditions to detect patterns and discuss results.
  • 2.8: Critical Thinking
    This page discusses various data collection and analysis methodologies across different contexts, including surveys on cafeteria food opinions, city budget perceptions, and VR gaming center interest. It also emphasizes the importance of cloud computing in managing high school football data, noting its scalability, real-time analysis capabilities, and integration with other software to enhance training and strategy.
  • 2.9: References
    This page lists two academic references: Elfil and Negida (2017), on sampling methods in clinical research, and Lusinchi (2012), on how automobile and telephone ownership affected the accuracy of the 1936 Literary Digest poll. Both were published in journals.


This page titled 2: Collecting and Preparing Data is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform.
