Computer Science
Understanding Data

UNDERSTANDING DATA: COMPLETE CUET STUDY NOTES


1. Introduction to Data and its Purpose


In the modern era, data is considered the bedrock of decision-making. In general terms, data is a collection of characters, numbers, and other symbols that represent values of specific situations or variables. The word "data" is plural; its singular form is "datum".


The Purpose of Data: We rarely make important decisions without looking at relevant information. For example:


  • Education: When choosing a college, students analyze placement data of previous years, faculty qualifications, fees, and hostel facilities.

  • Governance: Governments perform a census to collect population data, which is essential for planning and formulating public policies.

  • Sports: Coaching staff analyze the previous performances of opponent teams to develop winning strategies.

  • Banking: Banks maintain data regarding customer accounts and transactions to manage withdrawals and deposits accurately.

  • Business: Companies monitor market behavior, identify customer demands through feedback, and use sales data to decide on promotions like "happy hours".


The ICT Revolution: The revolution in Information and Communication Technology (ICT), driven by computers, mobile devices, and the Internet, has led to the generation of large volumes of data at a very fast pace. This includes everything from sensor signals and satellite data to social media posts and images.


2. Data Collection, Storage, and Organization


2.1 Data Collection

Before data can be processed, it must be gathered from appropriate sources. Data collection involves identifying already available data or collecting new data from sources like grocery store sales, hospital patient records, or social media platforms. For example, a political analyst might collect data from social media posts to gauge public opinion before an election.


2.2 Data Storage

Once data is collected, it is stored on digital devices for future use. While the high rate of data generation makes storage a challenge, the decreasing cost of storage devices has simplified the task.


  • Common Storage Devices: Hard Disk Drives (HDD), Solid State Drives (SSD), CD/DVD, Pen Drives, and Memory Cards.

  • Data Files: Data like images and documents are stored as files. While files are useful, complex data management often requires a Database Management System (DBMS) to overcome the limitations of simple file processing.


2.3 Data Organization: Types of Data


Data comes in different formats depending on its source. It is broadly classified into two categories:

A. Structured Data: Data that is organized and recorded in a well-defined format is called structured data.


  • Tabular Format: It is usually stored in rows and columns.

  • Attributes: Each column represents a different parameter (attribute/characteristic).

  • Observations: Each row represents a single observation.

  • Example: An inventory list of kitchen items with columns for ModelNo, ProductName, UnitPrice, and Quantity.
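As a minimal sketch of the row-and-column idea, the inventory example can be modelled in Python as a list of dictionaries, one dictionary per row. Only the column names come from the example above; the three rows and all their values are made up for illustration.

```python
# Structured data: every row (observation) has the same attributes (columns).
# The column names follow the inventory example; the values are hypothetical.
inventory = [
    {"ModelNo": "K-101", "ProductName": "Pressure Cooker", "UnitPrice": 1500, "Quantity": 4},
    {"ModelNo": "K-102", "ProductName": "Frying Pan", "UnitPrice": 800, "Quantity": 10},
    {"ModelNo": "K-103", "ProductName": "Knife Set", "UnitPrice": 650, "Quantity": 6},
]

# The attributes are the keys shared by every row.
columns = list(inventory[0].keys())

# Reading one column means picking the same attribute from each row.
unit_prices = [row["UnitPrice"] for row in inventory]

# A fixed structure makes calculations across rows straightforward.
stock_value = sum(row["UnitPrice"] * row["Quantity"] for row in inventory)

print(columns)
print(unit_prices)
print(stock_value)
```

Because every row has the same attributes, any column can be read or summarized with one short expression, which is what makes structured data easy to process.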

B. Unstructured Data: Data that does not follow a traditional row-and-column structure is called unstructured data.

  • Examples: Newspaper articles (which change layout daily), the content of an email, web pages with multimedia, and social media messages.

  • Metadata: Unstructured data is often described using "data about data," known as metadata. For instance, an image file’s metadata might include its size, type (JPEG/PNG), and resolution.
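The idea of metadata can be illustrated with a short Python sketch that creates a small text file and then reads facts about the file rather than its content. The file name and text are hypothetical, and only the size and type are shown; reading an image's resolution would need an imaging library, which is not covered here.

```python
import tempfile
from pathlib import Path

# Create a small file with free-form (unstructured) content in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    note = Path(folder) / "note.txt"  # hypothetical file name
    note.write_text("Unstructured free-form text.", encoding="utf-8")

    # Metadata describes the file; it is not the content of the file.
    info = note.stat()
    metadata = {
        "name": note.name,
        "type": note.suffix,
        "size_bytes": info.st_size,
    }

print(metadata)
```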


3. The Data Processing Cycle

Data in its raw form (unorganized facts) is not information. It must be processed to generate a meaningful result.


Steps in the Data Processing Cycle:

  1. Input:

    • Data Collection: Gathering raw facts.

    • Data Preparation: Organizing data for entry.

    • Data Entry: Feeding data into the system.

  2. Processing:

    • Store/Retrieve: Managing data in memory.

    • Classify/Update: Organizing data into categories or modifying existing records.

  3. Output:

    • Reports/Results: Generating the final information in the form of tables, charts, or text.


Automated Examples:

  • ATM Withdrawal: The system checks the PIN and balance, deducts the amount, and prints a receipt.

  • Exam Registration: A website processes student details, verifies eligibility, generates a roll number, and issues an admit card.
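The ATM example can be sketched in Python to show the input, processing, and output stages in one place. This is a simplified illustration, not how a real banking system works: the function name withdraw, the account number, the PIN, and the balance are all hypothetical.

```python
def withdraw(accounts, account_no, pin, amount):
    """Check the PIN and balance, deduct the amount, and return a receipt."""
    # Input: account number, PIN, and amount entered by the customer.
    account = accounts.get(account_no)
    if account is None or account["pin"] != pin:
        raise ValueError("invalid account number or PIN")
    if amount <= 0 or amount > account["balance"]:
        raise ValueError("invalid amount or insufficient balance")

    # Processing: retrieve the stored balance and update the record.
    account["balance"] -= amount

    # Output: the information handed back to the customer.
    return f"Withdrawn: {amount} | Balance: {account['balance']}"


# Hypothetical stored data for a single customer.
accounts = {"1001": {"pin": "4321", "balance": 5000}}

receipt = withdraw(accounts, "1001", "4321", 1200)
print(receipt)  # Withdrawn: 1200 | Balance: 3800
```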


4. Understanding Data using Statistical Methods


Statistical techniques are used to summarize tabular data for easier comprehension. These are divided into Measures of Central Tendency and Measures of Variability.


4.1 Measures of Central Tendency

These provide a single value that gives an idea about the "center" or "average" of the data set.


A. Mean (Average):


  • Definition: The sum of all values divided by the total number of values.


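  • Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ are the individual values and $n$ is the number of values.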


  • Example: For a list of 9 student heights, the mean is the sum of the heights divided by 9.


  • Sensitivity: Mean is not suitable if there are outliers (exceptionally large or small values) as they can heavily influence the average.
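The calculation and the effect of an outlier can be sketched in Python. The list of heights below is an assumption: it was chosen to agree with the figures quoted in these notes (9 values, median 102 cm, mode 110 cm, minimum 85 cm, maximum 115 cm) and is not necessarily the original list. The extra value of 250 cm is an invented outlier.

```python
# Assumed heights in cm (9 values); see the note above about where they come from.
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

# Mean = sum of all values divided by the number of values.
mean_height = sum(heights) / len(heights)
print(round(mean_height, 2))  # 101.33

# One exceptionally large (invented) value pulls the mean upwards.
heights_with_outlier = heights + [250]
mean_with_outlier = sum(heights_with_outlier) / len(heights_with_outlier)
print(mean_with_outlier)  # 116.2
```

A single extreme value moves the mean by almost 15 cm here, which is why the mean is a poor summary when outliers are present.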


B. Median:

  • Definition: The middle value when data is sorted in ascending or descending order.

  • Calculation:

    • If n is odd, the median is the value at the middle position.

    • If n is even, the median is the average of the two middle values.

  • Example: When the 9 heights are sorted, the value at the 5th position (102 cm) is the median.

  • Advantage: It represents the actual central value and is less sensitive to outliers than the mean.
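The odd and even cases can be sketched as a small Python function. The function name median_of is a placeholder, and the heights are assumed values chosen to match the figures quoted in these notes, not necessarily the original list.

```python
def median_of(values):
    """Return the middle value of the sorted data (average of two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single value at the middle position.
        return ordered[mid]
    # Even n: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Assumed heights in cm (9 values).
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

print(median_of(heights))          # 102 (5th of 9 sorted values)
print(median_of(heights[:8]))      # 101.0 (average of 100 and 102)
print(median_of(heights + [250]))  # 106.0, a small shift despite the outlier
```

Python's standard library offers statistics.median, which handles the odd and even cases in the same way.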


C. Mode:

  • Definition: The value that appears most frequently in the data set.

  • Properties: A data set can have no mode (if all values appear once) or multiple modes. It can be used for both numeric and non-numeric data.

  • Example: In the height list, 110 appears three times, so it is the mode.
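A short Python sketch can cover all three situations: one mode, several modes, and no mode. The function name modes is a placeholder; the heights are assumed values chosen to match the figures in these notes, and the car colours are invented to show non-numeric data.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value appears exactly once, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]       # assumed, in cm
colours = ["white", "red", "white", "blue", "red", "white"]  # invented

print(modes(heights))       # [110]
print(modes(colours))       # ['white']
print(modes([1, 1, 2, 2]))  # [1, 2]
print(modes([1, 2, 3]))     # []
```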


4.2 Measures of Variability (Dispersion)


These measures refer to the spread or variation of the values around the mean. Two data sets can have the same mean but very different levels of dispersion.


A. Range:

  • Definition: The difference between the maximum and minimum values.

  • Formula: Range = Maximum − Minimum.

  • Example: Maximum height (115 cm) − minimum height (85 cm) = 30 cm.

  • Limitation: Like the mean, the range is strongly influenced by outliers because it only considers the two extreme values.
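The range and its sensitivity to a single extreme value can be sketched in Python. The heights are assumed values chosen to match the maximum and minimum quoted above, and the value of 250 cm is an invented outlier.

```python
# Assumed heights in cm (9 values).
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

# Range = maximum value - minimum value.
height_range = max(heights) - min(heights)
print(height_range)  # 30

# Adding one invented outlier changes the range completely.
heights_with_outlier = heights + [250]
range_with_outlier = max(heights_with_outlier) - min(heights_with_outlier)
print(range_with_outlier)  # 165
```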


B. Standard Deviation (σ):

  • Definition: The positive square root of the average of the squared differences of each value from the mean.

  • Interpretation: A smaller standard deviation means data points are close to the mean (less spread); a larger value means data is more spread out.

  • Calculation:


    1. Calculate the mean ($\bar{x}$).

    2. Subtract the mean from each value ($x - \bar{x}$).

    3. Square each result, $(x - \bar{x})^2$.

    4. Find the sum of these squares.

    5. Divide by the number of values ($n$).

    6. Take the square root of the result.
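The six steps above can be followed one by one in Python. This is a sketch under two assumptions: the heights are assumed values chosen to match the figures in these notes, and the function name standard_deviation is a placeholder. Because the steps divide by n, this is the population standard deviation, which is what statistics.pstdev computes; the sample version (statistics.stdev) divides by n − 1 instead.

```python
import math
from statistics import pstdev


def standard_deviation(values):
    """Population standard deviation, computed step by step."""
    n = len(values)
    mean = sum(values) / n                        # Step 1: mean
    differences = [x - mean for x in values]      # Step 2: value minus mean
    squares = [d ** 2 for d in differences]       # Step 3: square each result
    total = sum(squares)                          # Step 4: sum of the squares
    average_of_squares = total / n                # Step 5: divide by n
    return math.sqrt(average_of_squares)          # Step 6: square root


# Assumed heights in cm (9 values).
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]

sigma = standard_deviation(heights)
print(round(sigma, 2))  # 10.21

# Cross-check against the standard library's population standard deviation.
print(math.isclose(sigma, pstdev(heights)))  # True
```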


C. Variance: The variance is the average of the squared differences of each value from the mean. It is the value that is square-rooted to obtain the standard deviation. Mathematically, Variance = $\sigma^2$.


5. Data Interpretation and Application


Understanding which statistical technique to apply is crucial for correct interpretation.

Problem Statement                           Suitable Statistical Method
------------------------------------------  ----------------------------------------------
Disparity in salaries among employees       Standard Deviation (measures spread/disparity)
Average performance of a class in a test    Mean
Finding the dominant/most frequent value    Mode
Comparing heights/incomes of two cities     Mean or Median
Determining a popular car color             Mode (works for non-numeric data)


Data Interpretation in Practice: A raw list of 2000 salary packages tells us very little on its own. A placement cell must process and analyze this data to present summaries and visuals in a brochure so that prospective students can arrive at a conclusion.


Similarly, pharmaceutical companies record and interpret data during clinical trials to determine a medicine's effectiveness.