Introduction to Machine Learning

Imagine you’re teaching your little cousin to recognize fruits. You show them a red round fruit and say, “This is an apple.” Then you show a yellow long one and say, “This is a banana.” After seeing a few more apples and bananas, the next time you hand them a fruit they’ve never seen, they guess, “Is this an apple?” based on what they learned.

Now imagine doing the same thing with a computer: showing it lots of pictures of apples and bananas, and letting it learn the difference on its own. That’s Machine Learning! It’s like teaching a computer by example, so it can start recognizing patterns and making smart guesses, just like your cousin did.

What is Machine Learning?

Machine Learning is a way of teaching computers to learn from data and examples, instead of giving them step-by-step instructions.

In simple terms, just like humans learn by practicing and seeing examples (like learning to ride a cycle or solve a math problem), machines can also learn from experience—which in this case is lots of data. Once the computer has learned enough patterns from the data, it can start making decisions, predictions, or recognizing things on its own.

If we give a computer thousands of photos of cats and dogs, and tell it which ones are cats and which ones are dogs, it will start noticing patterns (like ears, fur, size). Later, if we show it a new photo, it can guess if it’s a cat or a dog—based on what it has learned earlier.

Key Idea

Learning with Data

Remember your cousin learning fruits? You showed them many apples and bananas, and they slowly figured out how to tell them apart. But what exactly helped them learn?

It was the examples you gave — their size, color, shape, and the correct names (labels).

Now, imagine doing this with a computer. You don’t “explain” the logic of what makes an apple an apple. Instead, you feed the computer a lot of data that includes the features of each example (size, color, shape) and the correct name (label) for each one.

The computer then learns patterns from this data.

The clearer, more accurate, and better organized the data, the easier it is for the machine to learn. If the data is messy or mislabeled (e.g., a banana labeled as an apple), the machine gets confused, just like your cousin would!

Data is the foundation of machine learning. It serves as the source from which machines learn patterns, relationships, and insights. In machine learning, data is generally categorized into two main types: labeled and unlabeled data.

Analogy

In machine learning, we can understand the learning process through a simple analogy. Imagine the machine as a student trying to learn a concept. Instead of having a teacher explain everything step by step, this student learns by reading a notebook full of examples. In this analogy, the machine is the student, the data is the notebook, and the algorithm is the method the student uses to study. The machine does not receive direct instructions or rules. Instead, it studies the data, looks at many examples, and gradually learns patterns and relationships on its own. This is how machines learn to make predictions or decisions — by identifying patterns in the data they are given.

Importance of Data

Types of Data

In Machine Learning, data is the key ingredient. The type and quality of data used directly impact how well a model learns and performs. Based on the presence or absence of labels (answers), machine learning data is generally divided into the following types:

Labeled Data

Labeled data contains both the input and the correct output (label). Each example in the dataset tells the machine what the correct answer should be. This type of data is used in Supervised Learning, where the goal is to learn a mapping from inputs to outputs. For example, a dataset of fruit features (color, shape, size) along with their names (apple, banana) is labeled data.

We can understand this with a simple analogy. Imagine a child learning to recognize fruits. A parent shows the child different fruits and says, “This is an apple,” or “This is a banana.” Over time, the child learns to associate certain shapes, colors, and sizes with the correct fruit names. Here, the fruit's features are the input, and the fruit's name is the label. The child is essentially learning from labeled data.

A similar process happens in machine learning. The machine is shown many examples, each with the correct label, and it learns to make predictions based on that.

Sample Labeled Data

Each row represents 1 labeled data point!

Color  | Shape | Size   | Fruit Name (Label)
Red    | Round | Small  | Apple
Yellow | Long  | Medium | Banana
Green  | Round | Small  | Apple
Yellow | Round | Small  | Lemon

In this table, the machine sees the features of each fruit (color, shape, size) along with the correct name (label). Using this data, it learns how to predict the name of a new fruit just by looking at its features.
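
The idea of predicting a new fruit's name from labeled examples can be sketched in a few lines of Python. This is only an illustrative sketch, not a production method: it assumes a very simple "nearest neighbor" rule (pick the label of the most similar known fruit), and the names `labeled_data`, `mismatches`, and `predict` are made up for this example.

```python
# Labeled examples from the table above: (color, shape, size) -> fruit name.
labeled_data = [
    (("Red", "Round", "Small"), "Apple"),
    (("Yellow", "Long", "Medium"), "Banana"),
    (("Green", "Round", "Small"), "Apple"),
    (("Yellow", "Round", "Small"), "Lemon"),
]


def mismatches(a, b):
    """Count how many features differ between two fruits."""
    return sum(1 for x, y in zip(a, b) if x != y)


def predict(features):
    """Return the label of the most similar labeled example (1-nearest neighbor)."""
    _, label = min(labeled_data, key=lambda row: mismatches(row[0], features))
    return label


# Two fruits the "machine" has never seen before.
print(predict(("Yellow", "Long", "Large")))   # closest to the banana row
print(predict(("Red", "Round", "Medium")))    # closest to the red apple row
```

Real supervised learning algorithms are more sophisticated, but the principle is the same: the labels in the training examples are what let the machine name a new input.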

Unlabeled Data

Unlabeled data contains only input features, without any correct output or label. This type of data is used in unsupervised learning, where the machine explores the data and tries to find patterns or group similar items on its own — without being told what the correct answer is.

To understand this better, imagine giving a child a basket full of different fruits, but without telling them the names. The child may try to group similar-looking fruits together — putting round red ones in one pile, long yellow ones in another. Even though they don’t know the names, they are recognizing patterns. This is similar to how a machine works with unlabeled data.

In machine learning, when we remove the label column from a dataset, the machine is left with only the features. It then applies algorithms like clustering to group or analyze the data.

Sample Unlabeled Data

Each row represents 1 unlabeled data point!

Color  | Shape | Size
Red    | Round | Small
Yellow | Long  | Medium
Green  | Round | Small
Yellow | Round | Small

In this table, the machine sees only the features of each fruit — such as color, shape, and size — but not the actual name of the fruit. Since the labels are missing, the machine cannot directly learn which feature set belongs to which fruit. Instead, it tries to group similar rows together based on shared characteristics. For example, it might notice that some fruits are round and small, while others are long and medium-sized. By identifying these patterns, the machine can organize the data into meaningful clusters, even without knowing the actual fruit names.
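
The grouping described above can be imitated with a short Python sketch. It is a deliberately simplified stand-in for clustering: it removes the label column and then groups rows that share exactly the same shape and size, whereas real clustering algorithms measure similarity in more flexible ways. The helper name `group_by` and the choice of grouping features are assumptions made for this illustration.

```python
from collections import defaultdict

# The labeled fruit table, one dictionary per row.
labeled_rows = [
    {"color": "Red", "shape": "Round", "size": "Small", "fruit": "Apple"},
    {"color": "Yellow", "shape": "Long", "size": "Medium", "fruit": "Banana"},
    {"color": "Green", "shape": "Round", "size": "Small", "fruit": "Apple"},
    {"color": "Yellow", "shape": "Round", "size": "Small", "fruit": "Lemon"},
]

# Removing the label column leaves only the features (unlabeled data).
unlabeled_rows = [
    {key: value for key, value in row.items() if key != "fruit"}
    for row in labeled_rows
]


def group_by(rows, keys):
    """Group rows that have identical values for the given feature names."""
    groups = defaultdict(list)
    for row in rows:
        signature = tuple(row[key] for key in keys)
        groups[signature].append(row)
    return dict(groups)


groups = group_by(unlabeled_rows, ["shape", "size"])
for signature, members in groups.items():
    print(signature, "->", len(members), "fruit(s)")
```

The round, small fruits end up in one group and the long, medium-sized fruit in another, even though no fruit names were used.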

Dataset

A dataset is a structured collection of data that is used to train, test, or evaluate a machine learning model. You can think of it as a digital notebook filled with many examples, where each example (called a record or row) contains information in the form of features (also called columns or attributes).

Just like a student studies from a textbook full of examples, a machine learning model learns from a dataset. The better and clearer the dataset, the more accurate the learning.
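
As a rough illustration, a small dataset can be written in Python as a list of records, where each record is one row and each key is one column. The variable names here (`dataset`, `columns`, `colors`) are chosen only for this sketch; real projects usually load datasets from files or use a table library.

```python
# A tiny dataset: each dictionary is one record (row),
# and each key is one column (attribute).
dataset = [
    {"color": "Red", "shape": "Round", "size": "Small", "fruit": "Apple"},
    {"color": "Yellow", "shape": "Long", "size": "Medium", "fruit": "Banana"},
    {"color": "Green", "shape": "Round", "size": "Small", "fruit": "Apple"},
    {"color": "Yellow", "shape": "Round", "size": "Small", "fruit": "Lemon"},
]

# The column names are the keys of any record.
columns = list(dataset[0].keys())

# Reading one column means collecting that key from every record.
colors = [record["color"] for record in dataset]

print("Number of records:", len(dataset))
print("Columns:", columns)
print("Color column:", colors)
```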

Difference between Data and Dataset

Aspect | Data | Dataset
Definition | A single piece or unit of information | A structured collection of related data
Structure | May or may not be structured | Usually structured (like tables: rows & columns)
Usage | Used as basic input or value | Used for training, testing, and evaluating models
Size | Single value or a few values | Collection of many records
Example | "Red", 42, 3.14 | Table of fruit details or student marks
Analogy | A single word | A full paragraph or a page of notes

Think of data as a single word in a book: it holds meaning, but on its own, it doesn’t tell you much. Now imagine a dataset as a full paragraph or a page filled with many words arranged meaningfully. Just like a paragraph gives you a complete idea or story, a dataset organizes many individual pieces of data into a structure that can be understood and analyzed.

Similarly, in a real-world example, one student’s score in Math (say, 85) is a piece of data. But a full table showing the marks of all students in all subjects is a dataset, rich with information, patterns, and insights that can be used for analysis or training machine learning models.

Structure of Dataset

Types of Dataset

In machine learning, before a model is trained, the available dataset is usually divided into three parts: training, validation, and testing datasets. But why split the data when we could work with one single dataset? Because this split is essential for building a model that not only learns patterns from the data but also generalizes well to new, unseen data.

Let's have a look at each of the three parts.

Training Dataset

The training dataset is the largest portion, typically around 70–80% of the total data. It is used to teach the model by showing it many input-output examples so that the model can learn the patterns and relationships in them.

Validation Dataset

The validation dataset, often around 10–15%, is used during training to check the model’s performance, tune hyperparameters (like learning rate, depth, etc.), and catch issues like overfitting.

Testing Dataset

Finally, the testing dataset, usually the remaining 10–15%, is used after training is complete to assess the final accuracy of the model on data it has never seen before.

Think of building a machine learning model like preparing for an exam. The training dataset is like your textbook and class notes — it’s what you use to study and learn the concepts. The validation dataset is like your mock tests or practice papers — they help you check how well you're learning and what needs to be improved. Finally, the testing dataset is like your actual exam — it evaluates your final performance on questions you've never seen before. Just like a student shouldn’t look at the real exam paper while preparing, a model shouldn’t see the test data during training. This separation ensures that the model truly understands the patterns and doesn't just memorize examples.
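
The three-way split described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function name `split_dataset`, the 70/15/15 proportions, and the fixed random seed are choices made for this example, and the numbers 0 to 99 stand in for real records.

```python
import random


def split_dataset(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the records and split them into training, validation, and test parts."""
    shuffled = list(records)                # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is held back for testing
    return train, validation, test


# 100 placeholder records standing in for real examples.
records = list(range(100))
train, validation, test = split_dataset(records)

print("Training records:  ", len(train))
print("Validation records:", len(validation))
print("Testing records:   ", len(test))
```

Shuffling before splitting keeps each part representative of the whole dataset, and because the three slices never overlap, the test records stay unseen until the final evaluation.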