Imagine you’re teaching your little cousin to recognize fruits. You show them a red round fruit and say, “This is an apple.” Then you show a yellow long one and say, “This is a banana.” After seeing a few more apples and bananas, the next time you hand them a fruit they’ve never seen, they guess, “Is this an apple?” based on what they learned.
Now imagine doing the same thing with a computer: showing it lots of pictures of apples and bananas, and letting it learn the difference on its own. That’s Machine Learning! It’s like teaching a computer by example, so it can start recognizing patterns and making smart guesses, just like your cousin did.
Machine Learning is a way of teaching computers to learn from data and examples, instead of giving them step-by-step instructions.
In simple terms, just like humans learn by practicing and seeing examples (like learning to ride a cycle or solve a math problem), machines can also learn from experience—which in this case is lots of data. Once the computer has learned enough patterns from the data, it can start making decisions, predictions, or recognizing things on its own.
Remember your cousin learning fruits? You showed them many apples and bananas, and they slowly figured out how to tell them apart. But what exactly helped them learn?
It was the examples you gave — their size, color, shape, and the correct names (labels).
Now, imagine doing this with a computer. You don’t “explain” the logic of what makes an apple an apple. Instead, you feed the computer a lot of data that includes the features of each fruit (size, color, shape) and the correct names (labels), and the computer learns patterns from this data.
The clearer, more accurate, and better organized the data, the easier it is for the machine to learn. If the data is messy or wrong (e.g., a banana labeled as an apple), the machine gets confused, just like your cousin would!
In machine learning, we can understand the learning process through a simple analogy. Imagine the machine as a student trying to learn a concept. Instead of having a teacher explain everything step by step, this student learns by reading a notebook full of examples. In this analogy, the machine is the student, the data is the notebook, and the algorithm is the method the student uses to study. The machine does not receive direct instructions or rules. Instead, it studies the data, looks at many examples, and gradually learns patterns and relationships on its own. This is how machines learn to make predictions or decisions — by identifying patterns in the data they are given.
In Machine Learning, data is the key ingredient. The type and quality of data used directly impact how well a model learns and performs. Based on the presence or absence of labels (answers), machine learning data is generally divided into the following types:
Labeled data contains both the input and the correct output (label). Each example in the dataset tells the machine what the correct answer should be. This type of data is used in Supervised Learning, where the goal is to learn a mapping from inputs to outputs. For example, a dataset of fruit features (color, shape, size) along with their names (apple, banana) is labeled data.
Each row represents one labeled data point!
| Color | Shape | Size | Fruit Name (Label) |
|---|---|---|---|
| Red | Round | Small | Apple |
| Yellow | Long | Medium | Banana |
| Green | Round | Small | Apple |
| Yellow | Round | Small | Lemon |
In this table, the machine sees the features of each fruit (color, shape, size) along with the correct name (label). Using this data, it learns how to predict the name of a new fruit just by looking at its features.
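To make the idea concrete, here is a minimal sketch of learning from the labeled table above. It is not a real machine learning algorithm from any library. It simply predicts the label of the stored example that shares the most features with the new fruit, a very rough version of the "nearest neighbor" idea. The names `labeled_fruits`, `count_matches`, and `predict_fruit` are made up for this illustration.

```python
# Labeled data: each example pairs input features with the correct label.
labeled_fruits = [
    (("Red", "Round", "Small"), "Apple"),
    (("Yellow", "Long", "Medium"), "Banana"),
    (("Green", "Round", "Small"), "Apple"),
    (("Yellow", "Round", "Small"), "Lemon"),
]


def count_matches(features_a, features_b):
    """Count how many feature positions (color, shape, size) are identical."""
    return sum(a == b for a, b in zip(features_a, features_b))


def predict_fruit(new_features, examples):
    """Return the label of the stored example that shares the most features.

    Ties go to the earliest example in the list (hypothetical rule for this sketch).
    """
    best_example = max(
        examples, key=lambda example: count_matches(new_features, example[0])
    )
    return best_example[1]


# A fruit that is not in the table: red, round, and medium-sized.
print(predict_fruit(("Red", "Round", "Medium"), labeled_fruits))
```

The new fruit shares two features (red, round) with the first apple and only one with every other row, so the sketch guesses "Apple". Real models learn far more flexible patterns, but the input-to-label idea is the same.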
Unlabeled data contains only input features, without any correct output or label. This type of data is used in unsupervised learning, where the machine explores the data and tries to find patterns or group similar items on its own — without being told what the correct answer is.
Each row represents one unlabeled data point!
| Color | Shape | Size |
|---|---|---|
| Red | Round | Small |
| Yellow | Long | Medium |
| Green | Round | Small |
| Yellow | Round | Small |
In this table, the machine sees only the features of each fruit — such as color, shape, and size — but not the actual name of the fruit. Since the labels are missing, the machine cannot directly learn which feature set belongs to which fruit. Instead, it tries to group similar rows together based on shared characteristics. For example, it might notice that some fruits are round and small, while others are long and medium-sized. By identifying these patterns, the machine can organize the data into meaningful clusters, even without knowing the actual fruit names.
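The grouping described above can be sketched in a few lines. This is only an illustration: it groups rows that have exactly the same shape and size, whereas real clustering algorithms (such as k-means) measure similarity numerically and decide the groups themselves. The names `unlabeled_fruits` and `group_by_shape_and_size` are made up for this example.

```python
from collections import defaultdict

# Unlabeled data: only features (color, shape, size), no fruit names.
unlabeled_fruits = [
    ("Red", "Round", "Small"),
    ("Yellow", "Long", "Medium"),
    ("Green", "Round", "Small"),
    ("Yellow", "Round", "Small"),
]


def group_by_shape_and_size(rows):
    """Group rows that share the same shape and size, ignoring color."""
    groups = defaultdict(list)
    for color, shape, size in rows:
        groups[(shape, size)].append((color, shape, size))
    return dict(groups)


# Print each group and the rows that fell into it.
for key, members in group_by_shape_and_size(unlabeled_fruits).items():
    print(key, "->", members)
```

Notice that the round, small group mixes what the labeled table called apples and a lemon. Without labels, the machine can find groups, but it cannot know what the groups are called or whether they match the names we have in mind.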
A dataset is a structured collection of data that is used to train, test, or evaluate a machine learning model. You can think of it as a digital notebook filled with many examples, where each example (called a record or row) contains information in the form of features (also called columns or attributes).
Just like a student studies from a textbook full of examples, a machine learning model learns from a dataset. The better and clearer the dataset, the more accurate the learning.
| Aspect | Data | Dataset |
|---|---|---|
| Definition | A single piece or unit of information | A structured collection of related data |
| Structure | May or may not be structured | Usually structured (like tables: rows & columns) |
| Usage | Used as basic input or value | Used for training, testing, and evaluating models |
| Size | Single value or a few values | Collection of many records |
| Example | "Red", 42, 3.14 | Table of fruit details or student marks |
| Analogy | A single word | A full paragraph or a page of notes |
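
As a small illustration of the difference, the sketch below stores one piece of data as a plain value and a dataset as a list of records that all share the same features. The variable names and the dictionary layout are just one possible way to represent this and are assumptions for this example.

```python
# Data: a single value on its own.
single_value = "Red"

# Dataset: many records (rows), each with the same features (columns).
fruit_dataset = [
    {"color": "Red", "shape": "Round", "size": "Small"},
    {"color": "Yellow", "shape": "Long", "size": "Medium"},
    {"color": "Green", "shape": "Round", "size": "Small"},
    {"color": "Yellow", "shape": "Round", "size": "Small"},
]

# The features are the keys shared by every record.
feature_names = list(fruit_dataset[0].keys())
print(len(fruit_dataset), "records with features", feature_names)
```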
In machine learning, before a model is trained, the available dataset is usually divided into three main parts: training, validation, and testing datasets. But why split the dataset at all? Why not just work with one single dataset? The answer, in my opinion, is that this split is essential for building a model that not only learns patterns from the data but also generalizes well to new, unseen data.
Let's have a look at each of the three parts!
The training dataset is the largest portion, typically around 70–80% of the total data, and is used to teach the model by showing it many input-output examples so that the model can find patterns and relationships in them.
The validation dataset, often around 10–15%, is used during training to evaluate the model’s performance and to help tune hyperparameters (like learning rate, depth, etc.) or catch issues like overfitting.
Finally, the testing dataset, usually the remaining 10–15%, is used after training is complete to assess the final accuracy of the model on data it has never seen before.
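The three-way split described above can be sketched with the Python standard library alone. This is a minimal example, assuming a 70/15/15 split and a fixed random seed so the result is repeatable. The function name `split_dataset` and its parameters are made up for this illustration, and in practice many people use a library helper instead.

```python
import random


def split_dataset(records, train_fraction=0.7, validation_fraction=0.15, seed=42):
    """Shuffle records and split them into training, validation, and test lists.

    Whatever remains after the training and validation portions goes to test.
    """
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)

    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)

    train = shuffled[:train_end]
    validation = shuffled[train_end:validation_end]
    test = shuffled[validation_end:]
    return train, validation, test


records = list(range(100))  # stand-in for 100 dataset rows
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

Shuffling before splitting matters: if the rows were sorted (say, all apples first), the test portion could end up containing only one kind of fruit and would give a misleading picture of how well the model generalizes.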