Section 1 - Instruction

Earlier in this course, you practiced data preparation techniques like handling missing values, normalization, and standardization—even before learning about machine learning.

Now, let's connect those skills to real ML workflows using tensors. Every machine learning model relies on well-prepared tensor data, which carries your information from raw input all the way to predictions.

Engagement Message

Do these concepts sound familiar?

Section 2 - Instruction

Batching is the first crucial step. Instead of processing one sample at a time, we group multiple examples into batches for efficient computation.

A single image might have shape (224, 224, 3), but a batch of 32 such images has shape (32, 224, 224, 3). By convention, the first dimension is the batch size.
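As a quick sketch (using random placeholder data, not real images), the shapes look like this:

```python
import torch

# One RGB image: height 224, width 224, 3 color channels
image = torch.rand(224, 224, 3)

# A batch of 32 such images: batch dimension comes first
batch = torch.rand(32, 224, 224, 3)

print(image.shape)  # torch.Size([224, 224, 3])
print(batch.shape)  # torch.Size([32, 224, 224, 3])
```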

Engagement Message

Why do you think processing batches is more efficient than individual samples?

Section 3 - Instruction

Let's see batching in action with a simple dataset:
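Here is one possible version with made-up sample values:

```python
import torch

# Three individual samples, each with 4 features (hypothetical values)
sample_1 = torch.tensor([1.0, 2.0, 3.0, 4.0])
sample_2 = torch.tensor([5.0, 6.0, 7.0, 8.0])
sample_3 = torch.tensor([9.0, 10.0, 11.0, 12.0])

# torch.stack adds a new batch dimension at position 0
batch = torch.stack([sample_1, sample_2, sample_3])

print(batch.shape)  # torch.Size([3, 4])
```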

torch.stack() creates a new dimension for the batch. This is how you combine individual tensors into training batches.

Engagement Message

What error would you expect if you try to stack samples of different shapes?

Section 4 - Instruction

Feature normalization standardizes your data so all features have similar scales. Without it, features with large values (like income) can dominate smaller ones (like age).

This prevents bias toward high-magnitude features during model training.
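A minimal sketch of standardization (zero mean, unit variance per feature), using hypothetical income and age columns:

```python
import torch

# Hypothetical dataset: column 0 = income, column 1 = age
data = torch.tensor([[55000.0, 25.0],
                     [72000.0, 40.0],
                     [48000.0, 33.0]])

# Standardize each feature column: subtract the mean, divide by the std
mean = data.mean(dim=0)
std = data.std(dim=0)
normalized = (data - mean) / std

# After standardization, each column has mean ~0 and std ~1
print(normalized.mean(dim=0))
```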

Engagement Message

Can you think of two features that would have very different scales?

Section 5 - Instruction

Another common normalization is min-max scaling, which squashes values between 0 and 1:

This is perfect when you know your data's natural bounds.
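For example, with hypothetical exam scores whose bounds are known:

```python
import torch

# Hypothetical exam scores, known to range from 0 to 100
scores = torch.tensor([42.0, 75.0, 100.0, 0.0, 63.0])

# Min-max scaling: map the smallest value to 0 and the largest to 1
scaled = (scores - scores.min()) / (scores.max() - scores.min())

print(scaled)  # tensor([0.4200, 0.7500, 1.0000, 0.0000, 0.6300])
```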

Engagement Message

When might 0-1 scaling be better than mean-centered normalization?

Section 6 - Instruction

Handling missing data is crucial in real datasets. Common strategies include filling with means, medians, or learned values.

Missing data can break tensor operations, so preprocessing is essential before model input.
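One common strategy, sketched with a hypothetical feature vector where NaN marks missing entries, is mean filling:

```python
import torch

# Hypothetical feature vector; NaN marks missing entries
values = torch.tensor([3.0, float('nan'), 5.0, float('nan'), 4.0])

# Fill missing entries with the mean of the observed values
mask = torch.isnan(values)
mean = values[~mask].mean()           # mean of 3, 5, 4 -> 4.0
filled = torch.where(mask, mean, values)

print(filled)  # tensor([3., 4., 5., 4., 4.])
```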

Engagement Message

What problems could arise if you ignored missing values?

Section 7 - Instruction

One-hot encoding converts categories into tensors. The category "dog" in classes ["cat", "dog", "bird"] becomes [0, 1, 0].

This lets models work with categorical data mathematically.
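The "dog" example above can be encoded with torch.nn.functional.one_hot:

```python
import torch
import torch.nn.functional as F

classes = ["cat", "dog", "bird"]
label = "dog"

# Convert the label to its integer index, then one-hot encode it
index = torch.tensor(classes.index(label))        # tensor(1)
one_hot = F.one_hot(index, num_classes=len(classes))

print(one_hot)  # tensor([0, 1, 0])
```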

Engagement Message

Why can't you directly feed text labels like "dog" into neural networks?

Section 8 - Instruction

Putting it all together: a typical preprocessing pipeline loads raw data, creates batches, normalizes features, handles missing values, and encodes categories.

Each step uses tensor operations you've learned. The output is clean, properly-shaped tensors ready for model training or inference.
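The whole pipeline can be sketched end to end. This is a minimal illustration with hypothetical income/age features and animal labels, not a production recipe:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw samples: [income, age]; NaN marks a missing value
raw = torch.tensor([[55000.0, 25.0],
                    [float('nan'), 40.0],
                    [48000.0, 33.0]])
labels = ["dog", "cat", "dog"]
classes = ["cat", "dog", "bird"]

# 1. Handle missing values: fill NaNs with each column's mean
col_mean = torch.nanmean(raw, dim=0)
features = torch.where(torch.isnan(raw), col_mean, raw)

# 2. Normalize features: zero mean, unit variance per column
features = (features - features.mean(dim=0)) / features.std(dim=0)

# 3. Encode categorical labels as one-hot vectors
indices = torch.tensor([classes.index(l) for l in labels])
targets = F.one_hot(indices, num_classes=len(classes))

print(features.shape, targets.shape)  # torch.Size([3, 2]) torch.Size([3, 3])
```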

Engagement Message

Which preprocessing step would hurt model performance most if it were skipped?

Section 9 - Practice

Type

Fill In The Blanks

Markdown With Blanks

Let's map preprocessing steps to their purposes! Fill in the blanks for this data preparation pipeline:

A machine learning pipeline: Load data → Create [[blank:batches]] → [[blank:normalize]] features → Handle missing values → [[blank:encode]] categories → Feed to model

Suggested Answers

  • batches
  • normalize
  • encode