Introduction

Welcome to the third lesson of Laying the Foundations for Code Translation with Haystack! So far, you've learned the basics of Haystack and built a simple code translation pipeline. Now we're ready to make our translator much more powerful by adding preprocessing routines. This key step helps our system handle messy, real-world inputs, which makes it more reliable and user-friendly.

In this lesson, you'll learn how to create a custom Haystack component that extracts clean code from mixed input and automatically detects the programming language. By the end, you'll see how these improvements make your code translator smarter and more robust.

Why Preprocessing Matters in Code Translation

Let's take a moment to understand why preprocessing is so important for code translation. In real scenarios, users rarely provide perfectly formatted code. Instead, you might see:

  • Code snippets mixed with explanations or natural language requests;
  • Code embedded in markdown or copied from documentation;
  • Unclear or missing information about the programming language.

If we try to translate such input directly, the results can be confusing or even incorrect. Preprocessing helps by:

  • Extracting just the code, ignoring any extra text;
  • Identifying the programming language automatically;
  • Ensuring the code is in a consistent format for translation.

By handling these challenges upfront, we make the translation process smoother and more reliable for everyone.

Designing a Custom Preprocessing Component

Haystack makes it easy to extend pipelines with your own logic using the @component decorator. Let's start by outlining a custom component that will handle our preprocessing tasks.

The component, which we'll call CodePreprocessor, is built from four pieces:

  • The @component decorator tells Haystack this is a reusable pipeline component.
  • We set up an LLM to help with code understanding.
  • The run method will take in raw text and output both the extracted code and the detected language.
  • The @component.output_types decorator specifies the names and types of the outputs produced by the run method, making it clear to Haystack (and to other developers) what data this component will return.

This structure lets us plug our preprocessor directly into any Haystack pipeline, making it easy to reuse and maintain.
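
The outline described above might look like the following minimal sketch. To keep it runnable without Haystack or an API key, it is written as plain Python: the comments mark where the @component and @component.output_types decorators would go, and the llm argument is a hypothetical stand-in (any function that maps a prompt string to a reply string) rather than a real Haystack generator. The prompt wording is also only an assumption.

```python
from typing import Callable, Dict


# In a Haystack project, this class would be decorated with @component.
class CodePreprocessor:
    """Extracts code from raw text and detects its programming language."""

    def __init__(self, llm: Callable[[str], str]):
        # `llm` is a hypothetical stand-in: any function mapping a prompt to a reply.
        self.llm = llm

    # In a Haystack project, run() would be decorated with
    # @component.output_types(code=str, language=str).
    def run(self, text: str) -> Dict[str, str]:
        code = self.llm(f"Extract ONLY the code from the following text:\n\n{text}")
        language = self.llm(
            "Name the programming language of this code. "
            f"Reply with the language name only, or 'unknown':\n\n{code}"
        )
        # The returned dict is keyed by the declared output names.
        return {"code": code.strip(), "language": language.strip().lower()}


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs offline (illustration only)."""
    return "Python" if prompt.startswith("Name") else "print('hi')"


result = CodePreprocessor(fake_llm).run("Please translate this: print('hi')")
```

Passing the LLM in through the constructor keeps the component easy to test: a canned function like fake_llm can replace the real model while you check the structure of the outputs.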

Extracting Code with an LLM

The first challenge is to reliably extract code from mixed or messy input. We'll use the LLM's strong language understanding to do this with a carefully crafted prompt.

This is prompt engineering in practice. By being explicit in the prompt (“extract ONLY the code”), we guide the LLM to ignore explanations and other surrounding text. The result is a clean markdown code block, ready for translation. This approach is especially useful when users paste code from tutorials, documentation, or chat conversations.
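
As a rough illustration of this step, the sketch below builds an extraction prompt and then unwraps the markdown code block from the LLM's reply. The names EXTRACTION_PROMPT, build_extraction_prompt, and unwrap_code_block are hypothetical helpers, and the prompt wording is an assumption apart from the "extract ONLY the code" instruction. No LLM is called here; the reply is passed in as a string.

```python
import re

# Assumed prompt wording; only the "extract ONLY the code" idea comes from the lesson.
EXTRACTION_PROMPT = (
    "Extract ONLY the code from the text below. "
    "Return it in a single markdown code block with no explanation.\n\n"
    "Text:\n{text}"
)

# Matches a fenced block: opening fence, optional language tag, body, closing fence.
_FENCE_RE = re.compile(r"```[\w+#.-]*[ \t]*\n(.*?)```", re.DOTALL)


def build_extraction_prompt(text: str) -> str:
    """Insert the user's raw text into the extraction prompt."""
    return EXTRACTION_PROMPT.format(text=text)


def unwrap_code_block(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if there is none."""
    match = _FENCE_RE.search(reply)
    body = match.group(1) if match else reply
    # Trim blank lines at the edges but keep the code's own indentation.
    return body.strip("\n")


reply = "Here is the code:\n```python\nprint('hi')\n```"
code = unwrap_code_block(reply)
```

Falling back to the whole reply when no fence is found is one possible design choice; a stricter component could instead treat a missing code block as an error.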

Inferring the Programming Language

Once we have the code, the next step is to figure out what language it's written in. Again, we'll use the LLM, but with a different prompt focused on language detection.

This method leverages the LLM's broad training on many programming languages. By asking for only the language name, we avoid extra text that could confuse downstream steps. The prompt also gives the LLM a way out: if it isn't sure, it returns “unknown” instead of guessing, which makes our pipeline more robust to ambiguous cases.
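
Even with a strict prompt, a reply can come back with stray punctuation or a full sentence. The sketch below shows one way to guard against that, assuming a hypothetical normalize_language helper and an illustrative (far from complete) KNOWN_LANGUAGES set. Neither comes from the lesson's own code.

```python
# Illustrative allow-list; a real project would choose its own set of languages.
KNOWN_LANGUAGES = {
    "python", "javascript", "typescript", "java",
    "c", "c++", "c#", "go", "rust", "ruby",
}


def build_detection_prompt(code: str) -> str:
    """Ask for the language name only, with 'unknown' as the fallback answer."""
    return (
        "Identify the programming language of the code below. "
        "Reply with the language name only, or 'unknown' if you are not sure.\n\n"
        f"{code}"
    )


def normalize_language(reply: str) -> str:
    """Map a raw LLM reply to a lowercase language name, or 'unknown'."""
    cleaned = reply.strip().strip(".`'\"").lower()
    return cleaned if cleaned in KNOWN_LANGUAGES else "unknown"


detected = normalize_language(" Python.\n")
```

Mapping anything unexpected to "unknown" means downstream steps only ever see a short, predictable value.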

Integrating Preprocessing into the Pipeline

With our preprocessor ready, let's see how it fits into the overall translation pipeline. The goal is to make the pipeline handle raw, unstructured input seamlessly.

By placing the preprocessor at the start, we ensure that every input — no matter how messy — gets cleaned and analyzed before translation. The pipeline now automatically extracts code and detects its language, passing both to the prompt builder for translation. This design makes the system much more flexible and user-friendly.

Notice that we never call the run method of CodePreprocessor ourselves. In Haystack, a component's run method is typically called automatically when the pipeline is executed: the pipeline manages the flow of data between components, so you don't need to call run directly.
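
To make this data flow concrete, here is a toy stand-in for a pipeline runner. MiniPipeline is hypothetical: it borrows the method names add_component, connect, and run from Haystack's Pipeline but is not Haystack's implementation, and the two stub components return canned values instead of calling an LLM. The point is only to show that the runner, not your code, calls each component's run method and forwards the outputs.

```python
from typing import Any, Dict, List, Tuple


class MiniPipeline:
    """Toy pipeline runner for illustration; not Haystack's implementation."""

    def __init__(self) -> None:
        self.components: Dict[str, Any] = {}
        self.connections: List[Tuple[str, str, str, str]] = []

    def add_component(self, name: str, instance: Any) -> None:
        self.components[name] = instance

    def connect(self, sender: str, receiver: str) -> None:
        # "preprocessor.code" -> component "preprocessor", output "code"
        src, out = sender.split(".")
        dst, inp = receiver.split(".")
        self.connections.append((src, out, dst, inp))

    def run(self, data: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
        inputs = {name: dict(data.get(name, {})) for name in self.components}
        results: Dict[str, Dict[str, Any]] = {}
        # Simplification: components run in the order they were added.
        for name, comp in self.components.items():
            results[name] = comp.run(**inputs[name])  # the runner calls run()
            for src, out, dst, inp in self.connections:
                if src == name:
                    inputs[dst][inp] = results[name][out]
        return results


class StubPreprocessor:
    def run(self, text: str) -> Dict[str, str]:
        # Canned output standing in for LLM-based extraction and detection.
        return {"code": "print('hi')", "language": "python"}


class StubPromptBuilder:
    def run(self, code: str, language: str, target: str) -> Dict[str, str]:
        return {"prompt": f"Translate this {language} code to {target}:\n{code}"}


pipe = MiniPipeline()
pipe.add_component("preprocessor", StubPreprocessor())
pipe.add_component("prompt_builder", StubPromptBuilder())
pipe.connect("preprocessor.code", "prompt_builder.code")
pipe.connect("preprocessor.language", "prompt_builder.language")

result = pipe.run({
    "preprocessor": {"text": "Can you convert print('hi') for me?"},
    "prompt_builder": {"target": "JavaScript"},
})
```

Only the first component receives the raw text; the prompt builder gets its code and language inputs through the connections, which is the same idea the real pipeline relies on.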

Conclusion and What’s Next

With your new custom preprocessing component, your code translator is now ready to tackle the unpredictable, messy inputs of the real world. You’ve learned how to extract clean code and detect its language automatically—two superpowers that make your pipeline smarter and more reliable.

Ready to put your skills to the test? Up next is a hands-on practice section where you’ll build and experiment with your own preprocessing routines. This is your chance to see how your component handles all sorts of tricky inputs—so get creative!

And don’t stop there: after mastering preprocessing, we’ll dive into postprocessing techniques to polish your translated code even further. The journey to a truly robust code translation system is just getting started!
