Poster in Workshop: The 3rd DL4C Workshop: Emergent Possibilities and Challenges in Deep Learning for Code
From Pseudo-Code to Source Code: A Self-Supervised Search Approach
Adithya Kulkarni · Mohna Chakraborty · Yonas Sium · Sai Valluri · Wei Le · Qi Li
Identifying algorithm implementations in source code is crucial for code comprehension, reference retrieval, and program synthesis. This paper presents PC2SC, a novel framework for mapping pseudo-code to source code without manual annotations. We introduce p-language, a structured representation that encodes control flow, mathematical expressions, and natural language descriptions of algorithms. A static analyzer extracts these features and converts pseudo-code into p-code, which is then embedded into a shared vector space with source code using self-supervised learning for retrieval. Given pseudo-code as input, PC2SC returns a ranked list of matching code snippets. Evaluations on the Stony Brook Algorithm Repository and GitHub projects demonstrate that PC2SC outperforms state-of-the-art code search tools in both C and Java. It retrieves correct implementations within the top 25, 10, and 1 ranked results for 98.5%, 93.8%, and 66.2% of queries, respectively. In GitHub projects, it identified algorithm implementations for 74 out of 87 queries. PC2SC bridges the gap between algorithmic descriptions and executable implementations, offering a scalable, language-independent solution for algorithm retrieval and paving the way for future advancements in cross-language code search and automated synthesis.
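As a rough illustration of the retrieval step and the top-k metric mentioned in the abstract, the following sketch ranks code snippets by similarity to a query vector and computes the fraction of queries whose correct snippet appears in the top k. It is not PC2SC's implementation: the vectors are made-up toy values rather than learned embeddings, the function names are hypothetical, and the use of cosine similarity is an assumption, since the abstract does not state the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet names ordered from most to least similar to the query."""
    return sorted(
        snippet_vecs,
        key=lambda name: cosine(query_vec, snippet_vecs[name]),
        reverse=True,
    )


def top_k_rate(rankings, gold, k):
    """Fraction of queries whose gold snippet is within the first k results."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


# Toy embeddings (assumed values, for illustration only).
snippets = {
    "binary_search.c": [0.9, 0.1, 0.0],
    "bubble_sort.c": [0.1, 0.9, 0.2],
    "dfs.java": [0.0, 0.2, 0.9],
}
queries = {
    "binary search": [0.8, 0.2, 0.1],
    "depth-first search": [0.1, 0.1, 0.8],
    "bubble sort": [0.6, 0.5, 0.1],  # deliberately ambiguous query
}
gold = {
    "binary search": "binary_search.c",
    "depth-first search": "dfs.java",
    "bubble sort": "bubble_sort.c",
}

rankings = {q: rank_snippets(vec, snippets) for q, vec in queries.items()}
for k in (1, 2):
    print(f"top-{k} rate: {top_k_rate(rankings, gold, k):.3f}")
```

In this toy setup the ambiguous "bubble sort" query ranks its correct snippet second, so the top-1 rate is lower than the top-2 rate, which mirrors how the reported top-1, top-10, and top-25 figures relate to one another.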