Leverage GenAI For Code Writing#
This blog post explores how to use Generative AI (GenAI) to produce working code that matches a specific codebase’s conventions and the author’s personal style. The approach combines a project knowledge base with few-shot prompting.
This blog is hosted at: https://dev-exp-share.readthedocs.io/en/latest/search.html?q=Leverage+GenAI+For+Code+Writing&check_keywords=yes&area=default
Note
Update (2025-03-03): I have since created a better tool called docpack that can replace this workflow. I am keeping this post for reference.
Sample Use Case#
Let’s consider a practical example using my jsonpolars library (https://github.com/MacHu-GWU/jsonpolars-project). This library lets users write DataFrame manipulation logic in JSON, providing a programming-language-agnostic interface. In jsonpolars, every Polars expression and DataFrame method is wrapped in a Python class with a JSON serialization interface. With over 200 Polars expressions, implementing them all by hand would be time-consuming. My approach was to establish the code framework and design patterns myself, then give GenAI a few worked examples (few-shot prompting) and have it generate the remaining code.
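To make the pattern concrete, here is a minimal, stdlib-only sketch of an expression class with a JSON serialization interface. The classes `Column` and `DtYear`, the `REGISTRY` dict, and this version of `parse_expr` are illustrative stand-ins written for this post, not the actual jsonpolars implementation, and the conversion to a Polars expression is left out so the snippet runs without Polars installed.

```python
import dataclasses
import json
import typing as T

# Hypothetical registry mapping a "type" tag to its class (illustration only).
REGISTRY: T.Dict[str, type] = {}


def parse_expr(dct: T.Dict[str, T.Any]) -> T.Any:
    """Dispatch on the "type" tag to rebuild the matching expression object."""
    return REGISTRY[dct["type"]].from_dict(dct)


@dataclasses.dataclass
class Column:
    name: str
    type: str = "column"

    def to_dict(self) -> T.Dict[str, T.Any]:
        return {"type": self.type, "name": self.name}

    @classmethod
    def from_dict(cls, dct: T.Dict[str, T.Any]) -> "Column":
        return cls(name=dct["name"])


@dataclasses.dataclass
class DtYear:
    expr: T.Any
    type: str = "dt_year"

    def to_dict(self) -> T.Dict[str, T.Any]:
        # Nested expressions are serialized recursively.
        return {"type": self.type, "expr": self.expr.to_dict()}

    @classmethod
    def from_dict(cls, dct: T.Dict[str, T.Any]) -> "DtYear":
        return cls(expr=parse_expr(dct["expr"]))


REGISTRY["column"] = Column
REGISTRY["dt_year"] = DtYear

# Round trip: object -> JSON text -> object
original = DtYear(expr=Column(name="date"))
payload = json.dumps(original.to_dict())
restored = parse_expr(json.loads(payload))
print(payload)
```

Because every class follows the same shape, a handful of hand-written classes like these is usually enough for a model to infer the pattern for the rest.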
Train AI with the Knowledge Base#
To begin, I created an automated script to extract key code from the library and consolidate it into a knowledge base. Here’s an example of such a script:
generate_knowledge_base.py
# -*- coding: utf-8 -*-

"""
This script converts source code into a knowledge base that GenAI can understand.
"""

import typing as T
import shutil
import fnmatch
import dataclasses
from pathlib import Path
from jinja2 import Template


dir_here = Path(__file__).absolute().parent
path_source_code_tpl = dir_here / "source_code_knowledge_base.jinja"
path_test_cases_tpl = dir_here / "test_cases_knowledge_base.jinja"


@dataclasses.dataclass
class PyModule:
    relpath: str = dataclasses.field()
    content: str = dataclasses.field()


def extract_pymodule_list(
    dir_src: Path,
    glob: str = "**/*",
    ignore: T.Optional[T.List[str]] = None,
) -> T.List[PyModule]:
    if ignore is None:
        ignore = []

    pymodule_list = list()

    dirname = dir_src.name
    for path in dir_src.glob(glob):
        # the default glob also matches directories, which cannot be read as text
        if not path.is_file():
            continue

        relpath = path.relative_to(dir_src)

        # identify whether it should be ignored
        match_ignore = False
        for pattern in ignore:
            match_ignore = fnmatch.fnmatch(str(relpath), pattern)
            if match_ignore is True:
                break

        if match_ignore is True:
            continue

        pymodule = PyModule(
            relpath=str(Path(dirname).joinpath(relpath)),
            content=path.read_text(),
        )
        pymodule_list.append(pymodule)

    # sort by file path
    pymodule_list = list(sorted(pymodule_list, key=lambda x: x.relpath))
    return pymodule_list


def reset_dir_out(dir_out: Path):
    if dir_out.exists():
        shutil.rmtree(dir_out)
    dir_out.mkdir(exist_ok=True)


def generate_source_code_knowledge_base(
    project_name: str,
    dir_src: Path,
    dir_out: Path,
    glob: str = "**/*",
    ignore: T.Optional[T.List[str]] = None,
):
    pymodule_list = extract_pymodule_list(
        dir_src=dir_src,
        glob=glob,
        ignore=ignore,
    )
    tpl = Template(path_source_code_tpl.read_text())
    path_source_code_knowledge_base = dir_out / "source_code_knowledge_base.py"
    content = tpl.render(
        project_name=project_name,
        pymodule_list=pymodule_list,
    )
    path_source_code_knowledge_base.write_text(content)


def generate_test_cases_knowledge_base(
    project_name: str,
    dir_src: Path,
    dir_out: Path,
    glob: str = "**/*",
    ignore: T.Optional[T.List[str]] = None,
):
    pymodule_list = extract_pymodule_list(
        dir_src=dir_src,
        glob=glob,
        ignore=ignore,
    )
    tpl = Template(path_test_cases_tpl.read_text())
    path_test_cases_knowledge_base = dir_out / "test_cases_knowledge_base.py"
    content = tpl.render(
        project_name=project_name,
        pymodule_list=pymodule_list,
    )
    path_test_cases_knowledge_base.write_text(content)


dir_out = dir_here / "tmp"
reset_dir_out(dir_out)
dir_project_root = Path.home() / "Documents" / "GitHub" / "s3manifesto-project"

# generate_source_code_knowledge_base
dir_src = dir_project_root / "s3manifesto"
generate_source_code_knowledge_base(
    project_name="s3manifesto",
    dir_src=dir_src,
    dir_out=dir_out,
    glob="**/*.py",
    ignore=[
        "test/*",
        "vendor/*",
    ],
)

# generate_test_cases_knowledge_base
dir_src = dir_project_root / "tests"
generate_test_cases_knowledge_base(
    project_name="s3manifesto",
    dir_src=dir_src,
    dir_out=dir_out,
    glob="**/*.py",
    ignore=[],
)
The resulting knowledge base text includes the codebase structure and source code for each module:
source_code_knowledge_base.py
test_cases_knowledge_base.py
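The Jinja templates themselves are not reproduced in this post. As a rough illustration of what the rendering step produces, here is a stdlib-only sketch that joins a list of modules into a single text document. The function `render_knowledge_base` and the exact layout (a file listing followed by one section per module) are assumptions made for this example; the real templates may be organized differently.

```python
import typing as T


def render_knowledge_base(
    project_name: str,
    modules: T.List[T.Tuple[str, str]],
) -> str:
    """Render (relpath, content) pairs into one knowledge base document.

    The layout here is an assumption for illustration only.
    """
    lines = [
        f"# Knowledge base for project: {project_name}",
        "",
        "# File structure:",
    ]
    # First, a listing of every file so the model sees the codebase structure.
    for relpath, _ in modules:
        lines.append(f"#   {relpath}")
    # Then, the full source of each module under a clearly marked header.
    for relpath, content in modules:
        lines.append("")
        lines.append(f"# ---- file: {relpath} ----")
        lines.append(content.rstrip("\n"))
    return "\n".join(lines) + "\n"


demo = render_knowledge_base(
    project_name="demo_project",
    modules=[
        ("demo_project/__init__.py", "from .api import add\n"),
        ("demo_project/api.py", "def add(a, b):\n    return a + b\n"),
    ],
)
print(demo)
```

Keeping the file path next to each module's source helps the model refer to the right file when it proposes changes.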
Generate a Narrative Description#
Next, I used AI to generate an introductory text for the codebase. The prompt was:
prompt
jsonpolars is A Python library that enables writing Polars data manipulation logic using JSON syntax.
Please use the project knowledge information to write an introduction to this project, no more than 1000 words.
Here is the generated instruction:
generated instruction
Thank you for providing that additional information. I’ll incorporate it into the description. Here’s an updated introduction to the jsonpolars project, including the new details:
jsonpolars is an innovative Python library designed to bridge the gap between JSON-based data manipulation syntax and the powerful Polars data processing library. This project aims to provide a flexible and intuitive way to express Polars operations using JSON structures, making it easier for developers to work with Polars in various contexts.
Key features of jsonpolars include:
1. JSON-based expression of Polars operations: The library allows users to define complex data transformations using JSON syntax, which can then be translated into native Polars operations.
2. Object-oriented representation of Polars operations: jsonpolars converts each Polars DataFrame method and expression into a corresponding Python class. This approach provides a structured way to represent Polars operations in a JSON-compatible format.
3. Serialization and deserialization: Each class in jsonpolars provides a from_dict method to deserialize it from a Python dictionary and a to_dict method to serialize it back to a dictionary. This enables easy conversion between JSON representations and Python objects.
4. Polars integration: The most crucial method for each class is to_polars, which converts the jsonpolars object into the corresponding Polars method or expression. This allows seamless integration with existing Polars workflows.
5. Support for DataFrame operations (dfop): jsonpolars implements various DataFrame operations such as select, rename, drop, with_columns, head, tail, sort, and drop_nulls. These operations can be expressed in JSON format and executed on Polars DataFrames.
6. Expression handling (expr): The library provides support for column operations, datetime manipulations, literal values, arithmetic operations, list operations, type casting, and string functions. These expressions can be combined to create complex transformations.
7. Extensible architecture: jsonpolars is designed with extensibility in mind, allowing for easy addition of new operations and expressions as the Polars library evolves.
8. Type hinting and documentation: jsonpolars includes comprehensive type hints and documentation to aid developers in understanding and using the library effectively.
The project structure includes separate modules for DataFrame operations (dfop) and expressions (expr), as well as utility functions and type definitions. It also includes a testing framework to ensure the correctness of the JSON-to-Polars conversions.
jsonpolars can be particularly useful in scenarios where data processing pipelines need to be defined in a language-agnostic manner, such as in distributed systems or when working with configuration-driven data transformations. By allowing users to express Polars operations in JSON, the library enables greater flexibility in how data processing logic is stored, transmitted, and executed.
The object-oriented approach of jsonpolars, combined with its serialization and deserialization capabilities, provides a robust foundation for working with Polars operations in a JSON-compatible format. This design allows for easy manipulation, storage, and transmission of data processing logic, while the to_polars method ensures that these operations can be efficiently executed using the native Polars library.
While the library is still in its early stages (version 0.1.1), it shows promise in simplifying the integration of Polars into JSON-centric ecosystems and providing a new approach to defining data manipulation workflows. The combination of JSON compatibility and direct Polars integration makes jsonpolars a powerful tool for developers working with data processing pipelines in Python.
Generating Source Code and Test Cases#
With the instruction and knowledge base in place, we can now instruct the AI to generate code that strictly adheres to the project’s coding style. Here’s an example prompt:
prompt
Based on the project knowledge, including how I write the source code and how I write the unit test, help me write the corresponding Python class implementation and Unit Test code for polars.Expr.dt.year method.
The polars expression document is at https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.year.html.
Below is the corresponding polars document from their official website:
```python
def year(self) -> Expr:
    """
    Extract year from underlying Date representation.

    Applies to Date and Datetime columns.

    Returns the year number in the calendar date.

    Returns
    -------
    Expr
        Expression of data type :class:`Int32`.

    Examples
    --------
    >>> from datetime import date
    >>> df = pl.DataFrame(
    ...     {"date": [date(1977, 1, 1), date(1978, 1, 1), date(1979, 1, 1)]}
    ... )
    >>> df.with_columns(
    ...     calendar_year=pl.col("date").dt.year(),
    ...     iso_year=pl.col("date").dt.iso_year(),
    ... )
    shape: (3, 3)
    ┌────────────┬───────────────┬──────────┐
    │ date       ┆ calendar_year ┆ iso_year │
    │ ---        ┆ ---           ┆ ---      │
    │ date       ┆ i32           ┆ i32      │
    ╞════════════╪═══════════════╪══════════╡
    │ 1977-01-01 ┆ 1977          ┆ 1976     │
    │ 1978-01-01 ┆ 1978          ┆ 1977     │
    │ 1979-01-01 ┆ 1979          ┆ 1979     │
    └────────────┴───────────────┴──────────┘
    """
```
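Since the same prompt has to be repeated for each of the remaining expressions, it can help to build it from a template. The sketch below is one way to do that with the standard library. `PROMPT_TEMPLATE`, `doc_url_for`, and `build_prompt` are hypothetical helpers written for this post, and the assumption that every expression's documentation page follows the same URL pattern as `polars.Expr.dt.year` should be checked before relying on it.

```python
import string

FENCE = "`" * 3  # built programmatically to avoid nesting literal code fences

# Hypothetical template modeled on the prompt shown above.
PROMPT_TEMPLATE = string.Template(
    "Based on the project knowledge, including how I write the source code "
    "and how I write the unit test, help me write the corresponding Python "
    "class implementation and Unit Test code for ${method} method.\n"
    "The polars expression document is at ${doc_url}.\n"
    "Below is the corresponding polars document from their official website:\n"
    "${fence}python\n"
    "${docstring}\n"
    "${fence}\n"
)


def doc_url_for(method: str) -> str:
    # Assumption: other expression pages share the URL pattern of dt.year.
    return (
        "https://docs.pola.rs/api/python/stable/reference/expressions/api/"
        f"{method}.html"
    )


def build_prompt(method: str, docstring: str) -> str:
    """Fill the template for one target method."""
    return PROMPT_TEMPLATE.substitute(
        method=method,
        doc_url=doc_url_for(method),
        docstring=docstring.strip(),
        fence=FENCE,
    )


# The docstring would normally be pasted from the official documentation;
# a short placeholder is used here.
prompt = build_prompt(
    method="polars.Expr.dt.year",
    docstring='def year(self) -> Expr:\n    """Extract year from underlying Date representation."""',
)
print(prompt)
```

Pasting the official docstring into the prompt gives the model ready-made input and output data for the unit test, which is why the example table in the documentation is worth including.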
AI Generated Source Code and Test Case
Certainly! I'll create the corresponding Python class implementation for `polars.Expr.dt.year` method and its unit test based on the project structure and coding style you've shown.
First, let's implement the `DtYear` class in the `jsonpolars/expr/dt.py` file:
```python
# -*- coding: utf-8 -*-

import typing as T
import dataclasses

import polars as pl

from ..sentinel import NOTHING, REQUIRED, OPTIONAL
from ..base_expr import ExprEnum, BaseExpr, expr_enum_to_klass_mapping, parse_expr

if T.TYPE_CHECKING:  # pragma: no cover
    from .api import T_EXPR


@dataclasses.dataclass
class DtYear(BaseExpr):
    type: str = dataclasses.field(default=ExprEnum.dt_year.value)
    expr: "T_EXPR" = dataclasses.field(default=REQUIRED)

    @classmethod
    def from_dict(cls, dct: T.Dict[str, T.Any]):
        return cls(expr=parse_expr(dct["expr"]))

    def to_polars(self) -> pl.Expr:
        return ensure_datetime(self.expr).year()


expr_enum_to_klass_mapping[ExprEnum.dt_year.value] = DtYear
```
Now, let's add the corresponding unit test in the `tests/test_expr_dt.py` file:
```python
# -*- coding: utf-8 -*-

from datetime import date

import polars as pl

from jsonpolars.expr import api as expr
from jsonpolars.tests.expr_case import Case

case_dt_year = Case(
    input_records=[
        {"date": date(1977, 1, 1)},
        {"date": date(1978, 1, 1)},
        {"date": date(1979, 1, 1)},
    ],
    expr=expr.DtYear(
        expr=expr.Column(name="date"),
    ),
    output_records=[
        {"date": 1977},
        {"date": 1978},
        {"date": 1979},
    ],
)


def test():
    print("")

    # ... (existing test cases)
    case_dt_year.run_with_columns_test()


if __name__ == "__main__":
    from jsonpolars.tests import run_cov_test

    run_cov_test(__file__, "jsonpolars.expr.dt", preview=False)
```
To complete the implementation, you'll need to make a few more changes:
1. Add `DtYear` to the `ExprEnum` class in `jsonpolars/base_expr.py`:
```python
class ExprEnum(str, enum.Enum):
    # ... (existing enums)
    dt_year = "dt_year"
    # ... (other enums)
```
2. Update the `T_EXPR` type alias in `jsonpolars/expr/api.py`:
```python
T_EXPR = T.Union[
    # ... (existing types)
    DtYear,
    # ... (other types)
]
```
3. Import `DtYear` in `jsonpolars/expr/api.py`:
```python
from .dt import DtYear
```
These changes will integrate the new `DtYear` class into your existing `jsonpolars` structure, following the patterns established in your project. The unit test case demonstrates how to use the `DtYear` expression and verifies its functionality by comparing the output with the expected results.
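The AI's answer spreads one new expression across several files, so it is easy to apply one change and forget another. A small consistency check can catch that. The sketch below uses simplified stand-ins that borrow the names `ExprEnum` and `expr_enum_to_klass_mapping` from the generated code; the helper `find_unregistered` and the `dt_month` member are hypothetical and exist only for this illustration.

```python
import enum
import typing as T


class ExprEnum(str, enum.Enum):
    # Simplified stand-in for the project's enum.
    column = "column"
    dt_year = "dt_year"
    dt_month = "dt_month"  # declared, but no class registered yet


class Column:
    pass


class DtYear:
    pass


# Stand-in for the mapping that each expression module populates.
expr_enum_to_klass_mapping: T.Dict[str, type] = {
    ExprEnum.column.value: Column,
    ExprEnum.dt_year.value: DtYear,
}


def find_unregistered(
    enum_cls: T.Type[enum.Enum],
    mapping: T.Dict[str, type],
) -> T.List[str]:
    """Return enum values that have no class registered in the mapping."""
    return sorted(m.value for m in enum_cls if m.value not in mapping)


missing = find_unregistered(ExprEnum, expr_enum_to_klass_mapping)
print(missing)  # ['dt_month']
```

Run as part of the test suite, a check like this flags an enum member that was added without its class, or the reverse if the comparison is inverted.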
Test the Generated Code#
After generating the code, the next step is to test it to ensure it functions correctly within the existing codebase.
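Before pasting generated code into the project and running the real test suite, a quick static check can catch obvious problems such as truncated output or a missing method. The sketch below uses only the standard library's `ast` module. The helper `check_generated_module` is hypothetical and written for this post, and it complements the project's unit tests rather than replacing them.

```python
import ast
import typing as T


def check_generated_module(
    source: str,
    expected_class: str,
    required_methods: T.Sequence[str],
) -> T.List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e.msg} (line {e.lineno})"]

    # Look for the expected top-level class and compare its methods.
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == expected_class:
            defined = {
                n.name for n in node.body if isinstance(n, ast.FunctionDef)
            }
            return [
                f"missing method: {name}"
                for name in required_methods
                if name not in defined
            ]
    return [f"class not found: {expected_class}"]


# A trimmed-down stand-in for AI-generated source code.
generated = '''
class DtYear:
    @classmethod
    def from_dict(cls, dct):
        return cls()

    def to_polars(self):
        raise NotImplementedError
'''

problems = check_generated_module(
    generated,
    expected_class="DtYear",
    required_methods=["from_dict", "to_polars"],
)
print(problems)  # []
```

Once the static check passes, the generated test case can be run like any other test in the project to confirm that the output records match.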