Chinese to English Format

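I often need to copy Chinese articles from the web into my personal knowledge base. The knowledge base is written mainly in English, and for readability it follows a set of punctuation conventions. For example:

  1. Put a space between Chinese and English text.

  2. Use English punctuation.

  3. Put a space after sentence punctuation, for example ``, ``, ``. ``, ``: ``, ``; ``.

  4. Put a space before and after parentheses, for example `` ABC (Alice, Bob, Cathy) XYZ (…) ``.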

So I need a small tool that reformats Chinese text to follow the conventions above.
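To make rule 1 concrete, here is a minimal sketch that handles only the spacing between Chinese and English text, using two regular expressions. It is an illustration rather than the tool described on this page: the function name `add_cjk_ascii_spacing` is hypothetical, and it assumes that "Chinese" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that "English" means ASCII letters and digits.

```python
import re

# Assumption: "Chinese" is the CJK Unified Ideographs block only.
CJK = r"\u4e00-\u9fff"
# Assumption: "English" is ASCII letters and digits only.
ASCII_WORD = r"A-Za-z0-9"


def add_cjk_ascii_spacing(text: str) -> str:
    """Insert one space wherever a CJK character touches an ASCII letter or digit."""
    # CJK followed by ASCII, e.g. "用P" -> "用 P"
    text = re.sub(rf"([{CJK}])([{ASCII_WORD}])", r"\1 \2", text)
    # ASCII followed by CJK, e.g. "n写" -> "n 写"
    text = re.sub(rf"([{ASCII_WORD}])([{CJK}])", r"\1 \2", text)
    return text


print(add_cjk_ascii_spacing("使用Python写了一个小工具"))
# 使用 Python 写了一个小工具
```

Text that is already spaced is left unchanged, because both patterns only match a CJK character and an ASCII character that are directly adjacent.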

I first implemented this with GPT, but the output turned out to be unstable, so I ended up writing a small tool in Python myself.

    # -*- coding: utf-8 -*-

    """
    Convert Chinese-formatted text to English formatting.

    1. Convert Chinese punctuation to English punctuation.
    2. Add a space between Chinese and English text.
    3. Add a space before or after punctuation such as "([<,." where appropriate.

    How to use:

    Scroll to the bottom of this script, put your text in the ``text`` variable,
    and run the script. The output is copied to the clipboard automatically, so
    you only need to paste it with Ctrl + V.

    Requirements::

        pyperclip>=1.8.2,<2.0.0
    """

    import re
    import string

    # Chinese (full-width) punctuation -> English (ASCII) punctuation.
    mapper = {
        "‘": "'",
        "’": "'",
        "(": "(",
        ")": ")",
        ",": ",",
        "。": ".",
        "-": "-",
        "–": "-",
        "?": "?",
        ":": ":",
        ";": ";",
        "!": "!",
        "、": ",",
        "…": "...",
        "“": '"',
        "”": '"',
        "《": "<",
        "》": ">",
        "【": "[",
        "】": "]",
        "~": "~",
    }

    pre_space_chars = {"(", "[", "<"}
    post_space_chars = {")", "]", ">", ",", ".", ":", ";", "?", "!"}
    pre_and_post_space_chars = {"/", "|", "&", "+"}
    non_cn_chars = set(string.ascii_letters + string.digits + string.punctuation)
    ascii_chars = set(string.ascii_letters + string.digits)
    stop_chars = {",", ".", ":", ";", "?", "!"}


    def main(text: str) -> str:
        # Re-split after mapping so that "..." becomes three separate characters.
        chars = list("".join(mapper.get(char, char) for char in text))
        print("--- after mapper ---\n{}".format("".join(chars)))  # for debug only
        new_chars = []
        for i, char in enumerate(chars):
            # Neighbors are None at the text boundaries. Indexing chars[i - 1]
            # directly would wrap around to the last character when i == 0.
            prev_char = chars[i - 1] if i > 0 else None
            next_char = chars[i + 1] if i + 1 < len(chars) else None
            prev_is_text = prev_char is not None and not prev_char.isspace()
            next_is_text = next_char is not None and not next_char.isspace()

            if char in pre_space_chars:
                # Opening bracket: add a space before it.
                if prev_is_text:
                    new_chars.append(" ")
                new_chars.append(char)

            elif char in post_space_chars:
                # Closing bracket or stop character: add a space after it,
                # unless another stop character follows, as in ").".
                new_chars.append(char)
                if next_is_text and next_char not in stop_chars:
                    new_chars.append(" ")

            elif char in pre_and_post_space_chars:
                if prev_is_text:
                    new_chars.append(" ")
                new_chars.append(char)
                if next_is_text and next_char not in stop_chars:
                    new_chars.append(" ")

            elif char in ascii_chars:
                # a-zA-Z0-9: add a space on each side that touches a
                # Chinese character.
                if prev_is_text and prev_char not in non_cn_chars:
                    new_chars.append(" ")
                new_chars.append(char)
                if next_is_text and next_char not in non_cn_chars:
                    new_chars.append(" ")

            else:
                new_chars.append(char)

        text = "".join(new_chars)
        print(f"--- after format ---\n{text}")  # for debug only
        # Collapse runs of spaces without touching line breaks.
        new_text = re.sub(r" {2,}", " ", text).strip()
        print("--- output ---")  # for debug only
        print(new_text)
        return new_text


    if __name__ == "__main__":
        # enter your text here
        text = """
    W-2表格,又叫做年度工资总结表,W-2表格是雇主需要在每个报税年结束之后发给每个雇员和美国国家税务局Internal Revenue Service(IRS)的报税文件。W-2表格报告了员工的年薪和工资中扣缴的各类税款(联邦税、州税、地方税等)。W-2是非常非常非常重要的报税文件,里面的信息是填写报税表的关键!
    """.strip()
        new_text = main(text)

        try:
            import pyperclip

            pyperclip.copy(new_text)
            print("output copied to clipboard! You can paste it now.")
        except ImportError:
            # pyperclip is optional; without it the output is only printed.
            pass
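As a side note on the punctuation-mapping step, a dictionary like `mapper` can also be applied in a single call with the standard library's `str.maketrans` and `str.translate`. The sketch below is an alternative illustration, not what the script above does: the names `PUNCTUATION_TABLE` and `to_english_punctuation` are hypothetical, and the table covers only a small subset of the punctuation.

```python
# Hypothetical table with a small subset of full-width punctuation.
# Keys must be single characters; values may be strings of any length.
PUNCTUATION_TABLE = str.maketrans(
    {
        ",": ",",
        "。": ".",
        ":": ":",
        ";": ";",
        "(": "(",
        ")": ")",
        "…": "...",
    }
)


def to_english_punctuation(text: str) -> str:
    """Replace each mapped full-width punctuation mark with its ASCII form."""
    return text.translate(PUNCTUATION_TABLE)


print(to_english_punctuation("你好,世界。"))
# 你好,世界.
```

`str.translate` returns a plain string, so a one-to-many replacement such as `…` to `...` comes out as three ordinary characters. The spacing rules would still have to run as a separate pass afterwards.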