Chinese to English Format
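I often need to copy Chinese articles from the web into my personal knowledge base. The knowledge base is written mainly in English and, for readability, follows a set of punctuation conventions. For example: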
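- Put a space between Chinese and English text.
- Use English punctuation marks.
- Put a space before and after parentheses, for example `` ABC (Alice, Bob, Cathy) XYZ (…) ``.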
So I need a small tool that reformats Chinese text to follow the conventions above.
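The first convention (a space between Chinese and English text) can be illustrated with a short regular-expression sketch. This is only an illustration and not the approach the script below takes: the function name `space_between_cjk_and_ascii` is hypothetical, and it assumes that "Chinese" means the CJK Unified Ideographs block U+4E00 to U+9FFF.

```python
import re

# Assumption: "Chinese" means the CJK Unified Ideographs block U+4E00..U+9FFF.
CJK = r"\u4e00-\u9fff"
ASCII_WORD = r"A-Za-z0-9"


def space_between_cjk_and_ascii(text: str) -> str:
    """Insert one space wherever a CJK character touches an ASCII letter or digit."""
    # CJK followed by ASCII, e.g. "是I" -> "是 I"
    text = re.sub(rf"([{CJK}])([{ASCII_WORD}])", r"\1 \2", text)
    # ASCII followed by CJK, e.g. "2表" -> "2 表"
    text = re.sub(rf"([{ASCII_WORD}])([{CJK}])", r"\1 \2", text)
    return text


print(space_between_cjk_and_ascii("W-2表格是IRS的报税文件"))
# W-2 表格是 IRS 的报税文件
```

A regex like this covers only the spacing rule; punctuation conversion and bracket spacing still need separate handling.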
I first implemented this with GPT, but the output was not stable enough, so I ended up writing a small tool in Python myself.
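# -*- coding: utf-8 -*-

"""
This script converts Chinese text to an English format.

1. Convert Chinese punctuation marks to English punctuation marks.
2. Add spaces between Chinese and English text.
3. Add spaces before or after punctuation such as "([<,." where appropriate.

How to use:

Scroll to the bottom of this script, put your text in the ``text`` variable,
and run the script. If pyperclip is installed, the output is copied to the
clipboard automatically, so you only need to press Ctrl + V to paste it.

Requirements::

    pyperclip>=1.8.2,<2.0.0
"""
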
import string

# Map Chinese (full-width) punctuation to its ASCII counterpart.
mapper = {
    "‘": "'",
    "’": "'",
    "(": "(",
    ")": ")",
    ",": ",",
    "。": ".",
    "-": "-",  # full-width hyphen-minus
    "–": "-",  # en dash
    "?": "?",
    ":": ":",
    ";": ";",
    "!": "!",
    "、": ",",
    "…": "...",
    "“": '"',
    "”": '"',
    "《": "<",
    "》": ">",
    "【": "[",
    "】": "]",
    "~": "~",
}

pre_space_chars = {"(", "[", "<"}
post_space_chars = {")", "]", ">", ",", ".", ":", ";", "?", "!"}
pre_and_post_space_chars = {"/", "|", "&", "+"}
non_cn_chars = set(string.ascii_letters + string.digits + string.punctuation)
ascii_chars = set(string.ascii_letters + string.digits)
stop_chars = {",", ".", ":", ";", "?", "!"}


def main(text: str) -> str:
    """Convert Chinese-formatted text to the English format described above."""
    chars = [mapper.get(char, char) for char in text]
    print("--- after mapper ---\n{}".format("".join(chars)))  # for debug only
    last = len(chars) - 1
    new_chars = []
    for i, char in enumerate(chars):
        # Neighbors; None marks the start or end of the text. A plain
        # chars[i - 1] would wrap around to the last character when i == 0
        # instead of raising IndexError.
        prev_char = chars[i - 1] if i > 0 else None
        next_char = chars[i + 1] if i < last else None

        if char in pre_space_chars:
            # Opening bracket: make sure a space precedes it.
            if prev_char != " ":
                new_chars.append(" ")
            new_chars.append(char)
        elif char in post_space_chars:
            # Closing bracket or stop character: make sure a space follows it,
            # unless another stop character comes next, e.g. "),".
            new_chars.append(char)
            if next_char != " " and next_char not in stop_chars:
                new_chars.append(" ")
        elif char in pre_and_post_space_chars:
            # Characters such as "/" or "&": pad both sides.
            if prev_char is not None and prev_char != " ":
                new_chars.append(" ")
            new_chars.append(char)
            if next_char not in (None, " ") and next_char not in stop_chars:
                new_chars.append(" ")
        elif char in ascii_chars:
            # a-zA-Z0-9: add a space on any side that touches a non-ASCII
            # (for example Chinese) character.
            if prev_char is not None and prev_char not in non_cn_chars:
                new_chars.append(" ")
            new_chars.append(char)
            if next_char is not None and next_char not in non_cn_chars:
                new_chars.append(" ")
        else:
            new_chars.append(char)
    text = "".join(new_chars)
    print(f"--- after format ---\n{text}")  # for debug only
    # Collapse runs of spaces and drop leading and trailing ones.
    new_text = " ".join([word for word in text.split(" ") if word.strip()])
    print("--- output ---")  # for debug only
    print(new_text)
    return new_text


# enter your text here
text = """
W-2表格,又叫做年度工资总结表,W-2表格是雇主需要在每个报税年结束之后发给每个雇员和美国国家税务局Internal Revenue Service(IRS)的报税文件。W-2表格报告了员工的年薪和工资中扣缴的各类税款(联邦税、州税、地方税等)。W-2是非常非常非常重要的报税文件,里面的信息是填写报税表的关键!
""".strip()
new_text = main(text)

# Copying to the clipboard is optional; skip it if pyperclip is not installed.
try:
    import pyperclip

    pyperclip.copy(new_text)
    print("output copied to clipboard! You can paste it now.")
except ImportError:
    pass
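
The first step of the script, replacing Chinese punctuation one character at a time, could also be sketched with Python's built-in `str.maketrans` and `str.translate`. This is an alternative shown for comparison, not what the script does. The names `PUNCT`, `TABLE` and `map_punctuation` are hypothetical, and the table is a shortened subset of the script's `mapper`.

```python
# Hypothetical, shortened mapping for illustration only; the script keeps its
# full table in the ``mapper`` dict.
PUNCT = {
    ",": ",",
    "。": ".",
    "(": "(",
    ")": ")",
    "…": "...",
}

# str.maketrans accepts single-character keys and values of any length.
TABLE = str.maketrans(PUNCT)


def map_punctuation(text: str) -> str:
    """Replace each full-width punctuation mark with its ASCII counterpart."""
    return text.translate(TABLE)


print(map_punctuation("税款(联邦税,州税…)。"))
# 税款(联邦税,州税...).
```

One difference worth noting: `translate` returns a flat string, so "..." becomes three separate characters, whereas the list comprehension in the script keeps it as a single list element during the spacing pass.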