CSDN - How Can a Programmer Create a Programming Language? - 鑽石舞台

May 30 Mon 2022 21:30

Author | Md Shuvo  Translator | 彎月

Produced by | CSDN (ID: CSDNnews)

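Every developer has mastered one or even several programming languages, but have you ever thought about creating one yourself?

First, let's look at what a programming language is:

A programming language is a formal language used to define computer programs. It is a standardized means of communication for issuing instructions to a computer: a language that lets programmers specify precisely the data a computer needs to work with and the actions it should take under different circumstances.

In short, a programming language is a set of predefined rules. Those rules then have to be interpreted by a compiler, an interpreter, or a similar tool. So we can simply define some rules and then use any existing programming language to write a program that understands them, which is what we call an interpreter.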

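Compiler: a compiler translates code into machine code that the processor can execute (for example, a C++ compiler).

Interpreter: an interpreter walks through the program line by line and executes each command.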
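To make the distinction concrete, here is a minimal sketch in JavaScript. The toy `print <text>` command and both function names (`interpret`, `compileToJs`) are assumptions made up for illustration, and the "compiler" here emits JavaScript source text rather than real machine code.

```javascript
// Hypothetical toy language: every line has the form `print <text>`.

// Interpreter style: walk the program line by line and act on each
// command immediately (here, by collecting the text to output).
function interpret(source, output = []) {
  for (const line of source.split("\n")) {
    if (line.startsWith("print ")) {
      output.push(line.slice("print ".length))
    }
  }
  return output
}

// Compiler style: translate the whole program into another form first.
// The target here is JavaScript source text, standing in for machine code.
function compileToJs(source) {
  return source
    .split("\n")
    .filter((line) => line.startsWith("print "))
    .map((line) => `console.log(${JSON.stringify(line.slice("print ".length))})`)
    .join("\n")
}

const program = "print hello\nprint world"
console.log(interpret(program))
console.log(compileToJs(program))
```

The interpreter produces the program's effect directly, while the compiler produces a new program that has to be run separately.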

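Now let's try creating a super simple programming language that prints magenta text to the console. We'll name it Magenta.

Building the programming language

In this article I will use Node.js, but you can use any language; the basic idea stays the same. First, let's create an index.js file.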

class Magenta {
  constructor(codes) {
    this.codes = codes
  }

  run() {
    console.log(this.codes)
  }
}

// For now, we are storing codes in a string variable called `codes`
// Later, we will read codes from a file
const codes =
`print "hello world"
print "hello again"`
const magenta = new Magenta(codes)
magenta.run()

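This code declares a class named Magenta. The class defines and initializes an object that prints the text we supply through the variable codes to the console. We defined the variable codes directly in the file: a couple of "hello" messages.

Next, let's create the lexer.

What is a lexer?

Let's take an English sentence as an example: How are you?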

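Here, "How" is an adverb, "are" is a verb, and "you" is a pronoun, followed by a question mark (?). In the same way, we can write JavaScript that divides a sentence or phrase into its grammatical components. Put differently, we split the sentence or phrase into individual tokens. A program that splits text into tokens is a lexer.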
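As a minimal sketch of that idea, the function below splits the example sentence into tokens. The function name `tokenizeSentence` and the two token types (`word`, `punctuation`) are assumptions chosen for illustration; it does not try to tell adverbs from verbs.

```javascript
// Split an English sentence into word and punctuation tokens.
// Letters are grouped into words; spaces are skipped; any other
// character is emitted as a single "punctuation" token.
function tokenizeSentence(sentence) {
  const tokens = []
  let word = ""
  for (const ch of sentence) {
    if (/[A-Za-z]/.test(ch)) {
      word += ch
      continue
    }
    // A non-letter ends the word collected so far.
    if (word !== "") {
      tokens.push({ type: "word", value: word })
      word = ""
    }
    if (ch !== " ") {
      tokens.push({ type: "punctuation", value: ch })
    }
  }
  // Flush a trailing word if the sentence ends with a letter.
  if (word !== "") {
    tokens.push({ type: "word", value: word })
  }
  return tokens
}

console.log(tokenizeSentence("How are you?"))
```

For "How are you?" this yields three word tokens followed by one punctuation token for the question mark.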

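Because our programming language is so small, it has only two types of tokens, each of which carries a single value:

1. keyword

2. string

We could use regular expressions to extract tokens from the codes string, but that can be slow. A better approach is to loop through each character of the codes string and extract tokens as we go. So let's create a tokenize method in the Magenta class (this is our lexer).

The complete code is as follows: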

class Magenta {
  constructor(codes) {
    this.codes = codes
  }

  tokenize() {
    const length = this.codes.length
    // pos keeps track of the current position/index
    let pos = 0
    const tokens = []
    const BUILT_IN_KEYWORDS = ["print"]
    // allowed characters for a variable/keyword
    const varChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'

    while (pos < length) {
      const currentChar = this.codes[pos]
      // if the current char is a space or newline, skip it
      if (currentChar === " " || currentChar === "\n") {
        pos++
        continue
      } else if (currentChar === '"') {
        // if the current char is " then we have a string
        let res = ""
        pos++
        // while we are not at the end of the code and the next char is not " or \n
        while (pos < length && this.codes[pos] !== '"' && this.codes[pos] !== '\n') {
          // add the char to the string
          res += this.codes[pos]
          pos++
        }
        // if the loop ended without finding the closing "
        if (this.codes[pos] !== '"') {
          return { error: `Unterminated string` }
        }
        pos++
        // add the string to the tokens
        tokens.push({ type: "string", value: res })
      } else if (varChars.includes(currentChar)) {
        let res = currentChar
        pos++
        // while the next char is a valid variable/keyword character
        while (pos < length && varChars.includes(this.codes[pos])) {
          // add the char to the word
          res += this.codes[pos]
          pos++
        }
        // if the word is not a built-in keyword
        if (!BUILT_IN_KEYWORDS.includes(res)) {
          return { error: `Unexpected token ${res}` }
        }
        // add the keyword to the tokens
        tokens.push({ type: "keyword", value: res })
      } else {
        // we have an invalid character in our code
        return { error: `Unexpected character ${this.codes[pos]}` }
      }
    }

    // return the tokens
    return { error: false, tokens }
  }

  run() {
    const { tokens, error } = this.tokenize()
    if (error) {
      console.log(error)
      return
    }
    console.log(tokens)
  }
}

Run node index.js in a terminal and you will see the list of tokens printed to the console.
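For comparison, the regular-expression approach mentioned earlier could be sketched roughly as follows. The function name `tokenizeWithRegex` is hypothetical, the pattern is only one possible way to describe the same two token types, and this sketch makes no claim about how its speed compares with the character loop.

```javascript
// A regex-based alternative to the character-by-character tokenizer.
// Assumption: same token types and error messages as the tokenize method.
function tokenizeWithRegex(codes) {
  // Sticky flag (y): the pattern must match exactly at lastIndex.
  // Alternatives: whitespace | "string" (group 1) | word (group 2)
  const pattern = /[ \n]+|"([^"\n]*)"|([A-Za-z_]+)/y
  const tokens = []
  let pos = 0

  while (pos < codes.length) {
    pattern.lastIndex = pos
    const match = pattern.exec(codes)
    if (match === null) {
      // An opening quote that never closes on the same line
      if (codes[pos] === '"') {
        return { error: `Unterminated string` }
      }
      return { error: `Unexpected character ${codes[pos]}` }
    }
    // Continue scanning right after the matched text
    pos = pattern.lastIndex

    if (match[1] !== undefined) {
      tokens.push({ type: "string", value: match[1] })
    } else if (match[2] !== undefined) {
      if (match[2] !== "print") {
        return { error: `Unexpected token ${match[2]}` }
      }
      tokens.push({ type: "keyword", value: match[2] })
    }
    // Otherwise the match was whitespace, which produces no token
  }

  return { error: false, tokens }
}

console.log(tokenizeWithRegex('print "hello world"\nprint "hello again"'))
```

Both versions return the same `{ error, tokens }` shape, so either could feed the rest of the program.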

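Defining rules and syntax

We want to check whether the order of the tokens in codes follows some rule or syntax. But first we need to define what those rules and syntax are. Because our programming language is so small, it has only one simple syntax: the keyword print followed by a string.

keyword:print string

So let's create a parse method that loops over the tokens, checks whether they form valid syntax, and takes the necessary action.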

class Magenta {
  constructor(codes) {
    this.codes = codes
  }

  tokenize() {
    /* previous codes for tokenizer */
  }

  parse(tokens) {
    const len = tokens.length
    let pos = 0

    while (pos < len) {
      const token = tokens[pos]
      // if token is a print keyword
      if (token.type === "keyword" && token.value === "print") {
        // if the next token doesn't exist
        if (!tokens[pos + 1]) {
          return console.log("Unexpected end of line, expected string")
        }
        // check if the next token is a string
        const isString = tokens[pos + 1].type === "string"
        // if the next token is not a string
        if (!isString) {
          return console.log(`Unexpected token ${tokens[pos + 1].type}, expected string`)
        }
        // if we reach this point, we have valid syntax
        // so we can print the string
        console.log('\x1b[35m%s\x1b[0m', tokens[pos + 1].value)
        // we add 2 because we also check the token after print keyword
        pos += 2
      } else {
        // if we didn't match any rules
        return console.log(`Unexpected token ${token.type}`)
      }
    }
  }

  run() {
    const { tokens, error } = this.tokenize()
    if (error) {
      console.log(error)
      return
    }
    this.parse(tokens)
  }
}

With that, our programming language now works!
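The parse method prints with the format string '\x1b[35m%s\x1b[0m'. These are ANSI escape sequences: \x1b[35m switches the foreground color to magenta and \x1b[0m resets the styling. A small sketch of the same idea, where the helper name `magenta` and the two constants are assumptions introduced for illustration:

```javascript
// ANSI escape sequences understood by most terminals
const MAGENTA = "\x1b[35m" // set foreground color to magenta
const RESET = "\x1b[0m"    // reset all styling

// Wrap text so that it is shown in magenta and the color does not leak
// into whatever is printed afterwards.
function magenta(text) {
  return `${MAGENTA}${text}${RESET}`
}

// Both calls write the same characters to the terminal:
console.log(magenta("hello world"))
console.log('\x1b[35m%s\x1b[0m', "hello world")
```

Terminals that do not support ANSI colors may show the escape characters literally instead of coloring the text.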

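Since the codes string is hard-coded, printing the strings it contains is not very useful. So let's move codes into a file named code.m. That way, the Magenta program (the text to run) is kept separate from the interpreter's implementation logic. We use .m as the file extension to indicate that the file contains Magenta code.

Now let's modify the code to read codes from that file: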

// importing file system module
const fs = require('fs')
// importing path module for convenient path joining
const path = require('path')

class Magenta {
  constructor(codes) {
    this.codes = codes
  }

  tokenize() {
    /* previous codes for tokenizer */
  }

  parse(tokens) {
    /* previous codes for parse method */
  }

  run() {
    /* previous codes for run method */
  }
}

// Reading code.m file
// Some text editors use \r\n for new line instead of \n, so we are removing \r
const codes = fs.readFileSync(path.join(__dirname, 'code.m'), 'utf8').replace(/\r/g, "")
const magenta = new Magenta(codes)
magenta.run()
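The \r removal matters because the tokenizer only skips spaces and \n, so a file saved with Windows-style line endings would otherwise stop with an "Unexpected character" error at the first \r. A small sketch of that normalization step in isolation, assuming code.m holds the same two print lines as before (the helper name `normalizeNewlines` is hypothetical):

```javascript
// Strip carriage returns so that \r\n line endings become plain \n.
function normalizeNewlines(text) {
  return text.replace(/\r/g, "")
}

// Assumed contents of code.m when saved with Windows-style line endings
const windowsStyle = 'print "hello world"\r\nprint "hello again"\r\n'

// JSON.stringify makes the otherwise invisible line endings visible
console.log(JSON.stringify(windowsStyle))
console.log(JSON.stringify(normalizeNewlines(windowsStyle)))
```

After normalization, the text contains only characters the tokenizer knows how to handle.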