Bin Wang - My Personal Blog

Use OpenAPI Instead of MCP for LLM Tools

2025-04-27T00:00:00-04:00

Model Context Protocol (MCP) is being adopted by more and more people as a way to integrate tools with LLMs easily. However, I find it unintuitive and unnecessarily complex. So in this article, I’ll explore how to use existing OpenAPI servers as LLM tools instead of rewriting the functions in a completely new protocol. This could potentially become a simpler standard that only requires implementing an additional authentication flow (if authentication is needed).

For people who are not familiar with OpenAPI, it’s a formal way to describe HTTP APIs. You may have heard of Swagger, which is basically the same thing. Lots of HTTP frameworks support it, so you can generate a structured document in JSON or YAML format and view it in tools like Swagger Editor. Because the document is structured, it’s perfect for feeding into an LLM as tool definitions.

The final result is in the repo ai-tool-proto-experiment. It’s a single-file Scala script with fewer than 300 lines of code. It doesn’t use any LLM SDK, just simple HTTP calls to the LLM providers. It doesn’t use any advanced APIs either: only the chat completion API with structured output is needed.

Goals and Non-Goals

A tool server is only one part of MCP; the protocol also covers more, like prompts and resources. Personally, I don’t see much benefit in including so many use cases in a single protocol.

For example, prompts are just a server API from which you can get all the pre-defined prompts. That’s very easy to implement with any protocol, and there is really no need to combine it with the LLM tool protocol.

So in this article, we will only explore how to integrate other services as LLM tools, without caring about the other parts of MCP like prompts and resources.

There are also some other things that MCP doesn’t solve, like security. This post summarized lots of security problems in MCP, and I don’t think there is an easy way around them even if we use existing protocols like OpenAPI. So the goal of this experiment is to use only trusted OpenAPI servers, without worrying about attacks like tool shadowing. That said, authentication to the OpenAPI server is still necessary; it protects the server rather than the client. MCP only added authentication to the spec recently. As you will see later, the authentication workflow I tried here is much simpler and more generic.

Lastly, using as little of the LLM API as possible is also a goal, so that it’s easier to port the implementation to other LLM providers.

The Implementation

Using something like OpenAPI is not a new idea. I’ve seen multiple people mention it in places like Hacker News. During my implementation, I also found that Open WebUI, a tool I self-host and use daily, has added support for using OpenAPI servers as tools. Nevertheless, I still wanted to experiment with my own implementation because I want to keep it as simple as possible and also learn more about the capabilities of such an approach.

In the experiment, I tried both a simple open source weather OpenAPI server and my own project, RSS Brain. I’ll explain how it is implemented and talk about the experiment results at the end.

Define the Tool Calling Structure

Lots of LLM providers support tool calling APIs. We will avoid those APIs to keep things simpler and make the approach more generally applicable to other LLMs, including self-hosted ones. Instead, we define our own JSON schema that we want the LLM to follow and feed it in as part of the system prompt, and we also use the structured output API to enforce that the LLM response follows the schema. I said at the beginning that I want to use as few features as possible, but I think structured output is important enough to use in addition to basic chat completion. Fortunately, lots of other LLM runtimes, including local ones like Ollama, also support this feature.

Here is the response structure we want, in the form of Scala class definitions:

case class ToolParam(
    httpRequestEndpoint: String,
    httpRequestPath: String,
    httpRequestHeaders: Option[Map[String, String]],
    httpRequestMethod: String,
    HttpPostBody: Option[String],
)

case class ChatResponse(
    callTool: Option[ToolParam] = None,
    toUser: Option[String] = None,
)

The LLM should either respond to the user directly using the toUser field, or ask the agent to call an HTTP API with the callTool field. You can see the ToolParam definition is pretty generic: it can basically make any HTTP call.
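To make the generic shape a bit more concrete, here is a minimal sketch of how an agent might turn a ToolParam into an actual HTTP request. It is not the code from the repo: it uses the JDK's built-in java.net.http classes instead of the requests library, and ToolCall, joinUrl and toRequest are hypothetical names made up for this illustration. It only builds the request and does not send it.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers
import scala.util.Try

// Same shape as the ToolParam defined above.
case class ToolParam(
    httpRequestEndpoint: String,
    httpRequestPath: String,
    httpRequestHeaders: Option[Map[String, String]],
    httpRequestMethod: String,
    HttpPostBody: Option[String],
)

// Hypothetical helper, not part of the original script.
object ToolCall {
  // Avoid a doubled slash when the endpoint ends with "/" and the path starts with "/".
  def joinUrl(endpoint: String, path: String): String =
    endpoint.stripSuffix("/") + "/" + path.stripPrefix("/")

  // Build (but do not send) the request described by the LLM.
  def toRequest(p: ToolParam): Either[String, HttpRequest] = {
    val endpoint = p.httpRequestEndpoint
    if (!endpoint.startsWith("http://") && !endpoint.startsWith("https://")) {
      Left(s"Endpoint must include http:// or https://: $endpoint")
    } else {
      Try {
        val body = p.HttpPostBody match {
          case Some(text) => BodyPublishers.ofString(text)
          case None       => BodyPublishers.noBody()
        }
        val builder = HttpRequest
          .newBuilder(URI.create(joinUrl(endpoint, p.httpRequestPath)))
          .method(p.httpRequestMethod.toUpperCase, body)
        // The builder is mutable, so each header call updates it in place.
        p.httpRequestHeaders.getOrElse(Map.empty[String, String]).foreach {
          case (name, value) => builder.header(name, value)
        }
        builder.build()
      }.toEither.left.map(e => s"Invalid tool call: ${e.getMessage}")
    }
  }
}
```

Returning a Left with a readable message, rather than throwing, makes it easy to feed the error back to the LLM as the tool result so it can correct its next attempt.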

For OpenAI, the structured output API only accepts a subset of JSON Schema. So instead of converting the structure to a JSON schema with a single line of Scala code, I needed to write the OpenAI-compatible one manually.

I also found that OpenAI models, gpt-4o-last at least, often fail to generate responses that meet the structure requirements even when structured output is enabled. You still need to include the JSON schema in the system prompt to get the best chance.

Overall, here is the system prompt to let the system use the tools:

val systemPrompt: String = {
  val timeStr = ZonedDateTime.now().format(DateTimeFormatter.ISO_ZONED_DATE_TIME)
  s"""You are a helpful assistant.
     |
     |The current time is $timeStr.
     |
     |You have many tools to use by sending a http request to some API servers. Your response must be Json that
     |follows the Json schema definition:
     |
     |$chatResponseSchemaStr
     |
     |Either request a call to one of the APIs with `callTool` field, or
     |response to user directly with `toUser` field if there is no need to request to any tool or you need more
     |information from the user.
     |
     |Each tool has an optional authUrl that you can ask the user to open in the browser. If you get authentication
     | related errors when calling a tool, ask the user to open the authUrl in browser and copy the instruction back,
     | then use the instruction to try authentication again.
     |
     |Important:
     |
     |* Response only the JSON body. Never quote the response in something like json```...```.
     |* Never response to user directly without using the `toUser` field with a Json response.
     |* Only one of `callTool` and `toUser` field should be filled.
     |* Always include the `http` or `https` part for the `httpRequestEndpoint` field.
     |
     |""".stripMargin
}

You can see there are some extra points at the end; they cover the cases where I found the model stumbles most often.
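Even with those instructions, it can help to be defensive on the agent side. As a small sketch (JsonCleanup and stripCodeFence are hypothetical names, not from the repo), the agent could strip a surrounding Markdown code fence before handing the text to the JSON parser:

```scala
// Hypothetical helper, not part of the original script.
object JsonCleanup {
  // Remove a surrounding Markdown code fence (with or without a language tag
  // such as "json") if the model added one; otherwise return the trimmed text.
  def stripCodeFence(raw: String): String = {
    val trimmed = raw.trim
    if (trimmed.length >= 6 && trimmed.startsWith("```") && trimmed.endsWith("```")) {
      val inner = trimmed.drop(3).dropRight(3)
      val firstNewline = inner.indexOf('\n')
      // Treat the first line as a language tag only if it is empty or purely alphanumeric.
      val withoutTag =
        if (firstNewline >= 0 && inner.take(firstNewline).trim.forall(_.isLetterOrDigit))
          inner.drop(firstNewline + 1)
        else inner
      withoutTag.trim
    } else trimmed
  }
}
```

This only handles a fence that wraps the whole response. Anything else is passed through unchanged, so a real parsing error still surfaces instead of being hidden.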

Feed the Tool Information Into LLM

Since OpenAPI can generate a structured document for the API server, in either JSON or YAML, we can feed the document directly into the LLM. In addition to the doc endpoint, we also need to provide the endpoint of the API server, plus an optional authUrl that we will talk about later. Here is the definition of a tool as a Scala class, with its prompt:

case class ToolDef(
    httpEndpoint: String,
    openAPIPath: String,
    authUrl: Option[String] = None,
) {
  def prompt: String = {
    val authUrlPrompt = authUrl.map(url => s"Tool login URL: $url\n").getOrElse("")
    s"""----
       |Tool server endpoint: $httpEndpoint
       |
       |$authUrlPrompt
       |Tool's OpenAPI definition:
       |$openAPIDef
       |
       |----
       |
       |""".stripMargin
  }

  private def openAPIDef: String = {
    requests.get(httpEndpoint + openAPIPath).text()
  }
}

After the system prompt, the tools prompt is sent to the LLM as the first chat message, with the developer role. I find this works better than putting it into the system prompt, maybe because the tool definitions can sometimes be too long:

val tools = Seq(
  ToolDef(httpEndpoint = "https://grpc-gateway.rssbrain.com", openAPIPath = "/swagger.json",
    authUrl = Some("http://app.rssbrain.com/login?redirect_url=/llm_auth")),
)
val toolsPrompt = tools.map(_.prompt).mkString("\n")

val req = ChatRequest(
  messages = Seq(
    ChatMessage(role = "system", content = systemPrompt),
    ChatMessage(role = "developer", content =
      s"""
        |
        |Here are the OpenAPI definition of the tools:
        |
        |$toolsPrompt
        |
        |""".stripMargin),
  ),
)
loop(req, None, waitForUser = true)
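The body of loop is not shown in this post. As a rough sketch of what such an agent loop could look like (this is my own simplified illustration, not the code from the repo), the version below takes the LLM call, the tool call and the user input as plain functions so it can run without any network access. AgentLoop, run, askLlm, runTool and readUser are hypothetical names, and feeding the tool result back with the user role is an assumption.

```scala
import scala.annotation.tailrec

case class ToolParam(
    httpRequestEndpoint: String,
    httpRequestPath: String,
    httpRequestHeaders: Option[Map[String, String]],
    httpRequestMethod: String,
    HttpPostBody: Option[String],
)
case class ChatResponse(callTool: Option[ToolParam] = None, toUser: Option[String] = None)
case class ChatMessage(role: String, content: String)

// Hypothetical sketch of the agent loop, with all side effects injected.
object AgentLoop {
  @tailrec
  def run(
      history: Vector[ChatMessage],
      askLlm: Vector[ChatMessage] => ChatResponse,
      runTool: ToolParam => String,
      readUser: () => Option[String],
      waitForUser: Boolean,
  ): Vector[ChatMessage] =
    if (waitForUser) {
      readUser() match {
        case None => history // no more input: end the conversation
        case Some(input) =>
          run(history :+ ChatMessage("user", input), askLlm, runTool, readUser, waitForUser = false)
      }
    } else {
      askLlm(history) match {
        // The LLM asked for a tool: call it and feed the result back without waiting for the user.
        case ChatResponse(Some(tool), None) =>
          val call = s"Calling ${tool.httpRequestMethod} ${tool.httpRequestEndpoint}${tool.httpRequestPath}"
          val result = runTool(tool)
          val next = history :+ ChatMessage("assistant", call) :+ ChatMessage("user", s"Tool result:\n$result")
          run(next, askLlm, runTool, readUser, waitForUser = false)
        // The LLM answered the user: record it and wait for the next input.
        case ChatResponse(None, Some(text)) =>
          run(history :+ ChatMessage("assistant", text), askLlm, runTool, readUser, waitForUser = true)
        // Both or neither field filled: ask the LLM to try again.
        case _ =>
          val hint = ChatMessage("user", "Invalid response: fill exactly one of callTool and toUser.")
          run(history :+ hint, askLlm, runTool, readUser, waitForUser = false)
      }
    }
}
```

A real agent would probably also cap the number of consecutive tool calls and retries, since this sketch keeps looping for as long as the LLM keeps asking for tools or returning invalid responses.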

Authentication

As you can see from the system prompt above:

Each tool has an optional authUrl that you can ask the user to open in the browser. If you get authentication
related errors when calling a tool, ask the user to open the authUrl in browser and copy the instruction back,
then use the instruction to try authentication again.

We actually take advantage of the flexibility of LLMs for our authentication flow: we define an authUrl for a tool server, which the user can open in the browser. The URL performs the necessary authentication flow, then returns a natural language description of the credentials and how to use them to authenticate with the APIs.

Ideally, the natural language instruction should be passed to the client in a secure way, for example, through a callback to a local URL that the client serves. But to keep the experiment simple, I just ask the user to copy the instruction back into the conversation.

Here is what it looks like in practice:

User input: Get all my RSS folders
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...

The user asks for the RSS folders, so the LLM responds with a callTool action. When the agent tries to call the HTTP API, it returns an authentication error. We feed the result back to the LLM, and it then responds to the user:

Assistant: It seems like your request for fetching RSS folders requires authentication. Please log in to your RSS Brain account and provide the token to proceed. You can open [this login page](http://app.rssbrain.com/login?redirect_url=/llm_auth) to login and obtain the necessary token.

You can see the LLM is asking the user to open a URL in the browser. When the user opens this URL, the service prompts the user to log in and then redirects to a page with a natural language instruction for the LLM. The user copies the instruction back into the chat:

User input: Use `token` param in the APIs to do authentication. Your current token is `XXXXX`.

After the user input is added to the conversation, the LLM knows how to fill in the callTool params with the necessary authentication info, and the call finally succeeds:

Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: Here are your RSS folders:
...

This authentication flow is very flexible: the tool server can implement basically any kind of authentication method as long as it provides a URL that returns the authentication instruction along with the credentials. You can even create a third-party auth server if the original OpenAPI server doesn’t provide this auth workflow, so that you can integrate any OpenAPI service that requires authentication.

Results

Okay, everything is ready now. Let’s try to use it. Follow the readme to install the dependencies and run the script.

Here is a real-world example of using RSS Brain as one of the tool servers:

-> ./ai-tool.sc
Compiling /home/wangbin/hobbi_source/ai-tool-proto-experiment/ai-tool.sc
User input: Hi
Assistant: Hello! How can I assist you today?
User input: Get all my RSS folders
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: It seems like your request for fetching RSS folders requires authentication. Please log in to your RSS Brain account and provide the token to proceed. You can open [this login page](http://app.rssbrain.com/login?redirect_url=/llm_auth) to login and obtain the necessary token.
User input: Use `token` param in the APIs to do authentication. Your current token is `XXXXX`.
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: Here are your RSS folders:

1. **Default Folder:**
   - Name: Default
   - Description: user default folder
   - Layout: LIST
   - Order: TIME

2. **news-local:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

3. **local-forum:**
   - Description: N/A
   - Layout: LIST
   - Order: SCORE

4. **news-cn:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

5. **cn-podcast:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

6. **cn-media:**
   - Description: N/A
   - Layout: GRID
   - Order: TIME

7. **news:**
   - Description: News around world
   - Layout: LIST
   - Order: TIME

8. **news-Ukraine:**
   - Description: N/A
   - Search Term: "Ukraine" AND "Russia"
   - Layout: LIST
   - Order: TIME

9. **blogs:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

10. **tech-blogs:**
    - Description: N/A
    - Layout: LIST
    - Order: TIME

11. **tech-news:**
    - Description: N/A
    - Layout: LIST
    - Order: TIME

... (and more)

For a complete list or details about any specific folder, please let me know!
User input: What happened today based on articles in local-news?    
Calling tool https://grpc-gateway.rssbrain.com//rss.ArticleAPI/GetArticlesByFolderWithUserMarking ...
Assistant: Here are some notable articles from today based on the 'local-news' folder:

1. **A short history of airport gold heists (because Toronto's isn't the first)**
   - **Source:** Canada - CBC News
   - **Description:** Reflection on Toronto's Pearson International Airport's gold heist and similar historical events.
   - [Read more](https://www.cbc.ca/news/world/airport-gold-heists-history-1.6817567)

2. **Family of Nunavut woman says she was used to get Ontario twins Inuit status**
   - **Source:** Canada - CBC News
   - **Description:** A story about a family's reaction after a woman is allegedly used in a fraud to gain Inuit status for twins.
   - [Read more](https://www.cbc.ca/news/canada/north/ontaro-twins-inuit-status-claims-1.6814782)

3. **CBC partners with Markham Library to offer youth Radio Camp**
   - **Source:** Toronto - CBC News
   - **Description:** A creative initiative by CBC for youth in Markham, involving sound exploration and radio production.
   - [Read more](https://www.cbc.ca/news/canada/toronto/community/cbc-partners-with-markham-library-to-offer-youth-radio-camp-1.6806360)

4. **Federal government says it will restore funding to Hockey Canada — with conditions**
   - **Source:** Canada - CBC News
   - **Description:** Funding for Hockey Canada is restored with stipulations after controversy over handling of claims.
   - [Read more](https://www.cbc.ca/news/politics/hockey-canada-federal-government-funding-restored-1.6812492)

5. **Man dies after being hit by tow truck in York Region industrial accident, police say**
   - **Source:** Toronto - CBC News
   - **Description:** Tragic incident of a man being fatally struck by a tow truck, with ongoing investigation.
   - [Read more](https://www.cbc.ca/news/canada/toronto/industrial-accident-york-region-tow-truck-death-1.6780204)

For more details, you can click on the links provided with each summary.
User input: Summarize the first article
Assistant: The first article is titled **"A short history of airport gold heists (because Toronto's isn't the first)"** from "Canada - CBC News". It discusses a recent heist that took place
 at Toronto's Pearson International Airport, where $20 million worth of gold and other valuables were stolen. This event is compared to similar high-profile historical heists at airports, highlighting
 recurring vulnerabilities in airport security regarding valuable shipments. The article provides context and details on how such heists have occurred before in different parts of the world.

Since each run is pretty expensive with the OpenAPI doc included, I didn’t try it much. But I learnt a few things:

Good documentation is really important. The LLM leaves out some parameters if you don’t explain why they are needed. I think that’s actually a good thing, since good documentation is not only for machines but also for other people.
Structured output is not always enforced. The JSON schema also needs to be in the system prompt to make it more reliable.
APIs need to be designed around the potential use cases. For example, the API that gets articles from a feed doesn’t take a time range, since the UI of RSS Brain uses infinite scroll. So it’s hard for the LLM to get the articles in a specific time range, for example, the last day.

But overall, I think it’s safe to say that, with an additional authentication flow that can even be added by a third party, it’s easy to integrate any OpenAPI server as a tool for an LLM agent without needing any new protocol.

More About Protocol

I want to add more details about choosing OpenAPI as the protocol. The obvious reason is that it has been around for a long time and lots of tools and frameworks support it. But for RSS Brain, I only had a gRPC server. I implemented the gRPC server with my own library, scala2grpc, and really liked it. However, a downside of gRPC is that the serialized messages are not human-readable, so it’s hard to define a generic tool-calling structure. It’s also more difficult to call the API without generated gRPC code. So in the end, I used grpc-gateway to proxy the gRPC server as an OpenAPI server. The code is released on GitHub.

grpc-gateway has a tool to auto-generate the OpenAPI structured document if you have the necessary comments in the gRPC protobuf files. So I added a feature to scala2grpc that adds comments based on Scaladoc, and it turned out wonderfully.

With all that said, I still think gRPC has its advantages because of its bi-directional streaming capability. The proto definition is also easier for humans to understand, which is a plus for LLMs too. Maybe with some more work, and by leveraging things like gRPC reflection, we can also define a generic tool-calling action with gRPC in the agent.

Future Work

This is only an experiment, so there is a lot of work left to do to make it better.

The first is a better authentication flow, so that the auth URL can give the instructions to the LLM agent without the user copying them. This is not hard to do, and it makes the flow much more secure.

Another thing to try is to not include the entire OpenAPI doc in the prompt at first. Instead, include only a summary or description of it, and give more details only when the LLM agent asks for them. This should save a lot of cost.
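As a rough illustration of that idea (untested against a real LLM), the sketch below keeps the full OpenAPI document behind a function that is only called on demand. LazyToolDocs, ToolDoc, overviewPrompt and detailPrompt are hypothetical names, the summary text is assumed to be written by hand, and the LLM would need some extra way to ask for details, for example an additional field in the response structure.

```scala
// Hypothetical sketch of summary-first tool prompts, not part of the original script.
object LazyToolDocs {
  // loadOpenApi is only invoked when the details are actually needed.
  case class ToolDoc(httpEndpoint: String, summary: String, loadOpenApi: () => String)

  // Cheap prompt sent up front: one short entry per tool, no OpenAPI document.
  def overviewPrompt(tools: Seq[ToolDoc]): String =
    tools
      .map(t => s"Tool server endpoint: ${t.httpEndpoint}\nSummary: ${t.summary}")
      .mkString("\n----\n")

  // Expensive prompt, built only when the LLM asks about a specific endpoint.
  def detailPrompt(tools: Seq[ToolDoc], endpoint: String): Option[String] =
    tools
      .find(_.httpEndpoint == endpoint)
      .map(t => s"Tool's OpenAPI definition for ${t.httpEndpoint}:\n${t.loadOpenApi()}")
}
```

The trade-off is an extra round trip to the LLM whenever a tool is actually needed, in exchange for a much smaller prompt in conversations that never touch that tool.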

There can also be more validation for security purposes. For example, when calling a tool or letting the user open an auth URL in the browser, verify that the hostname matches what we have defined in the tools.

It’s also worth trying other protocols like gRPC, since it supports bi-directional streaming, which may work better in some use cases.
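A hostname check like this could be sketched as follows. HostCheck, hostOf and isAllowed are hypothetical names, and the exact-match rule is an assumption: a real agent might also want to compare the scheme and port.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of the original script.
object HostCheck {
  // Lower-cased host of a URL, or None if it cannot be parsed or has no host.
  def hostOf(url: String): Option[String] =
    Try(new URI(url)).toOption.flatMap(u => Option(u.getHost)).map(_.toLowerCase)

  // Allow a URL only if its host exactly matches the host of one of the configured tool URLs.
  def isAllowed(url: String, configuredUrls: Seq[String]): Boolean = {
    val allowedHosts = configuredUrls.flatMap(hostOf).toSet
    hostOf(url).exists(allowedHosts.contains)
  }
}
```

Comparing parsed hosts instead of checking string prefixes matters here: a URL such as https://grpc-gateway.rssbrain.com.evil.example.com/ starts with a trusted endpoint string but points to a different host.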

Travel Back To China: 2025 Edition

2025-02-28T00:00:00-05:00

As I mentioned in the last year-end retrospective blog, I traveled back to China again this year (to be more accurate, at the end of 2024). I moved to Canada just before Covid, which is why I only started going back last year, after the Covid measures were all lifted. I think it’s worth a blog post every time I go back, since China, the country where I spent most of my life, not only made me who I am today but also still has so many connections with me. Most of my friends and extended family members are still there. I also find that going back every year instead of living there all the time makes me a better observer. I can notice changes more easily since I’m comparing things from year to year instead of from day to day. I can notice things that people living there all the time may have already gotten used to. I also have more motivation to observe since the short visiting time is so valuable to me. So here it is, the 2025 edition of travelling back to China.

Toronto

My wife and I planned the travel dates in October. We decided to go back during the Christmas and New Year holidays instead of Chinese New Year, mostly because I could take advantage of some public holidays so I wouldn’t need to take as many days off. It’s also easier to coordinate with co-workers, since most people are away and it’s hard to get projects done anyway. However, flights are really expensive during the holiday season, and most of the tickets require multiple stopovers. Then we thought, why not take the opportunity to visit some places on the way? We stopped in Tokyo for a few days last year when we travelled back, so why not do something similar? We’ve always wanted to visit Europe. So this time, we went big and planned to stay in London on the way to China, and in Paris on the way back to Toronto. The whole trip looked like this: Toronto -> New York -> London (stay) -> Shanghai -> Zhengzhou -> my hometown. Then for the return trip: my hometown -> Beijing -> Paris (stay) -> New York -> Toronto. Crazy, right? But travelling to Europe for only the additional expense of hotels was such an attractive idea, and last year’s trip gave us a lot of confidence in travelling with a baby.

Before I went back, I had just finished reading Other Rivers by Peter Hessler. Then I started to read Country Driving again, another book of his, written more than 10 years ago. Peter Hessler is an American, and most of his books are about his experiences living in China, which I absolutely love reading. It’s an interesting coincidence that his coverage of China happens to fill the gaps in my own experience: he started his China experience in 1996, about 4 years after I was born. Then he left China in 2010, one year after I started my university life and began to experience and think about Chinese society by myself. In 2019, the year I moved to Canada, he went back to China again, so his writing helped me fill the gap since then. It’s a shame he had to leave China again in 2021 (or lucky, I guess, considering the Covid measures in China after that). It’s really refreshing to read the perspective of someone from a different culture. As I said above, it makes a person a better observer. His books also inspired me to write blog posts like this to record things in China. Reading the books, it feels like China is changing so fast. Even his newest book, Other Rivers, only covers the early stages of China’s Covid measures, which already seems outdated after only 2 years. It seems impossible to grasp China as a whole. On that level, my blog posts are no different: they are just one person’s observations of a snapshot of China.

Other than reading the books, I didn’t really have much of a plan for the trip. Work was busy because I needed to wrap things up before leaving for vacation. It was flu season, and combined with vaccines, my daughter gave the whole family a hard time. We only started to pack two days before leaving, and only had a very rough plan for our time in London. But that’s enough: a detailed plan is mostly useless when travelling with a baby. It’s better to adjust based on the situation and plan accordingly, which I’m already very good at.

London

After transferring through New York, we arrived in London in the morning. I was very excited since it was my first time in Europe. We booked a hotel in the core of London so that we could walk to most of our destinations. We took the train from the airport to the hotel. The train stopped before reaching our destination, which was annoying but understandable to me: I know that some trains don’t run the full route and turn back early. We took the next train that came, only to find out it was on a different line. It turned out that it was really a train station instead of a subway station, one that serves trains going in multiple directions. When we realized this after one station, I overheard someone else who had also taken the wrong train, and it was also their first time in London.

Anyway, even with this small bump, we still arrived at the hotel very early in the morning, and my daughter was already asleep in the stroller. Since it wasn’t time for check-in yet, we left the luggage at the hotel and headed to the British Museum. We planned to visit the museum on the first day since the time spent there could be flexible: we knew it was impossible to see the whole collection in a single day, so we would just take whatever time was available. Luckily, my daughter slept the whole morning, and we ate some (awful) food in the museum after she woke up. It’s a really rare opportunity to be able to see so many valuable collections, especially the ones from very early civilizations like Egypt and Assyria, which I have always fancied, as well as some rare collections from China. Without realizing it, we spent much more time than we had planned; it felt like a rabbit hole, and it was impossible to see everything I wanted to see.

There were not many people on the streets when we walked from the hotel to the museum, and I thought that’s just how London is. However, when we went back, we found the streets were packed with people. The shops, restaurants, malls, trees, streets, basically everything, all had very beautiful decorations. The holiday spirit was really magical. I had never seen anything like it. Especially with the narrow European-style roads, the whole city was like a gigantic Christmas market. At night, we walked through the main shopping areas, and the lights were gorgeous. It felt really magical and unreal.

Other than the holiday vibe, I also enjoyed the culture very much. There are so many civilizations with long histories in the world, but the only one I’m very familiar with is Chinese culture and its influence in East Asia. I’ve been to some other countries like the US and New Zealand, not to mention Canada, but those countries all started as colonies, where the newcomers don’t have a long history and the native people’s culture has long been overshadowed. But in London, I got the opportunity to experience a totally different culture that has a long history and has had great influence on the world. I visited places like the Tower of London, Buckingham Palace, the National Gallery and so on. But I had the most stunning experience at Westminster Abbey: I didn’t do much research for the trip, so before I went there, I just knew the architecture is beautiful and it’s one of the must-see sites in London. But when I entered, I was shocked to learn that the burial site is actually inside the abbey, right under and beside your feet. This burial style is unthinkable in Chinese culture. The sculptures are beautiful and gorgeous. It felt strange that so many great names are buried in such a small space. It makes the grand church feel crowded. It’s like walking into a Hollywood party with all kinds of stars, except the people there are far more famous and important: kings, queens, scientists, writers and so on. The building truly represents the country’s culture and glory: the customs, the history, the religion, the art, and the people.

We hit another similar bump on the way back: we took the wrong train again when we went to the airport. There were so many people at the airport, and it was a total mess. Fortunately, we had left early for the airport and were able to use a priority line with my daughter, so we didn’t miss our flight. With that and a really long trip, we finally arrived in China.

Hometown

Nothing reminds you how fast time passes more than a baby’s growth: last year when we went back, our daughter couldn’t even crawl. This year, she is running all over the place and speaking so many words. Watching her learn new things every day and try to talk and interact with all the family members brings so much joy to everyone. However, things were not so smooth at the beginning: when we first arrived, my daughter had a mild fever. It was nothing too bad, and she recovered after just one day. But when we took her to my mother’s place, she was still very tired from the disrupted sleep schedule and jet lag. It took some serious crying before we could put her to sleep. The crying made my mother and my grandparents very worried. It also made me wonder if it’s worth spending so much time, effort and money to go back just to stay for less than 3 weeks. But after my daughter had had enough rest and woke up, she played with the family so happily. On each of the visits, she tried to remember and babble the words for everyone, like grandmother and so on. During one of the visits, my grandfather told us that my uncle, who is in the US, had cancelled his plan to come back this year. He said it with anger, which is normal since he is a serious man with a bad temper. But there was also sadness on his face that he tried to hide but couldn’t. It was at that moment that I decided I would go back every year as long as there are no events like a pandemic or a war.

Talking about a war may seem exaggerated. But look at Ukraine and Palestine: this is the world we are living in. During my time there, a possible war between China and Taiwan was brought up as a topic of conversation multiple times on different occasions. Even when I took a taxi, the driver was watching a video on Douyin (the Chinese version of TikTok) about attacking Taiwan (what’s wrong with people nowadays that they watch videos while driving??). The creator said in a very excited tone that if China attacks Taiwan, the war must be finished in one week no matter what the cost is, otherwise China will be screwed. The taxi driver agreed after watching the video, then, out of the blue, started to criticize Xi, talking about Xi’s poor educational background and how his constitutional amendment will make him infamous forever. Anyway, I think Xi wouldn’t start a war as long as he has strong control of China’s political power. After all, why would he risk that power? It would only happen if the economy got so bad that there was civil unrest, or if there were strong challengers to his rule (no one is on the horizon right now), or if his old age clouded his judgement. But I also heard a rumour that the war will happen before 2027, the 100th anniversary of the Chinese PLA, which made me less certain about my thoughts. After all, it wouldn’t be the first irrational decision Xi has made. That’s how it works in China: without trustworthy media and transparent political practice, information spreads through rumour and discussion is done by guesswork.

Most conversations like this happened at the dining table. I had high expectations for the food before I went back. Some foods from my hometown are very hard to find in Toronto or even in other parts of China, and they are so delicious in my memory. However, the food was underwhelming this time. Maybe my expectations were too high. Back when I was in high school, the same street had two high schools less than 500 meters apart. The schools had a tight schedule: classes started before 8:00 in the morning and ended at about 9:00 at night. There was a longer break at noon because Chinese students usually take a nap then, but only a very short break between the afternoon and evening classes. So naturally, the street became a place with lots of small restaurants and street food. It was packed with students when the afternoon classes were over, and the food stalls would occupy the roadway. It was a mix of chaos and relaxation that is so hard to forget. I had almost every dinner there during my 3 years of high school. I would buy some gaming magazines after dinner, or rent some books, or skip dinner altogether and spend the time playing video games in an Internet cafe.

It’s telling that all the industries behind my after-dinner activities have (almost) vanished. Let’s first address the weird one, in case you’ve never heard of it: book rental. Yes, it was a very popular business on that street back then. It’s very much like movie rental. The books were mostly pirated novels from the Internet. The quality varied but was usually on the lower side. They were perfect for passing time during boring classes, which I did endlessly. But once mobile phones became popular among students, all the book rental stores were simply gone, since the most popular books in those stores were on the Internet anyway. Internet cafes are not as popular anymore but have not totally vanished: even though everyone has their own computer nowadays, they are still useful for friends who want to play video games together. Gaming magazines are mostly the same story: there is less business, but they are not totally gone. But the one I read the most in the past has closed down. I only found out many years later, and it made me very sad.

Actually, not only have the after-dinner activities vanished, my high school has vanished as well: it moved to a suburb-like location and the students must live there. The original site became a middle school.

On the other hand, street food has lasted much longer. So like before, I went there again, but found the food stalls were all hidden in a nearby street. Curious, I asked one of the vendors why, and they told me there was a government inspection for the “clean city” award. This is typical campaign-style governance: when an event like this happens, there will be a campaign across the city with absurd policies that make normal life very inconvenient. The policies are also unsustainable and are reverted soon after the event has passed. It has happened so many times in the past that I didn’t feel surprised. But I also found there were no high school students buying food, which was kind of strange since it was right after the afternoon classes. I asked again and was told the remaining high school had adopted “military style management”, so the students were not allowed to go out during the break anymore.

“Military style” schools are trending in my hometown now. I really hate it. I think many parents just don’t want the responsibility of raising their kids, so they throw them into a school with strict rules. Chinese people value education very much. But a lot of them just like buying education, and think that’s enough. As if, once they’ve spent enough money on the kids and the kids still didn’t turn out well, they’ve done everything they can and it’s not the parents’ fault. However, they don’t realise that parents are a big part of education and no money can buy that part. And their standard for “good kids” is simple: get good test scores and be obedient. This explains why military style schools are so popular: the students spend little time at home so the parents feel like they have less responsibility, and it sure sounds effective for raising obedient kids. Not all schools have adopted that approach, so after-school classes are still a popular business even after the government abruptly banned the practice a few years ago. I even saw some “AI study rooms”, which is the weirdest AI hype business I’ve ever seen, and one of them spells “AI” as “Ai”. I asked my friends what it’s about, and apparently it’s just a place where parents can drop off their kids after school so that someone can supervise them while they finish their homework. Education is surely one of the reasons I immigrated to Canada: I don’t want my children to experience the kind of education I had, which has somehow even managed to get worse.

I also noticed the security at the schools seems to have been stepped up. There are barriers and police present in front of the school gate when parents pick up their kids after school. Maybe that is partly because of some random attacks that happened across China not long before, some of which targeted students.

After I took up retro computing as a hobby, I started to buy more from the Chinese used market since it’s much cheaper. This time I traveled with a Thinkpad X220 instead of my Thinkpad X1 Extreme, mainly to test how it works when travelling, but also as an opportunity to buy some parts for it. I upgraded it with more RAM and bought a new battery for it. It turned out great: not only can it handle everyday tasks like Internet browsing without any problem, I can also play some light games on Steam. Its small size makes it very easy to use on the road. Like I said in the last blog post about the year-end retrospective, I’ll write a more detailed blog post about this laptop.

Beijing

Last year when I went back, I only stayed in Beijing for one night. This time, I stayed slightly longer: one day. I took the high speed train to Beijing, which is a very common way to travel in China. It’s fast, cheap and much more comfortable than a plane. I really wish Canada would build some high speed rail so that we don’t need to go to the airport just for a one hour flight.

I was able to meet with my cousin and some friends there. The lunch I had with my cousin was unexpected: we found a hotpot place that was supposedly a very popular brand, one which both of us had eaten at multiple times in Beijing. But after ordering the food, we found out it was a knock-off: instead of XX Hotpot, it was actually Authentic XX Hotpot. Unsurprisingly, the food was terrible, which is a shame because I don’t get many opportunities to eat in Beijing anymore. But it’s also hilarious that two people who each lived in Beijing for almost a decade can still fall for that kind of scam.

Like me, my cousin started to hate living in Beijing after a few years. I think that’s common among people in Beijing. Unlike southern China, no other city in northern China is near the level of Beijing. So naturally, lots of people in northern China go there for opportunities. However, like I said in last year’s blog post about travelling back to China, Beijing is a city that doesn’t care about ordinary people and can suck the energy out of them. Because I got to stay in Beijing for longer this time, I experienced it more. After I left Beijing, I sometimes wondered why I didn’t visit more places when I lived there, especially since I worked from home during my last two years there. This visit reminded me why: Beijing is so large and its places are so spread out. It has an excellent subway, so people use it most of the time. My day there was mostly spent underground in the subway, where you cannot really see the city. Daily life like this makes people feel like they are just a part of a big machine. After many years in Beijing, the city was still like fragments to me. Only after I started to commute by bike, not long before I left, did I start to piece together the fragments and get a better sense of the city.

The subway may be the most used transportation method in Beijing, so it’s no surprise that there are lots of security checks in the stations. Other than the mandatory security checks at the entrance, police also check IDs randomly from time to time, which I encountered just after I arrived in Beijing. I just walked around them while they were checking someone else’s ID and nobody seemed to care. I also encountered two security checkpoints on the road during a short taxi trip. One was on the inner city highway and took up a full lane. The other was at the exit and blocked all the cars. It was evening rush hour, but the checkpoints didn’t seem to care about the congestion they caused or worsened.

The gap between lifestyles in Beijing can be dramatic. Inside the second ring road, there are lots of ancient and traditional buildings left. The important government institutions are also there. This is the Beijing most people from other places think of. However, in the real world, few people live inside the second ring road. There are lots of companies and universities between the 3rd and 5th ring roads; that area lacks some character but is more like a modern big city. Lots of people live outside the fifth ring road. At a glance, many places there are no different from an average small town in northern China. There is usually a big mall nearby for shopping and eating. Other than that, you mostly need to take the subway to get anywhere. I read the award-winning sci-fi novel Folding Beijing back when I was in Beijing. It depicts a fictional society where three classes are strictly separated. I didn’t think much of it when I read it. But visiting Beijing again, taking subways underground all day, I suddenly remembered it and found that its fiction is real on so many levels.

The night in Beijing was still cold in winter. But it was nice to have dinner with old friends. We met at a place that used to be the heart of China’s IT industry. The IT companies have since moved to a very remote area, and lots of big malls have taken over. I remember the place as a lively area but it was kind of quiet that night. Maybe it was because it was a weekday. I was told Covid also hit the businesses there hard. After dinner, I slept over at a friend’s place. He owns a home but is renting now because his wife was temporarily transferred to another location. It reminded me of the time when I was renting in Beijing. During my first months in Beijing, I was looking for a place to live with a friend. We took the subway to a potential place in a very rural-looking area. When my friend heard the landlord would pick us up from the subway station with a van, the kind that kidnappers usually use in movies, he was so afraid that he tried to convince me to turn back. Nevertheless, we rented that place in the end. When another friend first came to Beijing, I was about to move away so I gave him the place. He said it was so remote that he felt like he had already left Beijing the moment he arrived. I still have fond memories of the subway station there: standing in a rural place with dirt, grass, strong wind, and sometimes snow, without any obstacles, you can see the elliptical station from very far away. It is elevated high above the ground with the subway track going through it. The station is very new and shiny. It’s spacious and bright inside. During the early morning and late night, with the track and bridge hidden in the dark, the only thing you can see is the illuminated station floating in the air, with trains flying in and out slowly. It’s really like a space station on a newly developed planet. It’s the magic node connecting this area to the outside. People rushed in in the morning and packed the train so full that sometimes the station staff needed to push people onto the train. Then people rushed out in the evening, when the moon took over the sky and climbed above the station. When I had dinner with that friend again and talked about it, he told me the area has been developed a lot. There are lots of new buildings and businesses, partially because the IT industry in Beijing has moved in that direction. I wondered if the station can still be seen from far away.

Paris

Time goes so fast in China, and it was time to leave before we had even adjusted. We planned Paris as a detour on the way back. I always knew Paris is a nice city, and France has had so much influence on the world: its art, fashion, food, culture, politics and so on. But it was not until I saw lots of Paris city scenes while watching the Olympic Games last year that I started to want to actually visit it someday. I never thought the day would come so soon.

With the experience in London in mind, we carefully looked up the train we needed to take from the airport to the city. This time we didn’t take the wrong train, but instead witnessed a scam/robbery. When the train was approaching a station, three men wanted to get off. I was standing with the stroller where my daughter was sleeping, and they squeezed past me. In retrospect, I felt something was not quite right because they were kind of rude. But this was another country so I didn’t know what was normal. A couple, maybe in their sixties, stood not far away from me. When one of the men passed the couple, he dropped a coin and started to look for it by lifting their suitcases, and the other two men joined in to help. Soon the train stopped at the station and the doors opened. Suddenly, the men started to grab the bags and suitcases by force. Luckily, the couple protected their luggage well. Before the robbers could grab anything, they started to attract attention, so they were forced to give up and run. By the time I realized what had happened, they were already at the door. I thought: “Shit! My suitcase is at the door!” But maybe because the suitcase was too big and didn’t seem to be worth much, they just glanced at it and escaped without a second thought.

I’d heard Paris has a petty crime problem, but witnessing it right after arriving shocked me and really put me on guard. Luckily we didn’t have much trouble getting to our hotel. And we only saw one other similar incident in the following days, when two policemen were chasing another man near the Eiffel Tower.

Our hotel was in a prime location: just across the Seine from Notre-Dame. We followed a similar plan to the one we had in London: we visited the Louvre Museum on the first day. It’s hard not to compare it with the British Museum. Even though I heard the British Museum has a much larger collection than the Louvre, no one can deny that the architecture of the Louvre is much more majestic. I would say the building itself is one of its most amazing exhibits.

The weather was not so good there. It was raining on the day we arrived, then turned very cold on the second day. We visited Notre-Dame very early on the second morning, and just after we came out, it started to rain very heavily. The wind was so strong that it broke my umbrella, and I was soaked very soon without it. Just as I was feeling desperate and frustrated, it suddenly started to snow! I almost wanted to give up on the day and go back to the hotel. But the rain and snow stopped pretty soon, so we continued with the plan to walk around the city. After we reached the Arc de Triomphe and found a place to eat lunch, the sun started to come out. We headed to the Eiffel Tower after lunch. The clouds soon cleared up and the city was so bright. The good weather continued through the day. When we walked across the Seine in the evening, the sunset was so beautiful: the water reflected the red sun, with a rosy gold color on all the grand buildings nearby: the Conciergerie, Sainte-Chapelle, Notre-Dame and so on. Then we heard a soft song from a street singer on the bridge. At that moment, I thought Paris really is the capital of romance.

Paris is truly beautiful. During the days there, my amazement only increased as I explored more and more of the city. I’ve never seen such a large area with so much beautiful architecture in such a consistent style. It’s like being thrown into a totally different world, and it’s really a dream-like experience. Yet it also has so much history. When I visit Chinese cities, I can associate lots of historical events and figures with the places I visit, which gives me a much deeper feeling. Few foreign cities have had that effect on me (maybe I would count Kyoto as one of them). But in Paris, even though I have very limited knowledge of France, I could still associate lots of historical events and figures with the places I visited, because there are just so many of them and it’s hard not to know, or at least have heard of, some of them as long as you studied basic history and literature in middle school. That alone makes the city so much richer than just a “beautiful city”.

Overall the experience in Paris was very positive even considering the incidents I witnessed. The food was also surprisingly good. Somewhat like its history, the richness of the flavors is something I rarely get from non-Chinese food. The only thing I wish could have been better is how long I stayed there: three and a half days is just too little for a city like Paris. But it was already the most we could do for a vacation with such a packed schedule. And considering it was only a detour, which means we didn’t pay more for the flights compared to going back directly, we couldn’t ask for more.

New York

We transferred through New York again when we flew from Paris to Toronto. We spent lots of time at the Paris airport because the check-in agent couldn’t figure out our complex passports, visas and routes. The plane was also a little late, so the time left for transferring in New York was very tight. I’ve had so many connecting flights but this time was different: we needed to leave the boarding area when the plane arrived, and go through security again to re-enter. More than one bag triggered an alert during the security check, maybe because of the baby food and snacks we brought. So I was asked to do a full body check, which certainly didn’t help with the tight schedule. I was asked multiple questions to get my consent, only to find out it was the kind of search that is very common in some Chinese airports and train stations, where almost every passenger needs to go through it.

After packing the bags again following the search, I was so tired from the long trip that I miscalculated the time zones and thought we could barely make the next flight. So we ran to the gate, only to find out we had an additional hour. We didn’t pick up our checked luggage, because it is usually carried through to the final destination. But it turned out we needed to do so for this transfer. We only learned that after we arrived in Toronto and couldn’t find our luggage. I filled out a form to have it shipped to my home the next day, and went home without those suitcases.

I was so tired after coming back, and still dreamed of Paris for a few days after. However, the trip turned out very well and it makes me more determined to travel back every year. I also got some good rest and was able to find the time to write this blog post. Thinking about the trip, it’s almost like living two lifestyles: in Canada, time is slow, long and peaceful. I spend most of the time on work, research and my own small family instead of socializing. In China, time is fast. I spend almost all the time connecting with my extended family and friends, eating out and so on. Then there are also the exciting places to visit on the way. Hopefully I can continue to go back and record it every year for the foreseeable future.

Replace A Dead Node in My High Available Cluster

2025-02-01T00:00:00-05:00

In my previous blogs [1][2], I introduced my highly available cluster setup. It works really well: when 1 of the 3 nodes is down, the services either continue to be online, or can be recovered rather quickly (in the cases where I set the service replica count to 1 to avoid wasting resources). However, at the beginning of this year, one node went down not because of regular updates or a temporary shutdown for maintenance, but because its system disk died. While it’s annoying to replace the disk and bring the node back, it’s actually a good opportunity to verify that a dead node can be replaced in my setup. So I will note the steps down in this article. This will be a short one but it shows how easy it is.

What Was Lost?

The dead disk is the system disk. It has the OS, but also has the data for CockroachDB and ElasticSearch. However, since the data for CockroachDB and ElasticSearch is replicated across the cluster, it can be recovered from other machines.

The machine also has a separate disk for CephFS, but that disk was not lost. The data in CephFS is also replicated, so it should be recoverable from the other machines even if that disk had died. But that may need additional setup, like changing the disk UUID in Rook’s Kubernetes manifests.

Why Not Recover From Backup?

First of all, I don’t back up that often because I don’t feel the need, considering the data is replicated. Another reason is that I originally set up this machine for offsite online backup, then repurposed it for this HA cluster. I want to change the secure boot setup because the threat model is different, so it doesn’t need such a complex boot setup, which is not supported very well by mainstream Linux without TPM 2.0.

Since the data can all be recovered from other machines automatically, it would be easier to just install a fresh OS and some basic infrastructure so that all the service deployments and data can be auto recovered. This also simulates a dead node situation, so that I have more confidence for recovering from such failures in the future.

How to Recover?

Okay, here we are for the actual recovery steps. It’s very simple:

First, install the OS. Configure basic things like network IP address, ssh, etc. Install things like prometheus-node-exporter if you are using it on other machines.

The next step is to let the node join our Kubernetes cluster. Before that, we can remove the old dead node from the Kubernetes cluster with the command kubectl delete node ....

Then install k3s: Copy the config file under /etc/rancher/k3s/config.yaml from another machine and adjust the node IP and network interface config. Make sure the config has something like server: https://...:6443 so it will join the existing cluster instead of creating a new cluster. Check the k3s versions on other machines by using kubectl get nodes -o wide. Then install it with curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.1+k3s1 sh - assuming v1.31.1+k3s1 is the version.
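
As a rough sketch of the k3s step above, the install can be guarded so that it only runs when the copied config really points at the existing cluster. The function names and the K3S_CONFIG override are made up for illustration; only the config path, the server: entry and the install command come from the step above.

```shell
#!/bin/sh
# Sketch only. K3S_CONFIG and both function names are hypothetical helpers,
# not part of k3s itself; the default path is the file copied from another node.
K3S_CONFIG="${K3S_CONFIG:-/etc/rancher/k3s/config.yaml}"

# Succeeds when the config contains a top-level "server: https://..." entry,
# which is what makes k3s join the existing cluster instead of creating one.
config_joins_existing_cluster() {
    grep -Eq '^server:[[:space:]]*"?https://' "$K3S_CONFIG"
}

# $1 is the k3s version seen on the other nodes (kubectl get nodes -o wide),
# for example v1.31.1+k3s1.
install_k3s() {
    version="${1:-}"
    if [ -z "$version" ]; then
        echo "usage: install_k3s <version>" >&2
        return 2
    fi
    if ! config_joins_existing_cluster; then
        echo "no server: entry in $K3S_CONFIG, refusing to install" >&2
        return 1
    fi
    # Same install command as in the text, with the version passed in.
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="$version" sh -
}
```

Sourced as root on the new node, install_k3s v1.31.1+k3s1 would then behave like the one-liner above, but stop early if the server: line was forgotten.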

After k3s is installed, the k3s service should be enabled by default and the node should join our cluster automatically. If the hostname and IP address are the same as the dead machine, the Kubernetes cluster should automatically reschedule the services on to this machine. If there are some failed containers, check the log to see if it’s because the local directory for the storage is missing. In my case, I need to create the local directory for CockroachDB and ElasticSearch, and set the owner to 1000 for ElasticSearch.
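
For the missing local directories, a minimal sketch could look like the following. The DATA_ROOT location and the sub-directory names are placeholders I made up; the real paths are whatever the CockroachDB and ElasticSearch manifests mount from the host.

```shell
#!/bin/sh
# Sketch only. DATA_ROOT and the two sub-directory names are hypothetical;
# use the local paths that your own manifests mount into the pods.
DATA_ROOT="${DATA_ROOT:-/var/lib/cluster-data}"

prepare_local_dirs() {
    # Recreate the directories that lived on the dead system disk.
    mkdir -p "$DATA_ROOT/cockroachdb" "$DATA_ROOT/elasticsearch" || return 1
    # Hand the ElasticSearch directory to UID 1000 (needs root), the owner
    # mentioned in the text above.
    chown 1000 "$DATA_ROOT/elasticsearch"
}
```

After calling prepare_local_dirs as root, the failed containers should be able to start on their next restart.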

At last, we need to make sure CephFS is working. Make sure ceph and rbd can be loaded with modprobe. If so, add them to /etc/modules-load.d to load on boot:

cat /etc/modules-load.d/ceph.conf
ceph
cat /etc/modules-load.d/rbd.conf
rbd
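
The two files above can also be produced by a small helper, sketched here under the assumption that one file per module is wanted. MODULES_DIR and the function name are illustrative only; the default directory is the one used above.

```shell
#!/bin/sh
# Sketch only. MODULES_DIR is a hypothetical override so the helper can be
# tried against a scratch directory; the default is the real location.
MODULES_DIR="${MODULES_DIR:-/etc/modules-load.d}"

# Load each named module and, only if that works, write <module>.conf
# containing the module name so that it is loaded again on boot.
load_and_persist() {
    for mod in "$@"; do
        if modprobe "$mod"; then
            printf '%s\n' "$mod" > "$MODULES_DIR/$mod.conf"
        else
            echo "could not load $mod, not adding it to $MODULES_DIR" >&2
        fi
    done
}
```

Sourced as root, load_and_persist ceph rbd should leave the same ceph.conf and rbd.conf as shown above.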

A 2-Year Reflection for 2023 and 2024

2025-01-13T00:00:00-05:00

Here is another new year! I didn’t write a retrospective for 2023, so I’d like to combine it with 2024 in this one, since 2023 was a very important year for me. I planned to write it after coming back from my trip to China but never got the time. When I finally had the time, it was already too late to write a year-end review. I’m staying in China again now (at least when I first started this article; I was already back when I published it). Since there is nothing work related to worry about on vacation, I can use this time to review what happened in the past 2 years so it won’t slip through again.

A New Life

The most significant event in the past 2 years is that I now have a daughter. Having a baby changed my life so much that it felt overwhelming at first. I have always tried to keep my life simple and predictable, but a baby is the opposite of that. Before having the baby, I thought raising a baby was a science. While that is true on some level, it is also a kind of art. In theory, it’s about meeting the baby’s needs by observing her, like feeding her, changing her diaper and so on. But in reality, there is a lot of guesswork about what the baby needs when she is crying, and it’s very stressful if you cannot figure it out. And sometimes she just cries for no reason (e.g. baby colic). Also, sometimes you know what needs to be done, but it takes so much energy to do it. For example, in the first months, it’s really hard to get any good sleep, and that takes a big toll on everyday life and mental health.

However, after a few months, and especially once the baby is more than 1 year old, it’s really rewarding. Once she could babble words, run all over the place, and show that she cares about me, I felt everything was worth it. It’s such a special feeling to know you are raising a new life and are responsible for it. Because of that, I feel like my life is more meaningful and I value it much more than before.

The Old Project: RSS Brain

On the other hand, my old project RSS Brain continues to get improvements. I’m so happy about this project because I can put theory into practice after I learn something. And I continue to use it every day to get information from the Internet. Now I cannot imagine browsing the Internet without it. The feature set and backend were mostly stable in 2023. At the beginning of 2024, I started to rewrite the frontend from Flutter to web tech, mainly using htmx and Alpine.js with Scala generating the HTML from the backend. This was such an important decision for this project and it turned out perfectly. Not only does it feel native on the web platform, the code is also much cleaner since it’s mostly Scala now, the same as the backend. The performance is also much better since most of the HTML is rendered on the server, so the client needs little power compared to the Flutter one. Along the way I also developed some frontend libraries for the project’s needs and wrote some blog posts.

In the middle of 2024, I started to release the code of this project on a regular basis: I feel like the project is so cool that it would be a shame if only I could see the source code. There is still room for improvement on that side, since lots of documentation is still missing and it uses some of my other open source libraries that may not be released very properly. It would be great to clean it up so it’s easier to build and run on a fresh machine.

There are also some UI and quality-of-life features that can be improved further. This will still be a project I continue to work on and release on a regular basis. Sometimes working on these small features is more like gardening, which makes me feel peaceful and relaxed when too many other things are happening in life.

I may also add more AI features. I know AI is kind of a hype word now but that was the plan from the beginning. The name RSS Brain implies some analysis features for the feeds so you can get more insights from them. The features it has now, like search filters and related articles, are of this kind. With the advance of LLMs, more things are possible and I’m pretty excited about that. But before that, I need to find some proper infrastructure for it, which will be discussed in the following sections about databases.

Distributed System Infrastructure

With RSS Brain as a use case, I tried to set up my own infrastructure for a highly available system. It took shape in 2023 and the final version can be found in this blog post. It works so well that I didn’t change much in 2024. However, there is a very important part that’s still missing: upgrading the system components. Since the services and even some infrastructure like CephFS are running in containers, there is no single command you can run to upgrade all the packages. Some images can be so outdated that they pose security risks. I meant to write some tools for upgrades but didn’t find the time for it. I suspect I won’t have time for that in the coming year either, but I’m listing it here so that I don’t forget.

In Search of Databases

There are 2 things that motivated me to search for a new database: the license change of CockroachDB and the need for vector similarity search. This work took lots of my time in the second half of 2024. For the first point, I talked about it briefly in the blog post Jepsen Test on Patroni, which I’m very proud of. I may write another blog post about it too, so I will save the details for the future.

For the vector database, I think anyone that follows the AI trend recently would be all too familiar with that. There are 2 use cases for my projects: recommendation for RSS Brain using embeddings, and general RAG for LLMs.

I talked about the recommendation use case in the blog post Update on RSS Brain to Find Related Articles with Machine Learning: it’s now using Elastic Search for vector similarity search, but the performance is not so good on my less powerful cluster. So I’m searching for other solutions that have a lower memory footprint. And if possible, I’d like to use the same solution for full text search as well, so that I don’t need to run Elastic Search at all.

About general RAG for LLMs: I’ve been interested in neural network based machine learning since 2015. In my 2017 retrospective, I talked about the AI project I was working on. That’s when I read the paper Attention is All You Need, which is the foundation of today’s LLMs. I have always wanted to develop an RPG game with human-like AI and it finally seems to be technically possible. Along with some other tools I want to develop with LLMs in the coming year, a database that is capable of vector search is a must-have. I’ve already done lots of research in that area but don’t feel like I’ve reached a point where I can share it. So I’ll save it for a future blog post as well.

Blog Posts

In 2023, I added an index sidebar to my blog. It works really well. The blog posts from the past 2 years not only have better structure, but their quality is also better. My perspective has changed when I write a new blog post: instead of a one-off piece of writing that I may not read again very much, the index structure makes me feel like I’m filling holes in my knowledge graph, and it’s a forever ongoing project so I take more care with it. And like I mentioned in the previous blog post above, the structure also makes me aware of which areas I’m focusing on so that I can adjust if it’s not consistent with my goals.

Retro Hardware

I have always liked the aesthetic of old hardware. Mechanical parts, instead of a flat block with a touch screen, make the design much more interesting and fun to use. In the past two years, I bought some retro hardware and it became a hobby:

In the first months after my daughter was born, she had reflux, so I very often needed to hold her while she slept to keep her from spitting up milk. To make that time easier, I bought a retro mini handheld game console, the Miyoo Mini+, so I could play games while holding her. It can emulate lots of games up to the PS2 era. It opened up a new world to me because there are so many great games from the past that are still fun to play nowadays. I had already realised that good games are not only about good graphics. If we read classic books from many years ago, why shouldn’t we treat classic games the same way? After flashing the custom ROM OnionUI, I can also write some modern retro games for myself with platforms like TIC-80. My only complaint is that it doesn’t have Bluetooth built in, so I cannot use my AirPods (another device I bought to make the time holding the baby easier) with it.

Another big part of the retro hardware is related to radio: I bought some AM/FM analog radios, wide-band digital radios, as well as software defined radio devices. I got interested in it after reading the book The Knowledge: How to Rebuild Our World from Scratch. The idea that you can send and capture information through the air just feels so fun. Technically, modern tech like mobile networks and Wi-Fi still sends information through the air, but the digital signals and the many layers added on top make it less fun. Additionally, when it’s so easy to find the specific thing you want to consume nowadays, it’s very relaxing to consume passively, like with old-time TV channels. That’s kind of the idea I described in the blog post Random Playlists for Self Hosted Videos. But it’s even more relaxing with radios because you don’t even need to watch anything.

I also bought a second hand Nokia phone Nokia 5730. It’s very similar to my first phone Nokia 5320, but with a slide out keyboard. I wanted to play some old Symbian or Java mobile games but didn’t get much chance. It may stay in the drawer for a longer time because I have lots of other retro plans in the new year.

Lastly, I got a Thinkpad X220. This may be worth a blog post of its own (again), so I will be brief here. This is the laptop I have wanted ever since I graduated (the X200 before that). The company I was working for gave us Macbooks, which you cannot really complain about. In the following years, I also bought a Macbook as my personal laptop because it’s very hard to find a screen of the same quality on other laptops. In recent years, there were multiple times I wanted to buy a used X220 for various reasons, but I didn’t, mainly because of the price. I finally did it last year since I found the second-hand market in China is much cheaper than the one in North America. So I got one pretty cheap and I’m surprised how capable it still is. I meant to play some old Windows games on it but that seemed like such a waste, so I installed Linux on it. It works so well that I’m actually taking it on this trip instead of my Thinkpad X1 Extreme (gen1).

About old PC gaming: I dual-booted Windows on the Thinkpad X220, but then I found my old Thinkpad E40 at my parents’ place. So I will use the Thinkpad E40 for that purpose instead, which is pretty appropriate since it’s the first laptop I had. I played many games on it and there is no better device if I want to play more games from that era.

Conclusion

Overall, life is more structured even with a new baby in it. One factor may be that I’ve been in Toronto for a while and I’m used to the routine. The end of Covid is another big factor that put not only my life but a lot of other people’s lives back on track. I think a line in one of my poems encapsulates my past 2 years pretty well, but it’s hard to translate into English:

抱朴寻微末，结庐草太玄。

“抱朴” (Bao Pu) refers to the classical Taoist text 抱朴子. Literally, it means “embrace simplicity”. “太玄” (Tai Xuan) is another famous text in Chinese history, 太玄经. Literally, 太 means “too”, “supreme” or “great”. 玄 is a very special word in Taoism, which means mysterious, profound or difficult to understand. The author, Yang Xiong, spent a lot of time writing it, but it doesn’t really have any practical use and his work was not very mainstream in his era. So the line above can be roughly translated as:

Embracing the simplicity, I seek the profound and subtle.

In a humble hut, I write Tai Xuan in solitude.

Jepsen Test on Patroni: A PostgreSQL High Availability Solution

2024-12-02T00:00:00-05:00

Note: code used in this article can be found in the GitHub repo jepsen-postgres-ha.

I’ve used CockroachDB for a few of my side projects and enjoyed it overall. But since it announced a license change and now requires mandatory telemetry collection for the free version, I started to look for alternatives. The most natural choice is plain old PostgreSQL, since my data size is not that big and even a less powerful machine can handle it without any problem. One of my important requirements for the database is a good high availability setup, so that I can shut down a machine for maintenance from time to time. This series of blog posts will focus on PostgreSQL’s HA solutions rather than on why we need HA. That’s not to say the why isn’t important, but I’ll save that discussion for another blog post outside this series.

PostgreSQL doesn’t come with a native high availability solution. Instead, it has features like replication that help you build your own. But we all know distributed systems are hard to build and error-prone. So I’m planning to test different solutions before I trust them with my data: first using Jepsen to test correctness and, if a solution passes, benchmarking it to make sure it’s usable in the real world.

In the first part of this series, I’ll introduce the basic Jepsen test setup, then use my early test results from Patroni, a very popular PostgreSQL HA solution, as an example. In the test with Patroni, I was able to:

Reproduce a known issue that causes a violation of read committed isolation. This is related to a fundamental flaw in PostgreSQL’s replication implementation.
Observe the cluster failing to recover after losing 1 node out of 3.

Ideally I would like to do more tests and dig deeper, but I may not have enough free time in the coming month or two, so I’d like to record some results here and maybe post updates later.

Jepsen Test Setup

The tool used in the tests is Jepsen. For those who are not familiar with it, it’s a tool for testing the correctness of distributed systems. I highly recommend that anyone interested in distributed systems read its analyses, which have found bugs in almost every system it has tested, including PostgreSQL 12.3 in a single-machine setup. At a high level, it runs queries and checks whether the data is consistent at the end. At the same time, it has many built-in failures (nemeses, in Jepsen’s terms) that can be introduced while the queries run, like node crashes, network partitions, network slowdowns and so on. The analysis of PostgreSQL 12.3 already does an excellent job of explaining how Jepsen tests PostgreSQL, so I won’t repeat it here. I borrowed the append and read workload from that test, with 2 differences in the other parts:

The database is set up in a different way. The original test only covers single-machine PostgreSQL, and the bugs it found have already been fixed. So we are going to test an HA setup.
The original test was able to find bugs without introducing any failures. But since that bug has been fixed, we are going to enable different built-in failures, like node crashes and network slowness, to test whether the PostgreSQL cluster still behaves correctly.

To be more specific, I use Vagrant to create a 3-node virtual machine cluster and install Kubernetes (with k3s) on it. The Vagrantfile is here. It’s mostly from a previous project I created to test k3s, as described in the blog post Introduce K3s, CephFS and MetalLB to My High Avaliable Cluster. This setup makes future tests of different HA solutions convenient since most of them support Kubernetes: I can just create YAML files for the different systems and only need to implement Jepsen’s interface once to define database setup, teardown, kill and recover:

For setup, just use kubectl create -f .
For tear down, just use kubectl delete -f to delete the whole thing.
To kill the db, find the root k3s process and kill -9 it along with all its child processes. Make sure to also stop the systemd service so it will not be started again automatically.
To recover the db, simply start the k3s service again so that the pods are scheduled on the node again. This only makes sure the k3s service is started. More health checks would be needed to wait until the db has really recovered, but it’s good enough for now.

The related code is here.

The code is meant to support any HA setup as long as it can be defined with a Kubernetes manifest. It supports a --cluster flag that specifies which manifest to test. I created a single-node PostgreSQL setup and a Patroni setup for now. In reality, though, Patroni needs some special handling, like deleting PVs and endpoints. I’ll clean those things up when my focus moves to other HA solutions.

For the introduced failures, ideally we should test all the supported failures combined randomly. But the state space is large and takes a long time to explore. So I just created a specific combination to reproduce a known issue.

A Known Issue of PostgreSQL Replication

When I searched for PostgreSQL HA solutions and whether any of them had been tested with Jepsen, I found some comments on Hacker News saying that Patroni doesn’t guarantee consistency in some scenarios. That led me to a Twitter discussion, which stated that there is a fundamental flaw in PostgreSQL’s replication that makes it really hard to implement HA without data loss.

Here is the problem: usually with synchronous replication, a transaction should only be committed and visible after the replica db has persisted it, so that if the primary fails over to the replica, no data is lost. But in a special scenario, where the query is cancelled after the client has sent the commit command, PostgreSQL will consider the transaction committed even though it has not been replicated yet. So when a failover happens at this moment, this “committed data” will be lost from the clients’ point of view. Here is an example:

Time	Node 1	Node 2	C1	C2
1	Role: Primary; visible data: k -> [1]	Role: Replica; visible data: k -> [1]
2	Role: Primary; visible data: k -> [1]		T1 start
3	Role: Primary; visible data: k -> [1]		T1 append 2 to k
4	Role: Primary; visible data: k -> [1]		T1 commit
5	T1 replication started
6			T1 aborted (conn close? client kill?)
7	T1 replication not finished, but T1 is visible to other clients; Role: Primary; visible data: k -> [1,2]	Role: Replica; visible data: k -> [1]		T2 read k, result = [1,2]
8	Node crash	Role: Primary; visible data: k -> [1]
9				T3 read k, result = [1]
10			T4 append 3 to k
11		Role: Primary; visible data: k -> [1,3]
12				T5 read k, result = [1,3]

In the example above, the value of k is [1] at the beginning. C1 appends monotonically increasing values to k (it tracks the value locally instead of querying k every time). T1 is aborted before it is replicated, but the primary node still treats the transaction as committed. So when C2 queries with T2, it gets [1,2]. Then at time 8, the primary fails over from node 1 to node 2, so when T3 queries k, it returns [1] instead of [1,2]. From our point of view this is obvious data loss because we know the exact order of events. One could argue that this only breaks linearizability and not serializability: technically, T3 can be ordered before T2 or even T1 and the history is still consistent. However, T5, with its result of [1,3], creates a situation that conflicts with T2:

If T2 is before T5, T5 should have 2 in the result.
If T5 is before T2, T2 should have 3 in the result.

This is a violation not only of serializability but also of read committed, because from the client’s point of view T2 has read uncommitted data.
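The timeline above can be replayed with a toy model. This is only a sketch in Python, not Jepsen or PostgreSQL code: the Node class, replay_timeline and incompatible_reads are hypothetical names made up for illustration. For an append-only list, any two reads must be prefix-compatible, and the check below shows that T2 and T5 are not.

```python
from itertools import combinations


class Node:
    """Toy model of one PostgreSQL node holding a single append-only list k."""

    def __init__(self, values):
        self.visible = list(values)


def is_prefix(a, b):
    """True when list a is a prefix of list b."""
    return a == b[: len(a)]


def incompatible_reads(reads):
    """Return the pairs of reads where neither list is a prefix of the other."""
    return [
        (name1, name2)
        for (name1, r1), (name2, r2) in combinations(reads.items(), 2)
        if not (is_prefix(r1, r2) or is_prefix(r2, r1))
    ]


def replay_timeline():
    node1, node2 = Node([1]), Node([1])  # time 1: primary and replica agree
    primary = node1
    reads = {}

    # time 2-7: T1 appends 2. The commit is cancelled before replication
    # finishes, yet the primary makes it visible. node2 never receives it.
    primary.visible.append(2)
    reads["T2"] = list(primary.visible)

    # time 8-9: node 1 crashes and node 2 is promoted.
    primary = node2
    reads["T3"] = list(primary.visible)

    # time 10-12: T4 appends 3 on the new primary and T5 reads it.
    primary.visible.append(3)
    reads["T5"] = list(primary.visible)
    return reads


reads = replay_timeline()
print(reads)                      # {'T2': [1, 2], 'T3': [1], 'T5': [1, 3]}
print(incompatible_reads(reads))  # [('T2', 'T5')]
```

T3 is compatible with both other reads because [1] is a prefix of each, which is why T3 alone does not break serializability. Only the T2/T5 pair has no valid order.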

Patroni Setup for Testing

Even though this is a known and documented issue, I still tried to reproduce it in my test for a few reasons. First, I want to make sure my test is good enough to actually reproduce it. Second, I want to see it happen in the real world: Patroni’s automatic failover makes triggering this problem manually hard because there is only a short window before the commit is replicated.

In my test, I tried to set up Patroni so that it prioritizes consistency above all. The config is here for the Docker entrypoint script and here for the Kubernetes config. The PostgreSQL version is 16 and the Patroni version is v4.0.3.

The key configurations are about the replication mode. The description of each parameter below is copied from the Patroni documentation on replication modes:

synchronous_mode is set to on: When synchronous_mode is turned on Patroni will not promote a standby unless it is certain that the standby contains all transactions that may have returned a successful commit status to client. Turning on synchronous_mode does not guarantee multi node durability of commits under all circumstances. When no suitable standby is available, primary server will still accept writes, but does not guarantee their replication.
synchronous_mode_strict is set to on: When it is absolutely necessary to guarantee that each write is stored durably on at least two nodes, enable synchronous_mode_strict in addition to the synchronous_mode. This parameter prevents Patroni from switching off the synchronous replication on the primary when no synchronous standby candidates are available.
synchronous_node_count is left to default as 1: The parameter synchronous_node_count is used by Patroni to manage the number of synchronous standby databases. It is set to 1 by default. It has no effect when synchronous_mode is set to off. When enabled, Patroni manages the precise number of synchronous standby databases based on parameter synchronous_node_count and adjusts the state in DCS & synchronous_standby_names in PostgreSQL as members join and leave. If the parameter is set to a value higher than the number of eligible nodes it will be automatically reduced by Patroni.

In PostgreSQL:

synchronous_commit: "on"
synchronous_standby_names: "*"
max_connections: 500

As stated in the Patroni doc, even with this setup it still has the known issue described above:

Note: Because of the way synchronous replication is implemented in PostgreSQL it is still possible to lose transactions even when using synchronous_mode_strict. If the PostgreSQL backend is cancelled while waiting to acknowledge replication (as a result of packet cancellation due to client timeout or backend failure) transaction changes become visible for other backends. Such changes are not yet replicated and may be lost in case of standby promotion.

This is the thing I want to reproduce.

Reproduce Read Committed Violation

Reproducing this failure was harder than I thought, even though I knew the exact requirements to trigger it from the beginning. A few factors contributed to this:

First, if the client doesn’t abort the connection itself, it’s hard to reproduce this scenario: the connection would then have to be aborted by the primary node, which requires introducing some failure on the primary, and that most likely makes it fail over immediately, before the failure scenario is triggered. Jepsen’s built-in failures/nemeses are mostly on the server side. Since I’m not familiar with Clojure, it took me some time to figure out how to abort the connection just after sending the commit command. The code is here:

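(if (and break-conn (not read-only?))
  (let [result-chan (chan)
        close-chan (chan)]
    ;; run the transaction in the background and hand back its result or exception
    (go (>! result-chan (try (run) (catch Throwable e e))))
    ;; after 120ms, close the connection underneath the running transaction
    (go (<! (timeout 120))
        (try (c/close! conn) (catch Throwable e e))
        (>! close-chan true))
    ;; block until the connection is closed, then until the transaction returns
    (<!! close-chan)
    (let [result (<!! result-chan)]
      (if (instance? Throwable result)
        (throw result)
        result)))
  (run))))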

break-conn is a boolean produced from a random number. If the transaction is not read-only, the client closes the connection after a timeout of 120ms. As I will discuss later, I introduced a network slowdown to make the round trip about 100ms on average, so a completely replicated transaction should take 100ms (client to primary) + 100ms (primary to replica) = 200ms. So 120ms will hopefully make the connection abort after the commit command has been sent to the primary but before the transaction is replicated. This is definitely not a perfect approach, but it’s the best I can do for now.
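The timing argument can be written down as a small check. This is a rough sketch in Python under the same simplification as the paragraph above (fixed latencies, no jitter). The function name and its parameters are made up for illustration and are not part of the test code.

```python
def closes_in_vulnerable_window(close_after_ms, client_to_primary_ms, primary_to_replica_ms):
    """True when the connection is closed after the commit reached the primary
    but before the transaction has been replicated."""
    commit_arrives_at = client_to_primary_ms
    replicated_at = client_to_primary_ms + primary_to_replica_ms
    return commit_arrives_at <= close_after_ms < replicated_at


# The numbers from the text: 120ms timeout, 100ms per hop.
print(closes_in_vulnerable_window(120, 100, 100))  # True

# If the commit needs more than 120ms to reach the primary, the window is missed.
print(closes_in_vulnerable_window(120, 130, 100))  # False
```

Since 100ms is only an average, real runs will miss the window some of the time, which matches how hard the failure is to trigger.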

Second, the nemesis package I copied from the original test is not well suited to reproducing this issue: it randomly introduces different failures (passed in by CLI flags) at a predefined average interval. So it takes some luck to get the scenario that triggers this failure: a combination of slow network + primary failure. After studying the failure scenario again, I changed the nemesis suite to slow the network down all the time and to run a 90-second cycle for primary failure (30 seconds in a healthy state + 60 seconds with the node killed).

Third, the default workload creates multiple keys and runs multiple ops in a single transaction, which makes it harder to trigger the exact scenario (the transaction’s network round trip time varies with the number of ops, so 120ms is less likely to close the connection right after the commit). After thinking more about the requirements to trigger the failure, I tuned the parameters to operate on a single key most of the time and to have only a single operation per transaction.

The other factors are mostly about my VM + Kubernetes setup. For example, I needed to figure out how to do things like crash the node (the k3s root process + its child processes). I also needed to adjust the network interface name in the network-related failures, since the VM created eth1 instead of the eth0 hard-coded in Jepsen.

Lastly, you need a little bit of luck: the failure is hard to trigger, so my test doesn’t reproduce it every time. Just before I decided to give up, it showed the error. The command to reproduce it is in the readme:

for i in `seq 1 10` ; do
  lein run test --nodes-file ./nodes --username vagrant -w append --concurrency 10 --isolation serializable --nemesis packet,kill --time-limit 1800 -r 100 --nemesis-interval 60 --break-conn-percent 0.8 --cluster patroni --key-count 1 --max-txn-length 1 --max-writes-per-key 24000 --nemesis-suite slow-net-kill
  sleep 30
done

Some key params:

--time-limit 1800: run the test for 1800 seconds.
-r 100: send 100 queries per second.
--break-conn-percent 0.8: 80% of the append transactions will be closed after 120ms.
--key-count 1 --max-txn-length 1 --max-writes-per-key 24000: operate on a single key until the operations exceed 24000 times.

As stated above, the test slows the network down the whole time to an average round trip of 100ms. At the same time, it runs this loop: wait 30 seconds, kill the primary node by killing the k3s process and all its child processes, wait 60 seconds, then start the k3s service again.
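To get a feeling for how many failover attempts one run contains, the loop can be sketched as a schedule. This is an illustrative Python sketch, not the Clojure nemesis code: kill_cycle and the event names are hypothetical, and it ignores whatever final recovery step Jepsen runs when the time limit is reached.

```python
def kill_cycle(time_limit_s, healthy_s=30, killed_s=60):
    """List the (second, event) pairs of the 30s healthy + 60s killed loop."""
    events = []
    t = 0
    while True:
        t += healthy_s
        if t >= time_limit_s:
            break
        events.append((t, "kill-primary"))
        t += killed_s
        if t >= time_limit_s:
            break
        events.append((t, "start-k3s"))
    return events


schedule = kill_cycle(1800)
print(schedule[:3])  # [(30, 'kill-primary'), (90, 'start-k3s'), (120, 'kill-primary')]
print(sum(1 for _, event in schedule if event == "kill-primary"))  # 20
```

So a single 1800-second run gives about 20 chances for a cancelled commit to coincide with a primary kill.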

Finally, the outer for loop runs the command 10 times so there is a higher chance of triggering the failure.

As for why the test runs 10 times instead of as a single round of 300 minutes: it’s related to another problem I found during the test, which is discussed in a later section.

An example of the failure from a recent run:

:workload {:valid? false,
            :anomaly-types (:incompatible-order),
            :anomalies {:incompatible-order ({:key 0,
                                              :values [[4
                                                        6
                                                        8
                                                        9
// ... omitted lines of numbers
                                                        7164
                                                        7171
                                                        7176
                                                        7181]
                                                       [4
                                                        6
                                                        8
                                                        9
// ... omitted lines of numbers, the same as before
                                                        7164
                                                        7171
                                                        7176
                                                        8433]]})},
            :not #{:read-committed},
            :also-not #{:causal-cerone
                        :consistent-view
                        :cursor-stability
                        :forward-consistent-view
                        :monotonic-atomic-view
                        :monotonic-snapshot-read
                        :monotonic-view
                        :parallel-snapshot-isolation
                        :prefix
                        :read-atomic
                        :repeatable-read
                        :serializable
                        :snapshot-isolation
                        :strong-read-committed
                        :strong-serializable
                        :strong-session-read-committed
                        :strong-session-serializable
                        :strong-session-snapshot-isolation
                        :strong-snapshot-isolation
                        :update-atomic
                        :update-serializable}},
 :valid? false}

If you check the last 2 numbers of the 2 transactions listed above, one transaction reads [..., 7176, 7181] and the other reads [..., 7176, 8433]. This reproduces the exact problem we discussed in the last section.

One thing to note: even though this failure behaves the same way as the one we discussed in the last section, it doesn’t necessarily have the same root cause. I say that because this failure is so hard to trigger and I’m not sure yet what exactly triggered it. This is another thing I want to dig deeper into but may not have the time for in the near future.

Failed to Recover the Cluster When Only 1 Out of 3 Nodes is Lost

Jepsen generates some graphs after the test, so you can see how the database behaved during it. The following graph is one of them. It shows the latency of each transaction and the time ranges of the introduced nemeses:

The green area at the top marks the time when the network is slowed down, which in our case is basically the whole test. The red part is when the primary node is killed by our test: the lighter red part means the test has started recovering the killed node, until it reaches the white part, which means the node has recovered (we mark it as recovered as soon as the k3s server is back, so it’s not that accurate).

Each square/point represents a transaction. You can see that even when the primary node is not killed, there are still some failed transactions, and that’s totally normal in our test: it hits errors like conflicting transactions, max connections reached, etc. What matters more is whether there are successful transactions, which is what the blue squares/points mean.

We can see that during the first few rounds of the primary getting killed, the cluster can recover with 2 of 3 nodes available (the red range). However, as time passes, it can only recover when all the nodes are available (the white range). At the end of the test, 2 of 3 nodes are available (sometimes all 3 are, because the recovery time is not always the same, so the end state of the test can differ). Just out of curiosity, I wanted to see how much time Patroni needs to recover in such a state, so I left it without bringing the failed node back and waited. However, after 20 minutes, the cluster had still not recovered.

This means the cluster fails to recover even if you lose only a minority of the nodes. After I asked in the GitHub issue, it was confirmed that Patroni cannot heal automatically from such a state. Apparently this is a known behaviour, but it came as a surprise to me. To be fair, the behaviour of the parameters is documented, but it’s hard to realize the implications. Quoting again from the descriptions above:

synchronous_node_count: If the parameter is set to a value higher than the number of eligible nodes it will be automatically reduced by Patroni.

It’s not clear what “eligible nodes” means, but it seems to mean available nodes rather than all nodes in the cluster.

synchronous_mode_strict: When it is absolutely necessary to guarantee that each write is stored durably on at least two nodes …

So it’s two nodes instead of a majority of the nodes: if you have a 5-node cluster and happen to lose the 2 nodes with the most up-to-date data, it’s not possible to recover the data anymore.

Checking the doc again, quorum commit mode seems like something that could help here. But a test with synchronous_mode=quorum still got the same result. And from the doc:

On each iteration of HA loop, Patroni re-evaluates synchronous standby choices and quorum, based on node availability and requested cluster configuration. In PostgreSQL versions above 9.6 all eligible nodes are added as synchronous standbys as soon as their replication catches up to leader.

From the test, it seems the availability of nodes also affects the quorum.

Back to the question of why the test needs to run multiple times instead of as a single longer test: the cluster doesn’t recover by itself from a node loss after a few kill loops, which means more time is needed for healing between node kills, making the test less efficient. Also, when I waited long enough between node kills, I was not able to reproduce the read committed violation anymore. That’s another reason I doubt whether it is really caused by the early connection close.

Ways to Improve?

In a quorum-based system, $V_w$ is the minimum number of nodes a write transaction must write to, and $V_r$ is the minimum number of nodes a read transaction must read from. In order to maintain serializability, $V_w + V_r > V$ needs to hold, where $V$ is the total number of nodes, so that the $V_w$ and $V_r$ node sets overlap in at least 1 node. That means when a client reads data from $V_r$ nodes, at least 1 of them has the latest data. In our case, for normal read transactions, $V_r$ doesn’t matter since they only read from the primary, which is guaranteed to have the latest data. But when doing a failover, we need to make sure $V_r$ nodes are available, because the primary is lost and we need to determine which node has the latest data.

In the case of Patroni, since synchronous_node_count can be automatically reduced and synchronous_mode_strict only guarantees that data is written to at least 2 nodes, $V_w$ is essentially 2. To maintain consistency, $V_r > V - V_w = V - 2$ needs to hold, which means the cluster only tolerates the loss of 1 node no matter the cluster size. But even with only 1 node lost in our test above, Patroni did not fail over automatically.
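The arithmetic behind “only tolerates the loss of 1 node no matter the cluster size” can be checked with a few lines. This is a plain Python sketch of the inequality $V_w + V_r > V$, with made-up function names. It does not model anything Patroni-specific.

```python
def min_read_quorum(total_nodes, write_quorum):
    """Smallest V_r that satisfies V_w + V_r > V."""
    return total_nodes - write_quorum + 1


def tolerated_node_loss(total_nodes, write_quorum):
    """How many nodes can be lost while V_r nodes are still available."""
    return total_nodes - min_read_quorum(total_nodes, write_quorum)


# With V_w fixed at 2, the tolerated loss stays at 1 for every cluster size.
for v in (3, 5, 7):
    print(v, min_read_quorum(v, 2), tolerated_node_loss(v, 2))

# With a majority write quorum, a 5-node cluster would tolerate losing 2 nodes.
print(tolerated_node_loss(5, 5 // 2 + 1))  # 2
```

The tolerated loss is always $V_w - 1$, so the only way to survive more failures is to make every write wait for more nodes.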

So to better tolerate node loss, there should be an option similar to synchronous_node_count that actually enforces the minimum synchronous replica count instead of reducing it based on node availability. And if the available nodes meet the $V_r$ requirement, do the automatic failover by comparing the largest LSN on each node.
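A minimal sketch of that proposed failover rule, in Python. This is a hypothetical illustration of the idea and not how Patroni works: pick_failover_candidate, the node names and the LSN values are made up. The only PostgreSQL fact it relies on is the textual LSN format, two hexadecimal numbers separated by a slash (the high and low 32 bits).

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '0/3000060' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_failover_candidate(reported_lsns, total_nodes, write_quorum):
    """Return the available node with the largest LSN, or None when fewer
    than V_r nodes are available and the latest data may be missing."""
    read_quorum = total_nodes - write_quorum + 1
    if len(reported_lsns) < read_quorum:
        return None
    return max(reported_lsns, key=lambda node: parse_lsn(reported_lsns[node]))


# 3 nodes, V_w = 2, the old primary is gone and two replicas report their LSN.
print(pick_failover_candidate({"node2": "0/3000060", "node3": "0/30000A8"}, 3, 2))  # node3

# Only one replica is reachable: V_r = 2 is not met, so no automatic failover.
print(pick_failover_candidate({"node2": "0/3000060"}, 3, 2))  # None
```

Refusing to fail over below $V_r$ trades availability for consistency, which is the same priority the Patroni configuration above was meant to express.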

Minor Issue: Wrong Role Label for Kubernetes Pod

Lastly, there is a minor issue, which was also the first issue I found during the test: the Patroni doc uses the command kubectl get pods -L role -o wide to show the role of each Patroni pod. However, the label can be inaccurate, as confirmed in the GitHub issue. It’s not a big deal, but it is something to be aware of when operating Patroni. I think it could theoretically be fixed by letting the primary pod set the k8s labels for all the other pods.

What’s Next?

Ideally, I still want to dig deeper into testing Patroni since it’s a very popular PostgreSQL HA solution. The test above is only a carefully crafted scenario based on a known issue. Running larger-scale tests with more combined failure scenarios may uncover more failure modes. However, because of the fundamental PostgreSQL replication flaw described above and the effort needed to run large-scale tests, I may want to set up and test another solution first.

The solution is one I had in mind even before I started the Jepsen test of Patroni: set up replication with DRBD. Instead of using PostgreSQL’s replication, DRBD replicates the whole disk. With modern hardware, the performance with the replication overhead should be acceptable, but that remains to be tested, along with its correctness.

SBT Task to Build Frontend Components

2024-09-13T00:00:00-04:00

Even when a website renders its content on the server with something other than Javascript, it’s sometimes inevitable to have some Javascript or CSS code. So managing Javascript dependencies and building packages is needed. The easiest way may be to not use any tool: download all the dependency files into a directory and import them in the HTML file directly. That’s what I was doing for RSS Brain before. But it gets messy pretty quickly and it’s hard to keep track of the dependencies. So it’s time for me to solve the problem. Since the project is written in Scala, I’ll note down how I do it with Scala’s build tool, SBT.

Frontend Package Management and Build

I put all the frontend related code into a separate sub-directory and treat it like a frontend project. This makes things much easier and less hacky. I use npm to manage the dependencies and use webpack to build it. Here is a simplified example of the code tree structure from my project RSS Brain:

▾ js/
  ▾ css/
      google-fonts.css
      main.css
      pico.jade.min.css
  ▾ dist/
      f20305dee9d396fea5c7.ttf
      f5ef242406fdcf40a232.otf
      main.css
      main.js
      main.js.LICENSE.txt
  ▾ fonts/
      google-material-icons-outlined.otf
      google-material-icons.ttf
  ▸ node_modules/
  ▾ src/
      boolean-checkbox.js
      error-handler.js
      global-htmx.js
      index.js
      match-id.js
      popover-menu.js
      register-service-worker.js
      service-worker.js
      set-theme.js
      source-images.js
    package-lock.json
    package.json
    readme.md
    webpack.config.js
▸ project/
▸ src/
  build.sbt
  LICENSE.txt
  readme.md

You can see that, other than the js directory, it’s a pretty standard structure for a Scala project managed by SBT.

Looking into the js directory, it’s a frontend project managed by npm and built with webpack.

js/src/index.js bundles all the dependencies in node modules and local files. Here is an example:

// css

import 'somment/somment.css';
import 'lite-youtube-embed/src/lite-yt-embed.css';
import 'toastify-js/src/toastify.css';
import '../css/google-fonts.css';
import '../css/pico.jade.min.css';
import '../css/main.css';

// js
import './boolean-checkbox.js';

import 'htmx.org';
import './global-htmx.js';

import Alpine from 'alpinejs';
window.Alpine = Alpine;

import * as FloatingUIDOM from '@floating-ui/dom';
window.FloatingUIDOM = FloatingUIDOM;

import 'lite-youtube-embed';
import '@splidejs/splide';
import Toastify from 'toastify-js';
window.Toastify = Toastify;

import DOMPurify from 'dompurify';
window.DOMPurify = DOMPurify;

import 'imgs-html';
import 'somment';

import './error-handler.js';
import './popover-menu.js';
import './match-id.js';
import './set-theme.js';
import './source-images.js';
import './register-service-worker.js';

Alpine.start();

Here is an example of webpack.config.js:

const MiniCssExtractPlugin = require("mini-css-extract-plugin");

module.exports = {
  module: {
    rules: [
      {
        // If you enable `experiments.css` or `experiments.futureDefaults`, please uncomment line below
        // type: "javascript/auto",
        test: /\.(sa|sc|c)ss$/i,
        use: [
          MiniCssExtractPlugin.loader,
          "css-loader",
          "postcss-loader",
        ],
      },
    ],
  },
  plugins: [new MiniCssExtractPlugin()],
};

Since this is more related to frontend tech and is very basic, I will not go into too much detail. The point is that running npx webpack under the js directory builds the bundled files into js/dist. We will write an SBT task to trigger this command and copy the dist files into the resources to be packaged.

SBT Task to Trigger Build and Package Dist Files

SBT is very flexible since you can basically write Scala code to define the tasks. Here we define the first task to install npm dependencies and trigger webpack build (in build.sbt):

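// needed for Process, #&& and ! (put it at the top of build.sbt)
import scala.sys.process._

lazy val webpack = taskKey[Unit]("Run webpack in js directory")
webpack := {
  val workDir = new File("./js")
  // run `npm install && npx webpack` in js/ and fail the task on a non-zero exit code
  val exitCode = (Process("npm" :: "install" :: Nil, workDir) #&& Process("npx" :: "webpack" :: Nil, workDir)).!
  if (exitCode != 0) sys.error(s"webpack build failed with exit code $exitCode")
}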

It defines a task called webpack, so when you run sbt webpack, it will run npm install && npx webpack under js.

Then we define another task to copy all the dist files to generated resource directory:

Compile / resourceGenerators += Def.task {
  webpack.value
  val file = (Compile / resourceManaged).value / "webview" / "static" / "dist"
  IO.copyDirectory(new File("./js/dist"), file, overwrite = true)
  IO.listFiles(file).toSeq
}.taskValue

Here we added some steps for when SBT generates resource files: first it runs the webpack task we defined above, then it copies all the files under js/dist to webview/static/dist under the generated resources. Here, resources means Java resource files, like the files under src/main/resources, but auto-generated into target/scala-2.13/resource_managed and packaged together with the other resource files.

So when you run sbt package, the generated jar will include all those files as resource files. For example, in my project, the generated jar contains these files if you open it with vim (which can browse zipped packages):

webview/static/dist/f20305dee9d396fea5c7.ttf
webview/static/dist/f5ef242406fdcf40a232.otf
webview/static/dist/main.css
webview/static/dist/main.js
webview/static/dist/main.js.LICENSE.txt

Serve Resource Files in Http Server

Now you can serve the files under webview/static/dist with your web server. Different web servers and frameworks do it differently. Here is an example with http4s:

// include the following route into the http4s web server
// IMPORTANT: every resource file under `/webview` will be publicly accessible
val assetsRoutes = resourceServiceBuilder[IO]("/webview").toRoutes

Then you can use them in HTML:

<link rel="stylesheet" href="/static/dist/main.css">

My MacOS Essentials

2024-08-07T00:00:00-04:00

As a long-time Linux and KDE user, I’m pretty uncomfortable with the MacOS workflow even though I have used MacOS for a fairly long time as well. A lot of companies don’t support Linux on development laptops. Even at companies that do support Linux, the hardware offered for Linux is usually far worse than Macbooks. So MacOS is often the best or even the only choice for work. This is still the case at my new job. I think it’s a good opportunity to write a blog post about my MacOS setup. It can serve as a note for myself when I need to set up a fresh MacOS install again in the future.

My Complain about MacOS Desktop Environment

Everyone has different taste and needs about desktop environment and I respect that. The following is just based on my own preference. If you happen to have the same pain points, the setup may help you. Otherwise I find it’s pretty inspiring to see how other people work as well even though I may never work like that.

I mostly just use these apps for work:

A terminal. I use iTerm2 for this. I usually use tmux to manage “windows” in the terminal so I usually don’t open multiple iTerm2 windows.
IDE. Usually IntelliJ IDEA or other JetBrains products.
Browser: Firefox.
Team collaboration software like Slack and Zoom.

Most of this software is cross platform so I don’t have many complaints about the software itself. The things I want to change are in the desktop environment itself.

There is one thing in MacOS that I wouldn’t get used to in a million years: the logic of grouping windows by app. It results in these problems:

First, it needs a different keyboard shortcut to switch between windows of the same app. It just adds unnecessary complexity. Especially with my HHKB keyboard, the ~/` key is far away from the Tab key: it’s at the top right corner. And it’s hard to see at a glance what windows are available.

Talking about seeing what windows are available, the dock doesn’t do a good job either. You can only see which apps are open. And I don’t feel it’s doing a good job even for that. Usually I just end up with lots of open windows/apps that are no longer needed, and it’s hard to keep track of them without a proper panel that shows all the windows.

Make It More KDE/Windows Like

So my goal here is to make it more KDE/Windows like, which means:

Use the same keyboard shortcut to cycle through all the windows, without grouping the windows by app.
Have a panel that shows all the windows. Again, without grouping by app.
Nice to have: use the keyboard to snap windows to the left/right or maximize them.

I don’t need the “start menu” since I usually just open apps by bringing up the searchable launcher: Spotlight in MacOS and KRunner in KDE.

So here is a list of software that achieves my needs:

AltTab: cycle through all the windows without grouping by app.
Rectangle: window snapping and keyboard shortcuts.
uBar or sidebar: KDE/Windows like panel bar to show all windows.

Other Quality of Life Improvements

There are two other pieces of software I find very useful even though they are not related to the workflow above.

First, noTunes. It prevents Apple Music from launching. I find it very annoying that Apple Music pops up when I accidentally press some button or touch my AirPods. I don’t even know what triggers it. This software solves the problem perfectly.

The second one is Karabiner-Elements. This is a very powerful custom key mapping tool. But I mainly use it to support two keyboards at the same time. That is a very niche personal need: I use two identical keyboards as a split keyboard. I may write more on that in a future blog post. But the point is, MacOS doesn’t support two keyboards at the same time very well and this software solves that.

Source Code of RSS Brain is Available

2024-07-27T00:00:00-04:00

This article is also posted at RSS Brain blog.

When I first published RSS Brain, I promised the source code would be released (well, I actually said “open source”, but more on that later). After I rewrote the whole Flutter frontend in Javascript, most of the code now lives in a single source repo, so I feel comfortable releasing it. So here it is on Github.

There are two things you may notice from the source code:

The commit history is mostly missing.
The code license is not an open source license.

I’ll talk about the most important one first: the code license.

Code License

RSS Brain’s source code is released under SSPL, the Server Side Public License. I don’t want to use “open source” as a marketing point for RSS Brain so I must make this clear first: technically, RSS Brain is source available software, not open source software, since SSPL is not recognized as an open source license.

SSPL is mostly the same as AGPL v3, but with a key difference: it requires the user to release the source code of the whole stack if the project is used commercially. If you want to run the code on your own server and use RSS Brain freely, it’s all good. You can even share your server with family and friends. But as long as you start to charge money for that service, you need to release the source code of everything you use for the service, including things like OS, CI/CD, web server and so on. So it basically makes it impractical to use the source code commercially. I chose that feature on purpose.

The Purpose

Before I explain why I chose this license, I must explain the reasons for making RSS Brain’s source code available.

Transparent Algorithm

In a past blog post What Is Wrong about Recommendation System, I mentioned I don’t want to be manipulated by recommendation systems. And that’s one of the main motivations for me to start writing my own RSS reader. While there are still ranking and recommendation algorithms in RSS Brain, they are aimed at providing better information instead of making the user more addicted to the product. In order to prove that, the algorithm needs to be available so that users can inspect it and decide whether it’s the right one for them.

Be aware that even though the source code is available, it still needs some level of trust, since the code running on my server app.rssbrain.com can theoretically be different from what is released. But it’s good enough for most people. However, if you want absolute control, you can always run it on your own server with the available source code.

No Vendor Lock-in

Another important benefit is that the user can expect the software to last. Even if I don’t host the service anymore, there is always a way to continue using it since the code is available. Yes, it’s not commercial friendly, but if the software turns out to be really useful and attracts enthusiasts, I believe someone else will continue to maintain it for free. I think this property is critical for any product that needs to be used every day and becomes an important part of your digital life.

No Free Commercial Usage

The next benefit is not for the user, but for myself. I want users to be able to self host the service for free, but I don’t want other people to take my source code for free and earn money from it. I think that’s reasonable, especially considering I also provide a paid hosted solution at app.rssbrain.com.

Considerations of Contributors

One big advantage of an open source project is that it can attract contributors to make the software better. And that can sometimes justify free commercial usage because all the competitors are contributing to the software. But because this is software that I use daily and will keep using, I want to have 100% control of its roadmap. Not only the product aspect, but also the technology aspect. It’s just easier to write everything by myself, at least for now. So I’m fine with choosing a non open source license even if it means fewer potential contributors.

Release Process

You may notice the source code has very little commit history. The release process will produce only one commit for each release. The release cycle will be one release every few weeks, depending on how much progress I make. The regular releases will mostly happen on weekends. If there is a bug or a security risk, releases may be more frequent.

The version number is in the format X.Y.Z, where Y is increased for every feature release and Z is increased for every bug fix. X is only increased for breaking changes or really major updates.

I’ll keep my hosted version at app.rssbrain.com the same as the source code. That means at each release, I’ll update the app first, and release the source code just after it. I’ll add a section in the app’s settings page to indicate the current version.

The reason I chose this release mode is the same reason I released it under SSPL. I only want the source code to be available to users, and I don’t really care whether there will be contributions from other people. So hiding the commit history between releases just makes my life easier, since I don’t need to care too much about keeping my commit messages clean.
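As a small illustration of this numbering scheme, here is a hypothetical helper (the name `bumpVersion` and the release kinds are made up for this sketch). It assumes that lower components reset to 0 when a higher one is increased, which is the usual convention but is not something stated above.

```javascript
// Hypothetical helper illustrating the X.Y.Z scheme described above.
// kind: "major" (breaking or really major change), "feature", or "bugfix".
// Assumption: lower components reset to 0 when a higher one is bumped.
function bumpVersion(version, kind) {
  const [x, y, z] = version.split(".").map(Number);
  switch (kind) {
    case "major":
      return `${x + 1}.0.0`;
    case "feature":
      return `${x}.${y + 1}.0`;
    case "bugfix":
      return `${x}.${y}.${z + 1}`;
    default:
      throw new Error(`unknown release kind: ${kind}`);
  }
}

console.log(bumpVersion("1.4.2", "bugfix"));
console.log(bumpVersion("1.4.2", "feature"));
```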

Roadmap

With this source code released, everyone can inspect the algorithm to decide if this is the right product for them. However, for self-hosting, even though you can do it right now, it requires some undocumented configuration. So I’ll do the following things to make it easier:

Add documents for self-hosting.
Add documents for admin operations like creating admin users.
Disable some components by default. To name a few:
- There is a machine learning server mentioned in this blog. I will likely disable it by default since I’m thinking about redoing this part in the near future.
- Payment is not needed for self-hosted instances so I’ll disable it by default.
- There is an image proxy that I’ll likely disable as well, just to make it easier to deploy.

A Review of Linux on Surface Pro 4

2024-07-12T00:00:00-04:00

Background

I bought a Surface Pro 4 in 2016. It has an Intel Core m3-6Y30 CPU and 4GB memory. The spec is not that impressive even compared to an average laptop released years earlier. On the other hand, the form factor is very attractive to me: at a very low price, you get a tablet with a beautiful HiDPI 2k screen, a pressure sensitive stylus and a usable keyboard. It is on the heavier side if used as a tablet, but compared to other laptops, it’s very light. It served me very well for my limited use cases. The blog Build a Unix Like Environment on Windows was written in that era. Some years later, I bought a more powerful laptop when I needed to work while traveling. So I gave the Surface away to a family member.

However, during the past years, I couldn’t stop thinking about having a Linux tablet. At first I checked Pinetab, then I realized I had a Surface which would be perfect if I could install Linux on it. I searched online and found some success stories. So when I travelled back to my hometown at the beginning of this year, I brought the Surface back with me and started to experiment with it.

Use Cases

Before I go further, I need to mention my intended use cases:

Browsing the Internet. Mainly RSS Brain, the RSS reader I built myself.
Media consumption: watching videos from my Samba share and online websites like Youtube.
PDF reading: reading only is enough for me but it’s better if I can take notes in the PDF.
Sketches: I never had a habit of taking handwritten notes, even as a student. Nowadays it’s more efficient and readable/searchable to take text notes with Markdown. However, I do like drawing sketches on paper when brainstorming or working through a hard problem. Moving that to digital has a lot of benefits if it works.
Drawing: this is a nice to have feature. I don’t really need to draw things but it’s always fun. Especially with the development of AI, if I draw something and send it to a more powerful machine to generate images, it could open doors to many possibilities.

Installation

The installation of Linux is actually very easy. I tried two distros and the installation process went very smoothly for both of them. The distros I tried are EndeavourOS and Fedora Workstation 40.

The installation steps are well documented in linux-surface’s wiki. linux-surface is the Linux kernel and tools for Surface devices. The wiki page has its installation steps as well.

In general, if only used as a laptop, the experience is almost perfect even without the linux-surface kernel. But using it as a tablet is another story.

What Works

Let’s talk about what works first. Even without the linux-surface kernel, almost everything works except the touch screen and stylus. That includes things like wireless network, bluetooth, keyboard, power profile, UI scaling for Hi-DPI and so on. Multi touch and the pressure sensitive stylus work as well (sort of, see sections below) after installing the linux-surface kernel. Battery life is good enough: about 5-6 hours of light usage like web browsing and PDF reading, and about 3 hours of video watching. (These are just estimates from my experience, no serious benchmarking was done.)

On the software side, automatic screen rotation is enabled on both distros I tried. KDE with EndeavourOS is very fast and responsive. When the keyboard is detached, it enters tablet mode, which makes some UI elements larger and more friendly to touch gestures. For example, you can just touch a folder to open it in Dolphin instead of double clicking it.

Gnome is less responsive than KDE but the UI is really beautiful when used as a tablet. I was never a fan after Gnome 3 but I guess the UI changes it made make more sense on a tablet than on a laptop or a desktop. The overall layout really reminds you of an iPad or Android tablet (in a good way), but with the power of a real desktop OS at the same time. I would really like it if it used fewer resources.

Even though the overall experience is positive and has the potential to meet all my use cases, one serious problem made it nearly unusable and made me give up Linux on the Surface in the end.

The Problems in Both Distros

The deal breaker problem is touch recognition. The problem is in the linux-surface tools so it affects all distros. The biggest problem is ghost touch: touches are registered randomly even when I do nothing. I tried a lot of workarounds including the ones mentioned in linux-surface’s wiki page, but none of them resolved it completely. Sometimes it is fixed after a reboot but reappears after the next reboot. Sometimes it stays fixed for a period of time but reappears after a system upgrade. Sometimes the touch screen doesn’t work at all after resuming from sleep. The randomness and the seriousness of the problem are really annoying, so I gave up using it with Linux in the end.

Other than the ghost touches, another big problem with touch recognition is palm rejection. It’s really annoying when drawing with the pen. In iptsd (linux-surface’s daemon for touch recognition), there is a configuration to disable the touch screen when using a pen but it doesn’t work well. So drawing is nearly unusable.

Both KDE and Gnome have virtual keyboards for when the physical keyboard is detached, and they work most of the time despite the problems I’ll mention in the following sections. But if you have set up disk encryption with a password, there is no virtual keyboard when you input the disk password, so a physical keyboard is always needed during boot. That can be annoying but is not really a deal breaker.

The last big problem is battery drain during sleep. It uses about 30% battery in one night even when it has been put to sleep. I had similar issues with other laptops. I believe there may be some configuration I can tune to fix that. But after I gave up Linux on it because of the ghost touch, I didn’t dig deeper into that.

Other than the problems shared by both distros, each distro/desktop environment also has its own problems.

The Problems in KDE with EndeavourOS

The biggest problem in KDE, other than the ones I talked about above, is the virtual keyboard. It’s buggy and not very stable. Sometimes it keeps popping up and sometimes it doesn’t show up. It’s especially annoying at the login screen: if it doesn’t pop up you still need a physical keyboard, which prevents the device from being a real tablet. Sometimes when the keyboard is shown, the panel at the bottom cannot be touched. The bugs happen randomly, which makes them hard to report properly.

Another problem is the touch gesture for right click. Naturally, with a touch screen, a long press should be treated like a right click. But that is not the case for KDE. So a lot of operations that need a right click just cannot be done without a mouse.

Resizing a window is also very tricky with touch only operation: you need to touch the border precisely on the first try.

Lastly, the scroll behaviour is not very smooth. It makes me a little bit dizzy just scrolling through web pages and PDFs.

So I thought I would give another distro and desktop environment a try, to see if they could resolve my problems.

The Problems in Gnome with Fedora Workstation 40

I chose Fedora because it comes with Gnome, and I had a good experience with it before. After the installation, the first impression is that it’s much slower than KDE with EndeavourOS. I found it enables swap and ZRam by default so I disabled them. It’s better but still slower than KDE. It uses more memory, around 40-50% while idle. And I got a lot of OOM kills, which almost never happened with KDE on EndeavourOS.

Maybe because of the slowness, it’s also buggy for lots of operations. For example, when switching to the workspace view from the PDF viewer with a 4 finger swipe up, the PDF keeps scrolling in the background. And when scrolling in the file manager, the context menu keeps popping up.

Other than the slowness, there is a problem with the virtual keyboard as well: the backspace key doesn’t work properly. I found a workaround by installing a third-party Gnome addon, but sometimes the old keyboard still pops up.

Go Back to Windows 10

I’d say if the touch recognition worked well enough, all the other problems would be acceptable with KDE. But with those problems, I finally decided to fall back to Windows 10 again. It works well enough, just as I remembered from years ago. However, I abandoned OneNote and some other Microsoft products and use the following software instead:

Firefox as the browser.
Nextcloud to sync the files.
Samba for video sharing.
Built in video player for local video playing.
Krita for drawing and sketches.
Drawboard PDF for PDF reading.

It’s pretty disappointing that this device cannot be used with Linux properly. But using Windows is still better than just letting the device sit there doing nothing. Maybe I will re-evaluate it after Windows 10 reaches end of life next year.

Create a Checkbox That Returns Boolean Value for htmx

2024-06-08T00:00:00-04:00

The Problem of Checkbox

htmx is a lightweight Javascript framework. We all know that in native HTML, a form element can send an HTTP request to a server with the values of its input elements. In htmx, this feature is made more powerful and flexible: you can include the value of any element, and with the help of htmx extensions like json-enc, it can also post JSON data.

However, there is one thing that htmx inherited from the native HTML form behaviour: for checkboxes, it only includes its value when the checkbox is checked. And the default value for checkbox is "on" instead of true (even though you can change it to another value). I understand this decision because it wants to keep the same behaviour so there is no surprise, but it also makes the backend parsing very inconvenient. The checkbox field needs some special treatment at the backend: you need to know there is a checkbox field so that you can set it to false when it’s not submitted with the request, and set it to true otherwise.
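To make that special treatment concrete, here is a minimal sketch of what the backend ends up doing under the native checkbox semantics. The function name `normalizeCheckboxes` and the field names are hypothetical, and it assumes the request body has already been parsed into a plain object.

```javascript
// Hypothetical helper: the backend must know in advance which fields are
// checkboxes, because an unchecked checkbox is simply absent from the request.
function normalizeCheckboxes(body, checkboxFields) {
  const result = { ...body };
  for (const field of checkboxFields) {
    // Present (with any value, "on" by default) means checked;
    // absent means unchecked.
    result[field] = Object.prototype.hasOwnProperty.call(body, field);
  }
  return result;
}

// Checked: the field arrives as "on" and becomes true.
console.log(normalizeCheckboxes({ title: "hello", subscribed: "on" }, ["subscribed"]));
// Unchecked: the field is missing and becomes false.
console.log(normalizeCheckboxes({ title: "hello" }, ["subscribed"]));
```

The list of checkbox field names is exactly the extra knowledge the backend needs, and it is what the rest of this article tries to remove.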

In this article, we will explore how to define a custom checkbox element so that it has a boolean value and will always be submitted with the HTTP request. We first explore the implementation for htmx and then for native HTML.

How htmx Submits the Checkbox Value

In order to make it work with htmx, we first need to know how htmx builds the HTTP request with parameters. The documentation doesn’t have a lot of details but we can always check the source code. The code that processes input values is in the function processInputValue:

function processInputValue(processed, values, errors, elt, validate) {
    if (elt == null || haveSeenNode(processed, elt)) {
        return;
    } else {
        processed.push(elt);
    }
    if (shouldInclude(elt)) {
        var name = getRawAttribute(elt,"name");
        var value = elt.value;
        if (elt.multiple && elt.tagName === "SELECT") {
            value = toArray(elt.querySelectorAll("option:checked")).map(function (e) { return e.value });
        }
        // include file inputs
        if (elt.files) {
            value = toArray(elt.files);
        }
        addValueToValues(name, value, values);
        if (validate) {
            validateElement(elt, errors);
        }
    }
    if (matches(elt, 'form')) {
        var inputs = elt.elements;
        forEach(inputs, function(input) {
            processInputValue(processed, values, errors, input, validate);
        });
    }
}

So it checks whether the element should be included through the function shouldInclude(elt) and gets its value if so (there is some additional logic for select and file inputs but it’s not a concern here). In shouldInclude, it will only include a checkbox if it’s checked:

function shouldInclude(elt) {
    if(elt.name === "" || elt.name == null || elt.disabled || closest(elt, "fieldset[disabled]")) {
        return false;
    }
    // ignore "submitter" types (see jQuery src/serialize.js)
    if (elt.type === "button" || elt.type === "submit" || elt.tagName === "image" || elt.tagName === "reset" || elt.tagName === "file" ) {
        return false;
    }
    if (elt.type === "checkbox" || elt.type === "radio" ) {
        return elt.checked;
    }
    return true;
}

Create a Custom Checkbox Element with Web Component

I tried to find or write an extension for htmx to include checkbox elements with boolean values, but from what I learnt in the htmx extension doc, there is no good way to do that. So I decided to create a custom HTML element that extends input to return boolean values for htmx to get.

With web components, we can create an HTML tag that can be used just like any built-in HTML tag. The MDN guide does a good job of explaining how to do it so I will not repeat it here. I’ll just put my implementation of the customized checkbox here:

class BooleanCheckbox extends HTMLInputElement {
    constructor() {
        super();
    }

    get checked() {
        return true;
    }

    get value() {
        if (super.checked) {
            return true;
        } else {
            return false;
        }
    }
}

customElements.define("boolean-checkbox", BooleanCheckbox, { extends: "input" });

You can see it’s very simple. It extends the input element. It overrides checked to always return true so that htmx will always include it in the request. And for value, it returns a boolean that depends on super.checked. Finally, it registers the customized element under the name boolean-checkbox, so that we can just use it like this in HTML:

<input type="checkbox" is="boolean-checkbox" />Boolean checkbox

The is="boolean-checkbox" part tells the browser that this is a customized input element.

Here is a complete example:

  
    <title>htmx boolean checkbox example</title>

      <input type="checkbox" name="default-checkbox" />Default checkbox

      <input type="checkbox" is="boolean-checkbox" name="boolean-checkbox" />Boolean checkbox

      <button hx-post="test-post" hx-ext="json-enc">Submit</button>

It defines two checkboxes: a native one and a customized one. We use the json-enc extension so it will post JSON as the request body. When you click the submit button, if both of them are unchecked, the post body looks like this:

{"boolean-checkbox":false}

And if both are selected, here is the post body:

{"default-checkbox":"on","boolean-checkbox":true}
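To see why the two bodies come out this way, the inclusion logic can be modelled with plain objects instead of DOM elements. This is only a simplified sketch of the checkbox branch of shouldInclude, not htmx’s real code; `collectValues`, `nativeBox` and `booleanBox` are names invented for the illustration.

```javascript
// Simplified model of htmx's checkbox handling, using plain objects
// instead of DOM elements.
function collectValues(elements) {
  const values = {};
  for (const elt of elements) {
    // Mirrors the checkbox branch of shouldInclude: unchecked boxes are skipped.
    if (elt.type === "checkbox" && !elt.checked) continue;
    values[elt.name] = elt.value;
  }
  return values;
}

// Stand-in for a native checkbox: value defaults to "on".
const nativeBox = (name, isChecked) =>
  ({ type: "checkbox", name, checked: isChecked, value: "on" });

// Stand-in for BooleanCheckbox: always reports checked, and the value
// carries the real state.
const booleanBox = (name, isChecked) =>
  ({ type: "checkbox", name, checked: true, value: isChecked });

const unchecked = [nativeBox("default-checkbox", false), booleanBox("boolean-checkbox", false)];
const checked = [nativeBox("default-checkbox", true), booleanBox("boolean-checkbox", true)];

console.log(JSON.stringify(collectValues(unchecked))); // {"boolean-checkbox":false}
console.log(JSON.stringify(collectValues(checked))); // {"default-checkbox":"on","boolean-checkbox":true}
```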

What About the Native HTML Form Action

The custom element boolean-checkbox only posts boolean values when used with htmx. If you use a native form action like this:

<form action="test-call">
  <input is="boolean-checkbox" type="checkbox" name="boolean-checkbox">Boolean Checkbox
</form>

The behaviour is still like the native checkbox, which only posts the value “on” when it’s checked.

Even though I don’t use the native form action, it still makes me wonder if I can support it. (Disclaimer: all the code below is experimental and I don’t recommend that anyone use it in production without careful testing.)

In fact, there is a way to set the form value in a web component through ElementInternals.setFormValue:

this.internals = this.attachInternals();
this.internals.setFormValue(this.value);

However, in the HTML standard, ElementInternals is not supported if the custom element extends a built-in element such as input. Actually there is a Github issue asking for this feature, and the reason given for not supporting it doesn’t make sense to me:

Since Apple’s WebKit team’s position is that customized builtins shouldn’t exist in the first place, we don’t support this proposal.

Anyway, it is what it is. So I need to work around it. The solution I came up with is to include another checkbox element as a child instead of inheriting from it. Here is the code:

class BooleanCheckbox extends HTMLElement {

    static formAssociated = true;

    constructor() {
        super();
        this.internals = this.attachInternals();
    }

    connectedCallback() {
        this.shadow = this.attachShadow({mode: "open"});
        const internalCheckbox = document.createElement("input");
        internalCheckbox.setAttribute("type", "checkbox");
        this.getAttributeNames().forEach((name) => {
            internalCheckbox.setAttribute(name, this.getAttribute(name));
        });
        this.shadow.appendChild(internalCheckbox);
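        // Report the boolean value from the start; the inner checkbox's own
        // value would be the default "on" even when it is unchecked.
        this.internals.setFormValue(this.value);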
        internalCheckbox.addEventListener('change', () => {
            this.internals.setFormValue(this.value);
        });
    }

    get checkbox() {
        return this.shadow.querySelector("input[type=checkbox]");
    }


    get checked() {
        return true;
    }

    get value() {
        if (this.checkbox.checked) {
            return true;
        } else {
            return false;
        }
    }

}

customElements.define("boolean-checkbox", BooleanCheckbox);

It listens for the change event on the child checkbox and updates the form value based on it. static formAssociated = true; is needed so that we can set form values.

Then in HTML, we can use it like this:

<form action="/test-call">
  <input type="checkbox" name="default-checkbox" />Default Checkbox

  <boolean-checkbox name="boolean-checkbox"></boolean-checkbox>Boolean Checkbox

  <input type="submit" />
</form>

When you click the submit button, it calls /test-call?boolean-checkbox=false if both checkboxes are unchecked and /test-call?default-checkbox=on&boolean-checkbox=true if both are checked.

Random Playlists for Self Hosted Videos

2024-06-03T00:00:00-04:00

With the development of computer systems and online streaming services, it has never been easier to play TV shows or movies on demand. There are some shows that I watch over and over when I want to relax. But the action of finding a show and selecting an episode makes it less casual. To some extent, I miss the old days of casually turning on a TV channel just to watch some random things. In this article, I will explain my journey to achieve that. More specifically, I want something like this:

Be able to add videos into collections and play videos randomly from a collection.
Be able to share the collections and videos to other devices including:
- Other desktops and laptops, including Linux and MacOS.
- Mobile devices including Android and iOS.
- TVs like Android TV box.
No transcoding on the server.
The solution needs to be self hosted, free and open.

I think I need to add more details about “no transcoding on the server” since a lot of solutions need that. All my devices can play the formats in my video collection. So it’s a waste of resources to do another transcoding on the server, especially since my video server is also the desktop PC I use the most every day. If this is not a requirement for you, you may be able to find much better solutions. That’s why I listed everything I’ve tried, so it may help someone even where it’s not the solution I chose in the end.

So here you go. If you just want to see my final solution, go to the last section.

ErsatzTV

ErsatzTV is a self hosted service to create live TV channels and stream them. You can add videos to collections, and put them into schedules. It works very much like a real TV channel. It has a Docker image and doesn’t need any external databases, so it’s really easy to try it out. Once you create channels it can generate m3u8 playlists so that you can stream it on any client that supports it.

It has great features. However, when I skip transcoding on the server and use HLS Direct to stream the videos, there are some problems: I cannot open the stream in VLC. I could open it in Jellyfin, but once it jumps to another video with a different format, it stops playing. I need to restart the client, which is very annoying.

Jellyfin

Jellyfin is a very popular media server. It’s like the more popular Plex but is free and open source. I was never a fan of Plex since it just doesn’t feel right to self host something you cannot really control. Jellyfin has come a long way since I tried it a few years ago. You can add TV shows and movies to collections and play random videos from there. Even though it’s not as powerful as the schedule feature in ErsatzTV, it’s still great for my use case.

However, it falls short on the “no transcoding on the server” part again. The web client can only play a very limited set of video formats. Its Linux client can play most video formats fairly well, but I still need my mobile devices to be able to play the videos. The official Android and iOS clients are not any better than the web client. A third party iOS client, Swiftfin, does much better, but somehow it cannot play from a collection.

Kodi with Samba

I use Kodi on my Android TV box all the time. The videos are shared through Samba. Kodi has an official add-on to play random videos from TV shows, movies, folders and so on. However, Kodi doesn’t have the concept of a collection. It has playlists but they are very hard to use. In theory you can symlink all the videos into different folders as collections and play from there, but it’s too hacky. Kodi has an iOS client but it’s not in the App Store so it needs to be compiled and re-signed every few days.

While I was exploring these ideas, I realized that even though Kodi doesn’t have any good built in playlist or collection feature, there are some file formats for video playlists. With those, we can even use other video clients. The m3u format came to mind first, but the video paths need to be relative so that the playlist can be played from any device even when the mount point is different. In the end I found XSPF, which allows relative paths for the videos. With that, I came up with my final solution.

XSPF Playlist on Samba with Kodi and VLC

So based on my exploration above, I came up with the idea of using XSPF playlists to create collections. I just put the XSPF files in my videos folder so they are shared through Samba together with the videos. Since the video paths in the playlist are relative, once you mount the Samba folder on other devices, you can just click the playlist and play it through supported clients.

For the clients, I use Kodi on the Android TV box and VLC on other devices. For Kodi, once the Play Random Videos add-on is installed, you can long press on the playlist file to play a random video. VLC on desktops can be configured to always play videos randomly from a playlist. On Android, there is an option to shuffle play once you open the playlist. But strangely, VLC on my iPad is not able to play the XSPF file. I may dig into that in the future but it’s good enough for me now.

The only part left is to create the XSPF playlist. It’s an XML file so you can edit it manually, but that takes too much time. So I created a Scala script to add or remove videos from a folder. Even though I have used Scala for a long time and write my side projects with it, it’s the first time I have used it for scripting, and it’s such a pleasure with the help of Ammonite. The script is on Github so that you can also use it if needed.
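For readers who have never seen the format, here is a minimal sketch of what generating such a playlist could look like. It is not the Scala script mentioned above, just an illustration written in Javascript: the function name `buildXspf` and the example paths are made up, and it assumes the player resolves relative locations against the playlist file’s own folder.

```javascript
// Sketch only: builds an XSPF playlist from relative video paths.
function buildXspf(relativePaths) {
  const tracks = relativePaths
    .map((p) => {
      // <location> holds a URI, so percent-encode each path segment.
      // This also keeps characters like "&" and "<" out of the XML.
      const uri = p.split("/").map(encodeURIComponent).join("/");
      return `    <track><location>${uri}</location></track>`;
    })
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<playlist version="1" xmlns="http://xspf.org/ns/0/">',
    "  <trackList>",
    tracks,
    "  </trackList>",
    "</playlist>",
  ].join("\n");
}

// Hypothetical paths, relative to the folder that contains the playlist.
console.log(buildXspf(["shows/Some Show/S01E01.mkv", "movies/A & B.mp4"]));
```

Because every location is relative, the same file works no matter where the Samba share is mounted on each device.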

Make Flutter Web Apps More Native Like

2024-04-18T00:00:00-04:00

Background

I’ve built the client app of RSS Brain with Flutter so that I don’t need to write different code for different platforms. It’s a pleasure to write Flutter code. And the app works well enough on Android and iOS. However, Flutter web support is a different story. You can feel the app is just not a normal website. I’m not satisfied with that. After trying and failing to make it more like a native web page, I’m rewriting it with web technology again. That’s why the last blog post is about htmx.

Before I move on, I’d like to record what I have tried, as a note for myself and hopefully to help someone else. It’s really sad that this article is my first blog post about Flutter, and maybe the only one for a long time.

How Flutter Renders a Web Page

In Flutter, you define the UI widgets in Dart. The Flutter engine will parse the widgets and render them to different targets: iOS, Android, web and even Windows and Linux applications. In principle, I think that is a good idea and I really enjoy writing Flutter code compared to Javascript frameworks like AngularJS or ReactJS. It’s really unfortunate that the web support is not good enough for me.

The core problem is how Flutter renders web pages. We all know a web page is represented in HTML. Even if we don’t write HTML directly but use a Javascript framework, it is manipulating HTML tags in the end. Flutter rendered widgets to HTML elements like div at first. However, it was later changed to draw all the widgets in a canvas. (The old rendering method is still available through --web-renderer html, but I encountered multiple bugs and it seems to be getting less and less care.) This means Flutter web apps don’t really behave like native web pages, because a normal web page doesn’t have everything in a canvas.

For the problems it brings, I found solutions for some of them. For some others, I didn’t find one. The sections below are some of the problems and some of the solutions.

Make Text Selectable

By default, the text in a Flutter app is not selectable. You can use the SelectableText widget to make text selectable.

Make Links And Buttons Recognized by Browser

I use Vimium heavily. But because Flutter renders all the content into a canvas, clickable links and buttons are not recognized by the browser, which stops Vimium from working. This is a deal breaker for me, especially since it’s something I built that breaks my workflow.

In the end I found a solution to make links and buttons recognizable: enabling semantics support. Add this line in the main function after runApp:

SemanticsBinding.instance.ensureSemantics();

This will render extra information in HTML instead of only drawing the canvas. It will make widgets like Button recognizable.

However, if you are using something lower level like GestureDetector, you need to wrap the widgets with Semantics. Here is an example:

Semantics(button: true, enabled: true, child: myCustomClickable)

myCustomClickable will be recognized as a clickable element with that.

Scrolling Behaviour

The scrolling feels choppy sometimes. And because the browser has no idea about the scroll position of the page, the scrolling behaviour feels different from a normal page. For example, here is a Github issue opened 4 years ago that describes this kind of problem and is still not resolved. For me, this was the last straw that made me give up on Flutter, since it breaks the scrolling keyboard shortcuts of Vimium.

Conclusion

The idea behind Flutter is great. I hope the web support can be better and better so that I can finally come back to it one day. But for now, I cannot wait for it and need to take another route. Stay tuned for more updates about that journey.

Prevent htmx Lazy Loaded Content From Reloading

2024-03-26T00:00:00-04:00

This is a short article about some tricks in htmx. I have more to say about htmx but I’ll save that for another blog post. In this one, I will skip the basics of htmx and assume you already know them.

1. Problem

I’ll briefly introduce two features of htmx in order to explain the problem. You can go to the official website for more details about the features.

1.1. Browser History

htmx has a feature to interact with browser history. Here is an example in the official document:

<a hx-get="/blog" hx-push-url="true">Blog</a>

This will change the URL in the browser to /blog when you click the link and save a snapshot of the current page into local storage. When you click the back button in the browser, htmx will try to find the cache in local storage and swap it in, so you don’t need to reload the whole page.

1.2. Lazy Load

htmx sends requests when an event is triggered on an element. The rule is defined by the hx-trigger attribute. There are some special events that can be used for lazy loading:

load - triggered on load (useful for lazy-loading something).
revealed - triggered when an element is scrolled into the viewport (also useful for lazy-loading).
intersect - fires once when an element first intersects the viewport.

However, when this is combined with history support, the lazy loaded elements will be requested again when the pages are navigated in history. Here is an example:

<a hx-get="/page1" hx-push-url="true" hx-target="#content">page1</a>
<div id="content" hx-get="/content" hx-trigger="load"></div>

When you click on page1, it will replace the content of #content with the response from /page1 and change the URL. However, when you click back in the browser, htmx will send a request to /content again even though it’s already in the history cache, because technically #content is loaded again, so hx-get is triggered based on the hx-trigger rule. This wastes resources and can sometimes make the page lose its previous scroll position.

In this article, I’ll show some tricks to prevent this. They are very simple once you know them but sometimes it’s just hard to get when you are new to the framework.

2. Best Solution: Swap Outer HTML instead of Inner HTML

I think this is the best solution. It’s so simple that I don’t know why I didn’t get it earlier. Anyway, that’s why I write this blog so that it can help more people like me.

By default, htmx swaps the inner HTML of the element. So the hx-trigger="load" attribute is still there after the content is loaded and will be triggered again when the page is loaded from history. The solution is to just let htmx swap the outer HTML instead. Using the same example, the code will be changed to this:

<a hx-get="/page1" hx-push-url="true" hx-target="#content">page1</a>
<div id="content" hx-get="/content">
  <div hx-get="/content" hx-trigger="load" hx-target="this" hx-swap="outerHTML"></div>
</div>

In the new implementation, we have another div tag inside #content to do the lazy load. After the response is loaded, it will swap out the whole div element so hx-get and hx-trigger are not there anymore when the snapshot is taken and loaded from history.

As I said, this is the best solution in my mind and I think it fits all the cases. So if you only care about the solution, you can stop reading here. I record the following solutions simply because I figured them out earlier than this one.

3. Solution B: Don’t Snapshot the Whole Body

The solution above removes the htmx attributes. The solution in this section tackles the problem from another direction: it prevents the element from loading again when going back in history.

By default, htmx will take a snapshot of body and put it into the history cache. That’s why the load event of the element is triggered again when going back in history. To prevent it, we can let htmx only snapshot the children of #content. Here is the official doc about how to do it. Using the same example, the code will be changed into:

<a hx-get="/page1" hx-push-url="true" hx-target="#content">page1</a>
<div id="content-load" hx-get="/content" hx-trigger="load" hx-target="#content">
</div>
<div id="content" hx-history-elt></div>

Here we load the content with #content-load element. htmx will only swap out #content when we forward or go back in browser history since we added hx-history-elt on #content. This prevents load event from being triggered on #content-load so it will not send a new request.

But this solution has great limitations: you need to change the snapshot element which is not always possible.

4. Solution C: Remove htmx Action Attributes Before Taking Snapshot

This is a solution that could work in theory but I didn’t test it, because I came up with the best solution when thinking about it.

The idea is similar: we don’t want htmx action attributes like hx-get when we load the history. Instead of swapping the whole outerHTML, there is an htmx event you can catch in Javascript to remove the attribute before a snapshot is taken:

htmx.on('htmx:beforeHistorySave', function() {
  // getElementById takes the bare id, without the leading '#'
  document.getElementById('content').removeAttribute('hx-get')
})

Travel Back to China

2024-03-19T00:00:00-04:00

After more than 4 years abroad without being able to go back to China because of Covid, I finally had the chance and spent this Chinese New Year in my hometown. Now that I’ve come back to Toronto, it’s time to record it while my memory and feelings are still fresh.

Before We Go

Being able to travel back doesn’t mean it’s easy. The number of flights between China and other countries has still not recovered to the pre-Covid level. In order to keep our budget at a reasonable level, we needed to fly with 2 stops, and then drive more than 3 hours home from the airport. The flying time and the waiting time at airports combined was more than 24 hours. It’s a very long trip for any adult, not to mention traveling with a 6-month-old baby. I was very nervous about that since our longest trip with the baby so far was taking her to the clinic. But since we hadn’t been back for so many years, this was a trip that shouldn’t be delayed anymore. On the bright side, my sister-in-law would come to meet us at the first stop, Tokyo. We would stay there for a few days for rest and sightseeing.

Things were not smooth before we left. First, our company had a bad outage during the holiday season so I needed to work overtime. Following that were heated debates about the next steps to make our services better, which made me very frustrated for reasons I’d rather not talk about here. And during all these things, the whole family also caught a cold and had a fever for a few days. It was the first time the baby was sick so it was a very stressful situation. The baby woke up every 2 hours at night while my wife and I were sick. When everyone finally recovered, we barely had the time to get new Covid vaccines and pack the baggage.

Anyway, we successfully handled everything before we left, took the 15-hour flight and headed to our first stop, Tokyo.

Stay in Tokyo

I had been to Tokyo twice. But both visits were many years ago and I didn’t have enough time to see the whole city. Even though it was a relatively short visit again this time, just 3 to 4 days, I was still excited to be there again.

We wanted to book a hotel near the Asakusa (浅草) area since Sensō-ji is a must-see site in Tokyo. We also wanted a subway station nearby. In the end we found a place near Ueno (上野) station. It’s a traditional Japanese style hotel with tatami, which is perfect for us: our baby doesn’t like to sleep in the crib anymore, so tatami is much safer since she cannot fall from it, and the mattress is also much firmer than the ones in regular hotels, which prevents the baby’s face from being buried in the mattress. She can also play on it during the daytime.

Only after I booked the hotel did I find out that the Ueno area is a place I had wanted to visit but didn’t have enough time for last time: when my wife and I last visited Tokyo, we planned to take the train from Ueno station to the airport. We didn’t notice that the train runs less frequently than the subway, so we didn’t plan the time ahead and missed it. Disappointed, we decided to have some food near Ueno station first. That’s when we discovered the area around Ueno: there are many pedestrian streets filled with street food, outdoor dining, restaurants, shops, and people. I was so fascinated by it, and it was a shame that we didn’t have enough time to explore the area since we needed to head to Kyoto that day. I had forgotten the name of that area since then because we were in such a hurry and we visited so many places in Japan after that. But when I was checking the surroundings of the hotel on Google Maps, Ueno station jogged my memory and I was so excited to have another opportunity to fully explore that area.

So Ueno and Asakusa are where we explored most when we were in Tokyo. We went to the Tokyo National Museum and enjoyed the ukiyo-e (浮世絵, wood block prints) exhibition that I had wanted to see long before we went. We ate some delicious food in the Ueno area. We also bought some electronic devices and manga books at Akihabara (秋葉原). What I didn’t expect was the experience at Asakusa: last time I only visited Sensō-ji and the street in front of it. I didn’t know there is a larger area surrounding it that has lots of traditional Japanese style buildings, shops and restaurants. We found a shop by coincidence that sells high quality ukiyo-e prints. There are many places selling them in Tokyo, but they are either low quality or too expensive. I’m so glad to have found a shop that sells lots of high quality prints in a reasonable price range.

The whole experience in Tokyo was very positive. The mix of traditional Japanese and modern culture creates a very unique vibe. Because Japanese culture was largely influenced by Chinese culture in the past, I think I can appreciate more of its beauty. I looked at the city with fresh eyes after exploring the topic of city design in the past years: the non-car-centric culture, high quality public transit and high population density make it very different from North American cities. The city is more vibrant, much cleaner and safer, and has so many interesting places to explore. But unfortunately, Japan is a country better for visiting than for long-term living for foreigners because of its (almost non-existent) immigration culture and stressful working environment.

The biggest happy surprise we got from Japan is that our baby could sleep through the whole night! It’s such a life changing improvement for my wife and me. After having the baby, I felt like there is nothing more important than being able to sleep a whole night. It’s so great to have that back!

On that happy note, we continued the trip back home.

Back to Hometown

I thought there would be lots of feelings on the road home. But there weren’t. Maybe it’s because there are too many concrete things to worry about with a baby on the road, which left little room for feelings.

With the things that happened in the past years in China: the lockdown of cities during Covid, the re-election of Xi which broke political convention, the protests against both the re-election and the lockdowns, the collapse of Evergrande Group, and the decline of the America-China relationship, you’d imagine China is in a pretty bad place. However, when I went back home, I found things were not as bad as I thought. Yes, the economy has gone bad: there are lots of unfinished buildings, small businesses are struggling, it’s harder for new grads to find a job, nationalism is on the rise and so on. But on the other hand, at least from my limited experience, people’s lives are still going on. I saw disapproval when people talked about the economy and government policies, but I saw little desperation. When there are fewer ways to make life better, people continue to find new ways. For good or for bad, that’s the resilience of Chinese people. Maybe it looked better than normal because it was the holiday season: the malls and restaurants were packed with people. Beautiful decorative lights were everywhere. There were fireworks every day.

Theoretically, fireworks have been banned for many years in China. However, with the lifting of the lockdowns at the end of last year, there were lots of celebrations with fireworks at the new year (not the Chinese New Year), which created some conflicts between the crowds and the police in some cities. After that, the ban on fireworks still exists but is rarely enforced. Fireworks on New Year’s Eve have been a tradition since ancient times. But in my opinion, the mixed feelings they bring represent the complexity of contemporary China perfectly: they’re believed to dispel bad luck, which is much needed after Covid and the weak economy that followed. This extends to some level of superstition: some people believe fireworks can clean up the virus in the air. They also seem to be a subtle way to express disapproval of government policies because the ban is still in place. Of course there is also pure excitement about the lights and sounds, the happiness of the holiday, and wishes for a better year ahead.

Another reason the weak economy isn’t showing much trace may be that my hometown is a small city, so the trend lags behind. It’s still benefiting from the development of the bigger cities in the past years: more and more big brands and chain stores are opening so there are more choices when buying things. Food delivery is more convenient. There are also more cultural and creative products with better traditional Chinese aesthetics. Setting aside the education and healthcare systems, everyday life is little different from big cities, or even better because of the less stressful working environment.

Not all things are good. Setting aside the things that were already there before I left China, there is one new development that would trouble me a lot if I lived there for a longer time: the lack of privacy both in the real world and in cyberspace. In the real world, cameras are everywhere. Lots of people have started to use smart door locks with a camera that you cannot avoid when you pass by. You must scan your face in order to enter some residential compounds. Every intersection has high resolution cameras that record license plate numbers and are able to recognize the drivers. In Beijing, faces are recorded when entering every subway station. Even worse, when I was playing arcade games in a mall, the arcade machine had a camera that took a photo of me without notifying me first. On the cyberspace side, there are few services you can use without installing an app and registering an account that is linked to your phone number and in turn linked to your ID. The worst experience I had was at a parking lot: there was no attendant at the exit and you needed to scan a QR code to register an account, input personal details and pay the fee in order to leave. Again, nothing warned you about that before you actually tried to leave and scanned the QR code. I guess it’s not that no one in China cares about privacy; it’s more like an already lost battle because of the desire for surveillance from both the government and the tech giants, and the lack of power to balance that.

Another thing I dislike is the trend of city planning. I think the city did a very good job in the past: reasonable density and mixed use were very well maintained. There are dedicated bike lanes, wide sidewalks and reasonable public transit coverage. However, with the wide adoption of cars, things got worse, and the city seems to just want to change things for the sake of change. That’s kind of understandable because there are more opportunities for corruption when there are more projects. But in the end, parking lots replaced lots of green spaces on sidewalks. Roads have been redesigned with confusing turning lanes which replaced some bike lanes. Traffic lights replaced lots of roundabouts, and even worse, sometimes traffic lights are combined with roundabouts, which is totally unnecessary. If the changes in the old areas are limited by the existing foundation, then it’s no surprise that the worst planning happens in the newly developed areas. They are mostly high-rise residential buildings with little commercial use. That’s kind of understandable as well since one of the main sources of government income is selling land to developers. It seems some commercial uses were planned but the progress got delayed because of the real estate crisis. But the most ridiculous part is the road network design: there are many very wide roads. Many of them have 10 lanes! And some of them even have an additional 2-3 lane service road on each side. Be aware that those are not highways. In a grid layout, those very wide roads are right beside residential buildings and intersect at every crossing without skipping any. It’s such a waste of resources, because if there are so many cars that such wide roads are needed, then the absence of a road hierarchy doesn’t make any sense. Combined with the lack of commercial uses, it makes people rely more on cars and makes traffic very bad in commercial areas. Just going outside for a walk like in the old days is not enjoyable anymore in the newly developed areas.

Despite all those things, it was still a vacation in my hometown. So my mind was relaxed even though I was very busy physically: my wife’s sister got married just days after we arrived. My wife and I also had the wedding that was planned years ago but got postponed because of Covid. I’m very happy with how the wedding went considering we needed to take care of the baby at the same time. When we were not preparing for the wedding, we took the baby to my parents’ and my in-laws’ places. In the gaps, I also needed to find some time to meet friends that I hadn’t seen for a long time. So it was a very packed schedule, but it felt so different (in a good way) to be close to family and friends again. Being abroad for so many years and having a baby gave me a new perspective on the importance of family and friends.

However, I couldn’t stay there for long. I had saved half a month of parental leave for the trip. But even combined with that, one month is basically the most I can have for a vacation, and the company doesn’t allow working from China. So even though we felt like we hadn’t spent much time at home yet, we needed to go back. The trip back to Toronto had 3 stops on the way. So it was another battle to fight. Our first stop was Beijing, where we would stay at the airport for one night.

One Night in Beijing

I lived in Beijing for 8 years. It’s the city I’ve lived in for the second longest time, just behind my hometown. It’s the longest if considering only my adulthood. So I have lots of memories and friends there. It’s unfortunate that I could only stay there for one night but it’s better than nothing.

Just the day before leaving for Beijing, there was a snowstorm and most highways were closed as a result. We had booked the train from a nearby city because its schedule was better. But since the highway was closed, we changed the departure station to our hometown. It was a very cold morning and we needed to leave for the railway station at 5:00am. When we were waiting at the station, there was an announcement that the train was delayed. Following that, there were more announcements and the delay got longer and longer. Luckily, while we were debating whether it was better to go home instead of waiting in the station, the delay got shorter and we were finally able to board the train.

Things got better after this rocky start. We took the subway to the airport after arriving in Beijing since we didn’t have the baby’s car seat with us. It’s mostly underground on the way so I had little opportunity to see the city. It was almost time for dinner when we arrived at the hotel in the airport. If not for the train delay, we could have arrived at noon. I had made a plan to meet some friends and have dinner together. I left the hotel early to walk around the city before meeting them.

I took the subway to Sanyuan Bridge (三元桥). It’s at the northeast corner of the Third Ring Road and only one stop away from the airport on the airport express line. I went to a mall first. It was still early on a weekday so it was a little bit quiet there. I decided to walk to a nearby subway station, Liangma Bridge (亮马桥), a place surrounded by many embassies. I met my friends there and we walked to a nearby restaurant together. Two of my jobs were in that area so I have lots of memories there. Walking along the streets at night, everything felt very familiar but also had a sense of distance. Lots of restaurants and shops have disappeared or changed owners, but the basic layout is the same. While taking the subway, walking along the narrow road that has barriers to separate it from the Third Ring Road, and going through the underground tunnel, I recognized the familiar feeling: Beijing is like a big machine or beast that doesn’t care about normal human beings. The city is not designed at human scale. Multiple ring road highways cut through the city with giant interchange bridges. But that doesn’t make driving easier because the traffic is still bad and only a limited number of cars can be on the road on weekdays based on the license plate number. The public transit is wonderful compared to North American cities and most people use it. But the subway is usually packed during commute hours and stations have maze-like paths for exits and connections to other stations. It can suck all the remaining energy out of you after a work day. The pace of life is fast and people are busy. It’s not an enjoyable city to live in. But it’s still the capital of China and the biggest city in the north. There is no shortage of people living there with the hope of a better life. I was one of them and it gave me valuable adventures. I don’t love the city but I love the memories I have of it.

Back to Toronto

We left early the next morning for the flight to Tokyo, then to Montreal, then to Toronto. We had the opportunity to fully explore the airports in Tokyo and Montreal because we left enough time between the flights. The longest flight, from Tokyo to Montreal, was full but fortunately the baby was able to sleep most of the time. We arrived in Toronto at night, which means it was another morning in China and it had been more than 24 hours since we took the first flight from Beijing.

The whole trip went much better than I thought. Not only did we see family and friends after 4 years, we also gained the experience of traveling a long distance with the baby, which opens up so many possibilities in the future. It would be great if I could stay longer with family every year in the future.

Scala 2 Macro Tutorial

2023-12-29T00:00:00-05:00

Macros are powerful but complex, especially when the language itself, like Scala, is already complex. The lack of learning resources and documentation makes it more so. In this article, I’ll write down some of what I have learned and hopefully it can help someone else who is new to it as well. I’ll keep the examples small and simple so they are easier to understand. Since I’m still learning it, I may continue to update this article along the way, or write a new article if there is a big topic. Either way, I’ll make notes here so you know there are updates.

Scala’s macro syntax and APIs can be different from version to version. In particular, they were almost completely redesigned in Scala 3. This article only targets Scala 2 and I’ve only tested the examples on Scala 2.13.

1. What is Macro

The basic idea of a macro is to modify code with code. For example, let’s imagine a macro plusToMinus that changes all the plus operations on integers to minus:

plusToMinus { 1 + 1 }

This will be compiled to 1 - 1 and ends up as 0.

Of course this is not a practical example and not every language’s macro system can do it. But this demonstrates what macros can do where normal code cannot. Here is a more practical example: when we log something at different log levels, the API usually looks like this:

val v = ...
logger.info(s"This is a info log. Value: $v")
logger.warn(s"This is a warning. Value: $v")

However, with this kind of interface, the string s"..." needs to be computed before it is passed to the method, which is a waste since not all the strings need to be logged under the configured log level. This is especially true when v.toString is expensive to compute. So in languages like Java, the values are usually passed in as separate parameters:

String v = ...
logger.info("This is a info log. Value: {}", v);
logger.warn("This is a warning. Value: {}", v);

Even though it resolves the problem, the interface is kind of awful. And not all users know this kind of detail, so they may still just construct the string directly instead of passing in separate parameters. However, with the help of macros, you can still keep the logger interface intuitive. As macros, logger.info and logger.warn can modify the code directly at compile time. For example, they can modify the code like this:

From

logger.info(s"This is a info log. Value: $v")

To

if (loggerLevel >= INFO) {
  println(s"This is a info log. Value: $v")
}

So that the actual string computation is not done unless the log level is configured to print it.
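For comparison, the same "don’t build the string unless it is needed" effect can be approximated without macros by using by-name parameters. This is a different technique from the macro rewrite described above (it creates a small closure at every call site instead of rewriting the code), and the names LazyLogSketch, Logger and Level are invented for this sketch:

```scala
// Hypothetical logger using by-name parameters instead of macros.
object LazyLogSketch {
  sealed abstract class Level(val rank: Int)
  case object Info extends Level(1)
  case object Warn extends Level(2)

  // Messages below `minLevel` are dropped; `sink` receives the rest.
  final class Logger(minLevel: Level, sink: String => Unit = (s: String) => println(s)) {
    // `msg` is by-name: the string is only built if the level check passes.
    def log(level: Level, msg: => String): Unit =
      if (level.rank >= minLevel.rank) sink(msg)

    def info(msg: => String): Unit = log(Info, msg)
    def warn(msg: => String): Unit = log(Warn, msg)
  }

  def main(args: Array[String]): Unit = {
    var evaluations = 0
    def expensive(): String = { evaluations += 1; "big value" }

    val logger = new Logger(minLevel = Warn)
    logger.info(s"This is an info log. Value: ${expensive()}") // dropped, string never built
    logger.warn(s"This is a warning. Value: ${expensive()}")   // printed
    println(s"expensive() ran $evaluations time(s)")
  }
}
```

With the level set to Warn, only the second call evaluates its message. A macro version would get the same laziness by inlining the level check at the call site.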

2. How to Write a Macro

Different languages have different syntaxes for writing macros. On the simpler side, macros in C can only do text substitution. On the powerful side, Lisp languages can modify the AST (abstract syntax tree) very easily because the code itself is written as a tree structure. Macros in Scala are on the powerful side since they are able to modify the AST, even though it may not be as intuitive as in Lisp. There are multiple ways to do it. But essentially, the process is to take the current AST as input and output a new AST. The APIs for reading the input AST are very similar to the reflection APIs (and in fact, sometimes they share some APIs). Generating a new AST is more complex. In the following sections, we will walk through how to set up an SBT project to write macros, how to read an AST and how to generate a new AST.
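To make "AST in, new AST out" concrete without any macro machinery, here is a toy model in plain Scala. It is only an illustration: the Expr, Num, Plus and Minus types are made up for this sketch and are not the compiler’s real tree classes, but the plusToMinus function has the same shape as the imaginary macro from section 1.

```scala
// Toy AST, unrelated to the real scala.reflect trees.
object AstSketch {
  sealed trait Expr
  final case class Num(value: Int) extends Expr
  final case class Plus(left: Expr, right: Expr) extends Expr
  final case class Minus(left: Expr, right: Expr) extends Expr

  // Walk the tree and rebuild it, turning every Plus node into a Minus node.
  def plusToMinus(e: Expr): Expr = e match {
    case Num(_)      => e
    case Plus(l, r)  => Minus(plusToMinus(l), plusToMinus(r))
    case Minus(l, r) => Minus(plusToMinus(l), plusToMinus(r))
  }

  // Evaluate a tree, standing in for running the compiled code.
  def eval(e: Expr): Int = e match {
    case Num(v)      => v
    case Plus(l, r)  => eval(l) + eval(r)
    case Minus(l, r) => eval(l) - eval(r)
  }

  def main(args: Array[String]): Unit = {
    val before = Plus(Num(1), Num(1)) // models `1 + 1`
    val after = plusToMinus(before)   // models `1 - 1`
    println(s"$before evaluates to ${eval(before)}")
    println(s"$after evaluates to ${eval(after)}")
  }
}
```

A real macro does the same kind of pattern matching and rebuilding, only on the compiler’s tree types and at compile time.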

3. Project Setup with SBT

In Scala, the implementation of macros and the use of macros need to be compiled separately. So if you are using SBT, they need to be in different sub projects. Here is an example of build.sbt:

lazy val root = (project in file("."))
  .aggregate(core, coretest)
  .settings(
    name := "archmage"
  )

lazy val core = (project in file("core"))
  .settings(
    name := "core",
    libraryDependencies ++= Seq(
      "org.scala-lang" % "scala-reflect" % "2.13.12",
      "co.fs2" %% "fs2-core" % "3.9.3",
    )
  )

lazy val coretest = (project in file("coretest"))
  .settings(
    name := "core-test"
  ) dependsOn core

It creates two sub projects. You can implement the macros in core and use them in coretest.

If you want to debug the generated code from macros, add debug flags to Scala like this in build.sbt:

ThisBuild / scalacOptions += "-Ymacro-debug-lite"

4. How to Read AST

4.1 Read macro parameters:

Here is the basic syntax of a macro. First, define a macro implementation:

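import scala.reflect.macros.blackbox

// Runs at compile time: prints the full name of the symbol passed in,
// then returns the tree unchanged.
def macroImpl(c: blackbox.Context)(s: c.Expr[String]) : c.Expr[String] = {
  println(s.tree.symbol.fullName)
  s
}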

The first parameter c: blackbox.Context is a must have for a macro implementation. There is also a whitebox.Context but we will not cover it in this article. More details about whitebox can be found in the official document.

The remaining parameters of the implementation method are the parameters of the macro. For example, if you want the macro to take a parameter of type String, then the macro implementation will take a c.Expr[String] as a parameter, which is the tree representation of the macro’s String parameter. The same applies to the return type of the macro. You can also use c.Tree instead of c.Expr[T]. They can be converted to each other, which we will see in section 4.4.

This example prints out the variable name of the passed in parameter and returns the parameter without modification. Note that the printing happens at compile time since that’s when the implementation of the macro is run. Only the returned tree or c.Expr is used at run time. So this macro is not doing anything useful; it’s just a demo of how to read the input tree.

Once we have the macro implementation, we can define the macro like this:

import scala.language.experimental.macros

def macroTest(arg: String): String = macro macroImpl

Then we can use it in another (sub) project so that the compilation is separated:

val a = "abc"
macroTest(a)

It will print out the full name of a like this during compilation:

me.binwang.archmage.coretest.MethodMetaTest.a

The API of c.Expr is very similar as reflection API. You can experiment with it by print out different things from it and see what you can get.

4.2 Read type parameters:

Macros can also take generic type parameters. The example below takes a parameter of any type and prints out its type at compile time.

def macroImpl[T: c.WeakTypeTag](c: blackbox.Context)(s: c.Expr[T]) : c.Expr[T] = {
  println(c.weakTypeOf[T])
  s
}

def macroTest[T](s: T): T = macro macroImpl[T]

Which can be used like this:

macroTest("abc")
macroTest(1)

The output during compilation will be:

String
Int

4.3 Read implicit parameters:

Macros can have implicit parameters, but the macro implementation shouldn’t declare them as implicit. Otherwise the Scala compiler will give confusing errors. See this issue for more details.

In the following example, macroTest takes an implicit Double parameter and returns it as the newly generated tree:

def macroImpl(c: blackbox.Context)(s: c.Expr[String])(num: c.Expr[Double]) : c.Expr[Double] = {
  println(s"Name of implicit num: ${num.tree.symbol.fullName}")
  num
}

def macroTest(s: String)(implicit num: Double): Double = macro macroImpl

Note how num in macroImpl doesn’t have any implicit definition.

Then the test code:

implicit val num: Double = 1.1
println(macroTest("abc"))

It will print this during the compile time:

Name of implicit num: me.binwang.archmage.coretest.MethodMetaTest.num

And this during the run time:

1.1

4.4 Read code block with by-name parameter

Macros can also take by-name parameters. However, the macro implementation needs to use c.Tree instead of c.Expr for such a parameter:

def macroImpl(c: blackbox.Context)(s: c.Tree) : c.Expr[String] = {
  println(s)
  c.Expr[String](s)
}

def macroTest(s: => String): String = macro macroImpl

See how c.Tree is converted to c.Expr. You can also convert c.Expr to c.Tree by using the .tree method, which we’ve seen in the examples above.

Test it with this code:

macroTest {
  val a = "a"
  val b = "b"
  println("hello!")
  a + b
}

It will print out this during compile time:

{
  val a: String = "a";
  val b: String = "b";
  scala.Predef.println("hello!");
  a.+(b)
}

4.5 Use Quasiquotes

Quasiquotes, or q"...", are a very powerful syntax for Scala macros. They can both match a tree and generate a tree. For example, the following code matches the different parts of an if-else expression to c.Tree variables:

def macroImpl(c: blackbox.Context)(s: c.Tree): c.Tree = {
  import c.universe._
  val q"if ($cond) $thenp else $elsep" = s
  println(cond)
  println(thenp)
  println(elsep)
  q"$cond"
}

def macroTest(s: => Any): Any = macro macroImpl

cond, thenp and elsep are all matched parts from the input tree.

q"$cond" generates a new tree using the matched condition part of the tree. We will see more details in how to use quasiquotes to generate trees in section 5.4.

Test it with this code:

val bigNum = 2
val smallNum = 1
val result = macroTest {
  if (bigNum > smallNum) {
    "no surprise"
  } else {
    "surprise!"
  }
}
println(result)

During the compile time, it will print out the different parts of the tree that we have asked it to match:

bigNum.>(smallNum)
"no surprise"
"surprise!"

And at run time, it will print the value of the condition instead of either branch of the if-else expression:

true

More examples of how to match the tree can be found in the documentation. Click on each example to see more details.

5. How to Generate Tree

5.1 Construct Tree Directly with API

An AST can be constructed from the classes that represent the tree. For example, a string constant can be created with Literal(Constant("I replaced you!")). The following example replaces any string with I replaced you!:

def macroImpl(c: blackbox.Context)(s: c.Expr[String]) : c.Expr[String] = {
  import c.universe._
  c.Expr[String](Literal(Constant("I replaced you!")))
}

def macroTest(s: String): String = macro macroImpl

With the code below, it will print I replaced you! instead of abc:

println(macroTest("abc"))

This is a very simple example. As the tree grows larger, it becomes more and more difficult to construct it with this approach. It’s like a much worse version of Lisp. So in the following sections, we will see some easier ways to construct a tree.

5.2 Use `c.parse`:

c.parse can parse a string as Scala code and generate an AST. For example, the following macro returns the variable name of a String:

def macroImpl(c: blackbox.Context)(s: c.Expr[String]): c.Expr[String] = {
  val name = s.tree.symbol.fullName
  c.Expr(c.parse(s""" "Name of var is: $name" """))
}

def macroTest(s: String): String = macro macroImpl

Then use it like this:

val a = "abc"
println(macroTest(a))

It will print out:

Name of var is: me.binwang.archmage.coretest.MethodMetaTest.a

Note the output is at run time instead of compile time like the examples in the last section, because we’ve replaced the tree with new code.

5.3 Use `reify`

c.parse is easy to use and understand. But as you generate more and more code with it, it can get pretty messy since the code is just a string. There are no syntax checks in the IDE. Even worse, you cannot get any run time information to use in the generated tree.

reify is a much better option. You can write code as usual. The code in the reify block is the code that will be generated. You can refer to another Expr (in the old tree) by using its .splice method. Here is an example that prints out both the variable name and its value:

def macroImpl(c: blackbox.Context)(s: c.Expr[String]): c.Expr[String] = {
  import c.universe._
  val name = c.Expr(c.parse("\"" + s.tree.symbol.fullName + "\""))
  reify {
    s"${name.splice}: ${s.splice}"
  }
}

macroTest and the test code are the same as above. Running the test code produces output like this:

me.binwang.archmage.coretest.MethodMetaTest.a: abc

5.4 Use Quasiquotes

As we’ve seen in section 4.5, q"..." can be used to match a tree. It can be used to generate a tree as well. For example, in the following code:

def macroImpl(c: blackbox.Context)(s: c.Tree) = {
  import c.universe._
  val q"if ($cond) $thenp else $elsep" = s
  q"if ($cond) $elsep else $thenp"
}

def macroTest[T](s: T): T = macro macroImpl

It uses the parts that have been matched by q"..." and generates a new tree from them, swapping the then and else branches. Run it with this test code:

macroTest(if (true) println("a") else println("b"))

It will print b instead of a.

5.5 Avoid Name Conflict

When generating a new tree, we may generate variables whose names conflict with existing ones. Use c.freshName to get a unique name and avoid the conflict.

5.6 Type Checked and Unchecked Tree

There are two kinds of AST inside the Scala compiler: type checked and unchecked. See more details in this Stack Overflow answer. Some APIs only accept either a type checked or an unchecked tree. And sometimes the compiler throws weird errors if the wrong kind of tree is used. If that’s the case, try using c.untypecheck and c.typecheck to convert trees.

For example, here is some code that cannot be compiled:

def macroImpl(c: blackbox.Context)(blockTree: c.Tree) : c.Expr[Seq[String]] = {
  import c.universe._
  val block = c.Expr[Seq[String]](blockTree)
  reify {
    Seq("a").flatMap{_ => block.splice}
  }
}

def macroTest(blockTree: => Seq[String]): Seq[String] = macro macroImpl

// Testing code in another sub project:
val s = "abc"
macroTest {
  val a = s
  Seq(a)
}

The compiler will throw an error:

[error] Error while emitting XXX.scala
[error] value a
[error] one error found

To fix this, we need to convert blockTree to an unchecked tree:

def macroImpl(c: blackbox.Context)(blockTree: c.Tree) : c.Expr[Seq[String]] = {
  import c.universe._
  val cleanedBlock = c.untypecheck(blockTree.duplicate)
  val block = c.Expr[Seq[String]](cleanedBlock)
  reify {
    Seq("a").flatMap{_ => block.splice}
  }
}

ZFS Profiling on Arch Linux

2023-12-14T00:00:00-05:00

I bought a new video game recently but found that z_rd_int processes took almost all the CPU time when I was playing it. That doesn’t make much sense to me since I install games on an uncompressed ZFS dataset. Even though I don’t have a powerful CPU, I don’t expect ZFS to use all of it while reading only about 60-70 MiB/s from each of the NVMe SSDs. To double check, I used iostat -x 1 to confirm that iowait is very low. So disk IO is not the bottleneck.

Without finding any root cause on the Internet, I decided to do some profiling myself. In OpenZFS’ GitHub issues, people use perf for profiling. It looks trivial at first glance. But getting perf to show debug symbols for ZFS cost me a lot of time. So in this article, I will document the steps to enable debug symbols for ZFS, and hopefully it can help more people who face the same difficulty. After that, I will continue with how I found the root cause and the solution. If you’ve seen my previous blog A Boring JVM Memory Profiling Story, this is an even more boring profiling story. But the tool set is important. Use the tools efficiently and hopefully all profiling stories become boring.

1. Enable Debug Info for ZFS

On Arch Linux, if you run perf top, you can see that the kernel has debug symbols attached, like this:

2.95%  [kernel]                                        [k] entry_SYSCALL_64

But for some others, like the zfs ones, it only shows an address like this:

2.65%  [zfs]                                           [k] 0x00000000002990cf

This is because perf cannot find debug info for the zfs module. Let’s enable it now.

1.1 Use DKMS Package

First we need to use the DKMS package instead of a precompiled one, so that we can control the compile options when building the zfs kernel module. On Arch Linux, the package name is zfs-dkms, available either in AUR or in the archzfs repo. Be aware that the packages from these repos are different even though they have the same name. Personally I prefer the archzfs repo since it’s better maintained and has better dependency management.

1.2 Enable debuginfo Flags

TL;DR:

Add these three lines to /etc/sysconfig/zfs, (re)install the zfs dkms package and reboot.

ZFS_DKMS_ENABLE_DEBUG=y
ZFS_DKMS_ENABLE_DEBUGINFO=y
ZFS_DKMS_DISABLE_STRIP=y

Decompress the installed ko file.

sudo unzstd /lib/modules//updates/dkms/zfs.ko.zst

Now you should be able to see zfs symbols in perf top.

Remember to clean up the files after profiling.
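The first step can be scripted. This is a minimal sketch under my own assumptions: the helper name enable_zfs_debug_flags is made up for illustration, and the demo writes to a scratch file instead of /etc/sysconfig/zfs so it can be run without root. It only adds the three flags listed above and skips any that are already present.

```shell
# Hypothetical helper: append the three flags to a config file,
# skipping any flag that is already set to a value starting with "y".
# $1 = path of the config file.
enable_zfs_debug_flags() {
  for flag in ZFS_DKMS_ENABLE_DEBUG ZFS_DKMS_ENABLE_DEBUGINFO ZFS_DKMS_DISABLE_STRIP; do
    grep -q "^${flag}=y" "$1" 2>/dev/null || echo "${flag}=y" >> "$1"
  done
}

# Demo on a scratch file. On a real system the target would be
# /etc/sysconfig/zfs and the call would need root.
demo_file=$(mktemp)
enable_zfs_debug_flags "$demo_file"
enable_zfs_debug_flags "$demo_file"   # the second call adds nothing
cat "$demo_file"
rm -f "$demo_file"
```

Reinstalling the package, rebooting and decompressing the module are left as manual steps here.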

If you care about the reasons behind these changes, continue reading. Otherwise you can skip the rest of this section.

What is `/etc/sysconfig/zfs`?

The package zfs-dkms only installs the code that will be compiled by dkms to /usr/src/zfs-. (I learned this by reading the PKGBUILD of the AUR package.) Then when dkms commands are run, dkms copies the files to /var/lib/dkms/zfs//build to build them and then installs the built ko files to /lib/modules//updates/dkms. So in order to build the zfs module with debug symbols, we need to make dkms use the correct compile flags.

Under /usr/src/zfs-, there is dkms.conf that tells DKMS how to use the source code to build and install modules. We can find some key information there:

PRE_BUILD="configure
  --prefix=/usr
  --with-config=kernel
  --with-linux=\$(
    if [ -e "\${kernel_source_dir/%build/source}" ]
    then
      echo "\${kernel_source_dir/%build/source}"
    else
      echo "\${kernel_source_dir}"
    fi
  )
  --with-linux-obj="\${kernel_source_dir}"
  \$(
    [[ -n \"\${ICP_ROOT}\" ]] && \\
    {
      echo --with-qat=\"\${ICP_ROOT}\"
    }
  )
  \$(
    [[ -r \${PACKAGE_CONFIG} ]] \\
    && source \${PACKAGE_CONFIG} \\
    && shopt -q -s extglob \\
    && \\
    {
      if [[ \${ZFS_DKMS_ENABLE_DEBUG,,} == @(y|yes) ]]
      then
        echo --enable-debug
      fi
      if [[ \${ZFS_DKMS_ENABLE_DEBUGINFO,,} == @(y|yes) ]]
      then
        echo --enable-debuginfo
      fi
    }
  )
"

There are --enable-debug and --enable-debuginfo. Running ./configure --help shows the meaning of these two flags:

  --enable-debug          Enable compiler and code assertions [default=no]
  --enable-debuginfo      Force generation of debuginfo [default=no]

So if those two flags are enabled, the zfs module should be built with debug info. The code above checks ZFS_DKMS_ENABLE_DEBUG and ZFS_DKMS_ENABLE_DEBUGINFO in file ${PACKAGE_CONFIG}. If they are y or yes, the corresponding flags are enabled. At the beginning of dkms.conf we can find PACKAGE_CONFIG is defined as /etc/sysconfig/zfs.

However, defining only ZFS_DKMS_ENABLE_DEBUG and ZFS_DKMS_ENABLE_DEBUGINFO is not enough. I learnt that the hard way. Checking dkms.conf more closely, we can see the code below:

STRIP[0]="\$(
  [[ -r \${PACKAGE_CONFIG} ]] \\
  && source \${PACKAGE_CONFIG} \\
  && shopt -q -s extglob \\
  && [[ \${ZFS_DKMS_DISABLE_STRIP,,} == @(y|yes) ]] \\
  && echo -n no
)"

man dkms shows the meaning of STRIP:

STRIP[#]=
       By default strip is considered to be "yes". If set to  "no",  DKMS  will
       not  run strip -g against your built module to remove debug symbols from
       it.  STRIP[0] is used as the default for any unset entries in the  STRIP
       array.

If STRIP is not set to no, dkms will strip the debug info! So we also need to set ZFS_DKMS_DISABLE_STRIP in /etc/sysconfig/zfs to y or yes so that STRIP[0] will be no.
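The decision that dkms.conf makes for STRIP[0] can be reproduced in isolation. The sketch below is not the code from dkms.conf: it swaps the bash-only ${VAR,,} == @(y|yes) test for a portable tr plus case, and the function name strip_setting is made up for illustration.

```shell
# Hypothetical helper: mirrors the STRIP[0] decision in dkms.conf.
# Prints "no" when the value is y or yes (case-insensitive). Otherwise it
# prints nothing, which leaves DKMS on its default of stripping the module.
strip_setting() {
  value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$value" in
    y|yes) printf 'no' ;;
  esac
}

# Show the result for a few candidate values of ZFS_DKMS_DISABLE_STRIP.
for v in y YES n ""; do
  printf 'ZFS_DKMS_DISABLE_STRIP=%s -> STRIP[0]="%s"\n' "$v" "$(strip_setting "$v")"
done
```

Any value other than y or yes, including an unset variable, leaves STRIP[0] empty, so the debug info is stripped.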

Why unzstd?

On my system, the dkms modules are compressed with zstd when they are installed. But it seems perf is not able to read the compressed module file to find the debug symbols, so we need to decompress it in the same location.

2. Profiling ZFS

perf top can show the CPU usage of each function in real time. But to analyze it better, we can record it with perf record -g -p . It should generate a perf.data file in the current directory. Press Ctrl + C to stop the recording and flush the file.
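There is usually more than one z_rd_int process, and perf record -p accepts a comma separated list of PIDs. A small sketch of building that list follows. The helper name join_pids is hypothetical, and the real perf call is only shown as a comment so the snippet can be run anywhere.

```shell
# Hypothetical helper: turn a newline-separated PID list on stdin into
# the comma-separated form that `perf record -p` accepts.
join_pids() {
  paste -sd, -
}

# Example with fixed PIDs so the result is predictable.
printf '101\n102\n103\n' | join_pids

# On a real system (not run here):
#   sudo perf record -g -p "$(pgrep z_rd_int | join_pids)"
```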

Then use sudo perf report to show the report of the recording. Mine looks like this (press + to expand a row of interest in perf report):

Samples: 277K of event 'cycles:P', Event count (approx.): 244633155596
Children      Self  Command   Shared Object     Symbol
+   96.59%     0.01%  z_rd_int  [zfs]             [k] zio_do_crypt_uio
+   96.58%     0.00%  z_rd_int  [zfs]             [k] crypto_decrypt
+   96.57%     0.01%  z_rd_int  [zfs]             [k] aes_decrypt_atomic
+   75.53%     8.17%  z_rd_int  [zfs]             [k] aes_encrypt_block
+   49.76%     0.00%  z_rd_int  [zfs]             [k] crypto_update_uio
+   49.76%     0.00%  z_rd_int  [zfs]             [k] aes_decrypt_contiguous_blocks
+   49.76%     4.52%  z_rd_int  [zfs]             [k] ccm_mode_decrypt_contiguous_blocks
+   46.42%     2.08%  z_rd_int  [zfs]             [k] ccm_decrypt_final
+   42.15%     6.94%  z_rd_int  [zfs]             [k] aes_aesni_encrypt
-   24.72%    24.36%  z_rd_int  [zfs]             [k] kfpu_end
     24.36% ret_from_fork_asm
        ret_from_fork
        kthread
        0xffffffffc02b15eb
        zio_execute
        zio_done
        zio_pop_transforms
        zio_decrypt
        spa_do_crypt_abd
        zio_do_crypt_data
        zio_do_crypt_uio
        crypto_decrypt
      + aes_decrypt_atomic
-   21.20%    20.96%  z_rd_int  [zfs]             [k] kfpu_begin
     20.96% ret_from_fork_asm
        ret_from_fork
        kthread
        0xffffffffc02b15eb
        zio_execute
        zio_done
        zio_pop_transforms
        zio_decrypt
        spa_do_crypt_abd
        zio_do_crypt_data
        zio_do_crypt_uio
        crypto_decrypt
      + aes_decrypt_atomic
+   14.42%    14.21%  z_rd_int  [zfs]             [k] aes_encrypt_intel
+    7.36%     7.14%  z_rd_int  [zfs]             [k] aes_xor_block
+    6.31%     6.16%  z_rd_int  [zfs]             [k] aes_copy_block
+    1.27%     0.03%  z_rd_int  [zfs]             [k] arc_read_done
+    1.17%     0.02%  z_rd_int  [zfs]             [k] zio_vdev_io_done
+    1.14%     0.00%  z_rd_int  [zfs]             [k] abd_iterate_func

3. Find Root Cause

From the profiling report, we can easily see that the CPU is mostly spent on decrypting the content on ZFS. That makes some sense because decryption does need CPU power. But there is no reason for it to use so much CPU at that throughput. In fact, I had found some performance issues related to encryption and did a few things to rule out some causes:

I made sure the AES hardware acceleration is enabled for my CPU by checking lscpu | grep aes.
My system can decrypt and encrypt at a much higher speed (2000+ MB/s) according to cryptsetup benchmark.

That’s why I need the profiling to confirm where the bottleneck comes from.

Even though the code path is related to decryption, the hotspots are kfpu_begin and kfpu_end. I read the code and had absolutely no idea what they were doing. I asked ChatGPT, and it explained that they save and restore the FPU state. I don’t know whether its answer is correct, but it at least gave me a direction for searching issues. At last I found this GitHub issue: ICP: Improve AES-GCM performance. It says exactly that there is a performance issue with saving the FPU state when doing encryption. The PR improves it for the AES-GCM algorithm. It states that AES-CCM can benefit from a similar fix but the performance improvement will not be as great. So in the discussion of the PR, they decided to change the default encryption algorithm to AES-GCM instead of AES-CCM.

I started using zfs before this PR. So I checked the encryption algorithm on my system with zfs get all | grep encryption. And it is indeed using AES-CCM. To confirm that it is causing the performance issue, I ran some benchmarks on AES-CCM, AES-GCM and unencrypted datasets.

First, I created the datasets:

sudo zfs create -o encryption=aes-256-ccm -o compression=off -o atime=off zroot/root/ccm-test
sudo zfs create -o encryption=aes-256-gcm -o compression=off -o atime=off zroot/root/gcm-test
sudo zfs create -o encryption=off -o compression=off -o atime=off zroot/local_steam_unencrypt

Then I wrote a script to benchmark them:

#!/bin/bash

set -e

function print_cputime() {
	pname=$1
	for pid in `pgrep $pname` ; do
		ps -p $pid -o cputime,etime
	done
}


function benchmark {
	test_name=$1
	test_file=$2

	file_size="20480"

	echo "### Start benchmark $test_name"

	echo "### Print z_wr_iss cpu time before the write test"
	print_cputime z_wr_iss
	echo "### Start write test"
	time dd if=/dev/random of=$test_file bs=1M count=$file_size oflag=direct
	echo "### Pring z_wr_iss cpu time afte the write test"
	print_cputime z_wr_iss

	echo "### Print z_rd_int cpu time before the read test"
	print_cputime z_rd_int
	echo "### Start read test"
	time dd if=$test_file of=/dev/null bs=1M count=$file_size
	echo "### Print z_rd_int cpu time after the read test"
	print_cputime z_rd_int
}

benchmark ccm-test /ccm-test/test-file
benchmark gcm-test /gcm-test/test-file
benchmark non-encrypt-test /data/local_steam/test-file

My ZFS cache is set to 8GB, so I write and read 20GB files. The script uses dd to write and read a file. Before and after each write and read, it uses ps -o cputime,etime to print out the CPU time and wall time used by each of the related ZFS processes.

Running this script creates lots of output. The full output can be found in the appendix at the end. Here are the key lines:

### Start benchmark ccm-test
// ... output omitted ...
21474836480 bytes (21 GB, 20 GiB) copied, 107.307 s, 200 MB/s
// ... output omitted ...
### Start benchmark gcm-test
// ... output omitted ...
21474836480 bytes (21 GB, 20 GiB) copied, 13.7417 s, 1.6 GB/s
// ... output omitted ...
### Start benchmark non-encrypt-test
// ... output omitted ...
21474836480 bytes (21 GB, 20 GiB) copied, 9.03496 s, 2.4 GB/s
// ... output omitted ...

During the test, AES-CCM makes z_rd_int take all the CPU time, as observed before. AES-GCM is much better: z_rd_int takes less than 50%, and for the unencrypted dataset it’s less than 20%. The testing output prints the CPU time and wall time for each of the z_rd_int processes before and after the test, so you can calculate the percentage.
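The percentage can be worked out from two ps samples of the same process. The sketch below is an illustration and not part of the benchmark script: the helper names to_seconds and cpu_percent are made up, and it assumes the [[dd-]hh:]mm:ss format that ps prints for cputime and etime. The sample values are the first z_rd_int process around the ccm-test read, taken from the appendix.

```shell
# Hypothetical helper: convert a ps time value such as 3-03:40:53,
# 00:24:46 or 06:47 into seconds.
to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4)      s = $1 * 86400 + $2 * 3600 + $3 * 60 + $4
    else if (NF == 3) s = $1 * 3600 + $2 * 60 + $3
    else              s = $1 * 60 + $2
    print s
  }'
}

# Hypothetical helper: CPU usage in percent between two samples.
#   $1 = cputime before, $2 = etime before, $3 = cputime after, $4 = etime after
cpu_percent() {
  cpu=$(( $(to_seconds "$3") - $(to_seconds "$1") ))
  wall=$(( $(to_seconds "$4") - $(to_seconds "$2") ))
  # Refuse to divide when no wall time has passed between the samples.
  [ "$wall" -gt 0 ] || return 1
  echo $(( cpu * 100 / wall ))
}

# First z_rd_int process, before and after the AES-CCM read test.
cpu_percent 00:24:46 3-03:40:53 00:26:24 3-03:42:41   # prints 90
```

ps only reports whole seconds, so for the short runs the result is a rough figure rather than an exact one.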

From the test results, we can see that AES-CCM indeed affects read performance a lot. Reads are even slower than writes. We can confirm this is the root cause of our problem.
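To put a number on the difference, the elapsed times that dd reported for the three 20 GiB reads can be compared directly. This is a small arithmetic sketch; the helper name speedup is hypothetical.

```shell
# Hypothetical helper: how many times faster the second run is than the
# first, given the elapsed seconds dd reported for the same amount of data.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 107.307 13.7417   # AES-CCM read vs AES-GCM read, prints 7.8
speedup 107.307 9.03496   # AES-CCM read vs unencrypted read, prints 11.9
```

So on this machine the AES-GCM read finished about 7.8 times faster than the AES-CCM read, and the unencrypted read about 11.9 times faster.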

4. Solution and Workaround

The solution is obvious: just change the encryption from AES-CCM to AES-GCM. But it cannot be done without migrating the dataset to another place and then moving it back. That takes time. In the meantime, I moved my Steam library to an unencrypted dataset since I have enough disk space to do the migration. It doesn’t contain sensitive information. Yes, it exposes the machine to an evil maid attack, but my setup on the machine doesn’t prevent that anyway. See my previous blog Personal ZFS Offsite Backup Solution for more information on putting a machine into an untrusted environment.

I’ll do the migration from AES-CCM to AES-GCM in the future and report back how it works. Stay tuned!

5. Appendix

Here is the full output from the benchmark script:

### Start benchmark ccm-test
### Print z_wr_iss cpu time before the write test
    TIME     ELAPSED
00:47:56  3-03:39:21
    TIME     ELAPSED
00:22:34  3-03:39:21
    TIME     ELAPSED
00:47:54  3-03:39:21
    TIME     ELAPSED
00:47:55  3-03:39:21
    TIME     ELAPSED
00:00:01  3-03:39:17
    TIME     ELAPSED
00:00:00  3-03:39:17
    TIME     ELAPSED
00:04:50    15:30:06
    TIME     ELAPSED
00:04:49    15:29:57
    TIME     ELAPSED
00:04:51    15:29:56
    TIME     ELAPSED
00:04:51    15:29:18
    TIME     ELAPSED
00:00:00    10:07:30
    TIME     ELAPSED
00:00:00       55:49
### Start write test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 91.4066 s, 235 MB/s

real	1m31.414s
user	0m0.059s
sys	0m53.252s
### Pring z_wr_iss cpu time afte the write test
    TIME     ELAPSED
00:49:23  3-03:40:53
    TIME     ELAPSED
00:22:34  3-03:40:53
    TIME     ELAPSED
00:49:21  3-03:40:53
    TIME     ELAPSED
00:49:22  3-03:40:53
    TIME     ELAPSED
00:00:01  3-03:40:49
    TIME     ELAPSED
00:00:00  3-03:40:49
    TIME     ELAPSED
00:04:50    15:31:38
    TIME     ELAPSED
00:04:50    15:31:28
    TIME     ELAPSED
00:04:51    15:31:28
    TIME     ELAPSED
00:04:51    15:30:50
    TIME     ELAPSED
00:00:00    10:09:01
    TIME     ELAPSED
00:00:00       57:21
### Print z_rd_int cpu time before the read test
    TIME     ELAPSED
00:24:46  3-03:40:53
    TIME     ELAPSED
00:00:02  3-03:40:49
    TIME     ELAPSED
00:01:50       06:47
    TIME     ELAPSED
00:01:49       06:47
### Start read test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 107.307 s, 200 MB/s

real	1m47.372s
user	0m0.060s
sys	0m8.091s
### Print z_rd_int cpu time after the read test
    TIME     ELAPSED
00:26:24  3-03:42:41
    TIME     ELAPSED
00:00:02  3-03:42:37
    TIME     ELAPSED
00:03:28       08:34
    TIME     ELAPSED
00:03:27       08:34
### Start benchmark gcm-test
### Print z_wr_iss cpu time before the write test
    TIME     ELAPSED
00:49:35  3-03:42:41
    TIME     ELAPSED
00:22:34  3-03:42:41
    TIME     ELAPSED
00:49:33  3-03:42:41
    TIME     ELAPSED
00:49:33  3-03:42:41
    TIME     ELAPSED
00:00:01  3-03:42:37
    TIME     ELAPSED
00:00:00  3-03:42:37
    TIME     ELAPSED
00:04:50    15:33:26
    TIME     ELAPSED
00:04:50    15:33:16
    TIME     ELAPSED
00:04:51    15:33:16
    TIME     ELAPSED
00:04:51    15:32:38
    TIME     ELAPSED
00:00:00    10:10:49
    TIME     ELAPSED
00:00:00       59:08
### Start write test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 56.9529 s, 377 MB/s

real	0m56.960s
user	0m0.045s
sys	0m53.566s
### Pring z_wr_iss cpu time afte the write test
    TIME     ELAPSED
00:49:42  3-03:43:38
    TIME     ELAPSED
00:22:35  3-03:43:38
    TIME     ELAPSED
00:49:39  3-03:43:38
    TIME     ELAPSED
00:49:39  3-03:43:38
    TIME     ELAPSED
00:00:01  3-03:43:34
    TIME     ELAPSED
00:00:00  3-03:43:34
    TIME     ELAPSED
00:04:51    15:34:23
    TIME     ELAPSED
00:04:50    15:34:14
    TIME     ELAPSED
00:04:52    15:34:13
    TIME     ELAPSED
00:04:52    15:33:35
    TIME     ELAPSED
00:00:00    10:11:46
    TIME     ELAPSED
00:00:00    01:00:06
### Print z_rd_int cpu time before the read test
    TIME     ELAPSED
00:26:24  3-03:43:38
    TIME     ELAPSED
00:00:02  3-03:43:34
    TIME     ELAPSED
00:00:00       00:05
    TIME     ELAPSED
00:00:00       00:05
### Start read test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 13.7417 s, 1.6 GB/s

real	0m13.743s
user	0m0.071s
sys	0m11.215s
### Print z_rd_int cpu time after the read test
    TIME     ELAPSED
00:26:31  3-03:43:52
    TIME     ELAPSED
00:00:02  3-03:43:48
    TIME     ELAPSED
00:00:07       00:19
    TIME     ELAPSED
00:00:07       00:19
### Start benchmark non-encrypt-test
### Print z_wr_iss cpu time before the write test
    TIME     ELAPSED
00:49:42  3-03:43:52
    TIME     ELAPSED
00:22:35  3-03:43:52
    TIME     ELAPSED
00:49:40  3-03:43:52
    TIME     ELAPSED
00:49:39  3-03:43:52
    TIME     ELAPSED
00:00:01  3-03:43:48
    TIME     ELAPSED
00:00:00  3-03:43:48
    TIME     ELAPSED
00:04:51    15:34:37
    TIME     ELAPSED
00:04:50    15:34:28
    TIME     ELAPSED
00:04:52    15:34:28
    TIME     ELAPSED
00:04:52    15:33:49
    TIME     ELAPSED
00:00:00    10:12:01
    TIME     ELAPSED
00:00:00    01:00:20
### Start write test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 56.0508 s, 383 MB/s

real	0m56.052s
user	0m0.042s
sys	0m53.060s
### Pring z_wr_iss cpu time afte the write test
    TIME     ELAPSED
00:49:46  3-03:44:49
    TIME     ELAPSED
00:22:35  3-03:44:49
    TIME     ELAPSED
00:49:44  3-03:44:49
    TIME     ELAPSED
00:49:43  3-03:44:49
    TIME     ELAPSED
00:00:01  3-03:44:44
    TIME     ELAPSED
00:00:00  3-03:44:44
    TIME     ELAPSED
00:04:51    15:35:33
    TIME     ELAPSED
00:04:50    15:35:24
    TIME     ELAPSED
00:04:52    15:35:24
    TIME     ELAPSED
00:04:52    15:34:46
    TIME     ELAPSED
00:00:00    10:12:57
    TIME     ELAPSED
00:00:00    01:01:16
### Print z_rd_int cpu time before the read test
    TIME     ELAPSED
00:26:31  3-03:44:49
    TIME     ELAPSED
00:00:02  3-03:44:45
    TIME     ELAPSED
00:00:07       01:16
    TIME     ELAPSED
00:00:07       01:16
### Start read test
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 9.03496 s, 2.4 GB/s

real	0m9.036s
user	0m0.032s
sys	0m8.207s
### Print z_rd_int cpu time after the read test
    TIME     ELAPSED
00:26:33  3-03:44:58
    TIME     ELAPSED
00:00:02  3-03:44:54
    TIME     ELAPSED
00:00:09       01:25
    TIME     ELAPSED
00:00:09       01:25

Introduce K3s, CephFS and MetalLB to My Highly Available Cluster

2023-11-28T00:00:00-05:00

In a previous blog, Infrastructure Setup for High Availability, I talked about how I set up a cluster infrastructure for high availability applications. I have made a few changes since then. This blog talks about them in detail.

Updated Architecture Overview

Comparing the diagram with the one in Infrastructure Setup for High Availability, the overall structure remains the same, with a few modifications:

Not shown in the graph, but replaced official Kubernetes with K3s.
Replaced GlusterFS with CephFS.
Included cert-manager to get SSL certificates.
Replaced Keepalived on each node with MetalLB.

Replace Kubernetes with K3s

I didn’t know about K3s back when I set up my Kubernetes cluster for the first time. Since then I have heard a lot of good things about it in various places. However, the complexity of the migration, and its installation method through a script from the Internet instead of an OS package, made me think twice before adopting it. But after I watched the video Talk About K3s Internals from Darren Shepherd, I realized how simple K3s is compared to Kubernetes. I highly recommend that everyone who is interested in K3s watch this video.

In short, K3s is a distribution of Kubernetes instead of a fork. With a few patches, it combines the components of Kubernetes into one binary and process, and removes some components not needed in a bare metal environment. By doing so, it makes its binary size and memory footprint smaller than Kubernetes, and makes it easier to deploy and manage. It only needs the k3s binary and a configuration file at /etc/rancher/k3s/config.yaml to start, and all of its content is under /var/lib/rancher/k3s. The official install script adds a little more than just the binary: it has a few scripts to kill and uninstall k3s. It also includes a systemd unit file to start/stop k3s through systemd. So even though it’s not packaged as a standard OS package, I think the complexity is manageable, so I started to experiment with it.

It’s very easy to configure K3s since all it needs is a configuration file on each machine. I created a virtual machine cluster with Vagrant in the project k3s-vm-cluster to experiment with it. Feel free to play with it to get a feel for K3s before going all in. The setup is based on the official guide for High Availability Embedded etcd. It’s the easiest way to set up a highly available K3s cluster.
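To give an idea of how small that configuration file is, here is a sketch that prints a server config for the embedded etcd setup. It is an illustration, not the content of k3s-vm-cluster: the function name k3s_server_config, the token and the address are placeholders, and only the token, cluster-init and server keys from the K3s configuration file format are used.

```shell
# Hypothetical helper: print a K3s server config for embedded-etcd HA.
#   $1 = shared cluster token
#   $2 = URL of an existing server, or empty for the first node
k3s_server_config() {
  printf 'token: "%s"\n' "$1"
  if [ -z "$2" ]; then
    # The first server bootstraps the embedded etcd cluster.
    printf 'cluster-init: true\n'
  else
    # Later servers join through a server that is already running.
    printf 'server: "%s"\n' "$2"
  fi
}

# First node, then a node joining it. 192.0.2.10 is a documentation address.
k3s_server_config "example-token" ""
k3s_server_config "example-token" "https://192.0.2.10:6443"
```

On a real machine the output would be written to /etc/rancher/k3s/config.yaml before starting k3s.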

No load balancer setup is needed if you don’t need external HA for the Kubernetes API server. That means you can still access the Kubernetes API server from within the cluster if any of the machines fails. But if you also want to access it from outside the cluster during a failure, check this doc. Alternatively, I think a load balancer like MetalLB can also do it, but I don’t need it so I didn’t experiment with it.

Distributed Storage System: GlusterFS to CephFS

The biggest motivation for this migration is the deprecation of GlusterFS. I’m using a distributed file system for a few use cases:

Configuration files: this can be migrated to Kubernetes ConfigMaps.
Logs: this can be migrated to a centralized log management system like ElasticSearch. But some of them like Loki in turn depends on another distributed storage.
Data files: this is the most complex one. Some of the services support saving files into S3 compatible systems, but some of them don’t. (I cannot control the services since I only self host them instead of developing them.) One option is to give up HA and just bind those services to a specific host and use local storage.
Docker registry: this belongs to the point “Data files” above, but it is very important so I list it separately. I’m using Sonatype Nexus as the Docker registry. It supports putting packages into S3, but it’s still pretty tricky to get rid of all the local files. This is a service that absolutely needs HA if I want to have a HA cluster. Or I can change to another Docker registry implementation, but I feel pretty comfortable using it so I don’t want to change it.

So it basically comes down to these 2 options:

Use an S3 compatible storage like MinIO, but do a lot of work to configure services to store files in it, and give up HA for the services that cannot do that.
Go ahead and use a real distributed file system like CephFS or Longhorn.

Update: I also explored LINBIT, which I forgot to mention here. It got more and more complex as I went down the rabbit hole. But its architecture looks very interesting to me, so I may explore it more in the future for other use cases.

Option 1 sounded appealing to me at first since I really didn’t want to deal with the complexity of setting up CephFS. But as I went down the rabbit hole, I found that configuring the services to use S3 may be a more complex and less portable process than just setting up CephFS. So in the end I decided to go with option 2.

I heard of CephFS a long time ago but decided to use GlusterFS in previous setups because it was more user friendly. So CephFS seems like a natural choice now that GlusterFS is deprecated, especially since I found that, in addition to the distributed block device, it also supports a file system and an S3 compatible storage system. It’s also easier to install than before because of Rook. Longhorn is another choice I looked at briefly, but because of the wider adoption of CephFS and its richer feature set, I decided to use CephFS in the end.

I mainly use it as Ceph Filesystem, so it’s easier to share volumes between pods. Again, the project k3s-vm-cluster has an example of it. Try playing with it if you are interested. Along the way I actually contributed to the Rook project by improving its docs (#13045) and an error message (#13046).

Network Gateway

In the previous article, I talked about using Cloudflare tunnel, or NodePort and Keepalived, to expose services to the Internet. But a network gateway can do more than just expose the service: it can also do things like terminate SSL encryption. Cloudflare tunnel supports terminating SSL at their end so I don’t need to worry about that. But for some services, I don’t want Cloudflare to see the traffic, so I need to terminate SSL and expose the service by myself.

As I said, exposing the services was done by NodePort and Keepalived, which is not very elegant but works. For terminating SSL, I was using Nginx as a reverse proxy. But updating SSL certificates is a little bit more complex. I don’t want to go into details here because the setup is pretty complex and explaining it would be very lengthy. The point is, with this migration, I want to revisit this part to make it simpler and more elegant.

Kubernetes has a concept of Ingress, and a newer but less mature one, Gateway. What they do is essentially reverse proxying, like Nginx. In fact, Nginx Ingress is a thing. The advantage is that you don’t need to configure all the services in a single place like Nginx’s configuration files. You can create Kubernetes resources for each service, so that the deployment and configuration of each service is totally self-contained. This is a very good feature, especially for a company: when I first started to use Kubernetes in 2015 at a previous company, I felt the pain of not having it. But the features of Ingress are pretty limited. For example, it can only bind to 443, it cannot modify the HTTP content, and so on. So I may still need a layer of Nginx for my use cases. The design of Gateway is too complex and its features don’t really meet all my requirements either.

There are some players like Traefik (shipped with K3s by default) and Istio which overcome the limitations by having their own custom resources. But Traefik cannot get new certificates from Let’s Encrypt with an HA setup. Istio is just too complex and includes features like service mesh that I don’t need. I can see how a service mesh can be useful in big companies, but I prefer not to have another layer on my own services. In the end, I don’t think the complexity is worth it.

But while I was exploring Traefik and Istio, I found cert-manager, which can be deployed into Kubernetes. It can get certificates from Let’s Encrypt and put them into Kubernetes secrets, which can then be mounted into each pod. It supports the Cloudflare DNS API for the ACME DNS challenge, so I don’t need to expose an HTTP service for Let’s Encrypt to verify the ownership of the domain name. With all of these features, I decided to use it and mount the certificates into Nginx pods. It solves the problem of updating certificates from Let’s Encrypt.

For the other problem of exposing the services to the Internet in an HA way, I want to use a more Kubernetes-native way instead of setting up Keepalived outside of the Kubernetes cluster. Kubernetes supports external load balancers, but most of the load balancers it supports are from cloud providers. Then I found MetalLB, which supports creating an HA load balancer without special hardware in a bare metal cluster. I use it in layer 2 mode, which creates a virtual IP like Keepalived and can fail over to another node.

Deploy Services with Code

What I didn’t talk about in the previous blog is that I define the deployment of my services as code instead of YAML files. It gives lots of advantages: for example, you can create models for your own deployment patterns so that you can avoid lots of redundant code. Traditionally it’s hard to define deployments as code. There are lots of frameworks to do it but none of them is easy to use. But with Kubernetes, all you need is to generate a resource object for Kubernetes to use at the end. You can construct it in any way with your favorite language, and either output YAML or call the Kubernetes API directly. It’s like using a high level language instead of writing machine code directly. It’s much more elegant and the maintenance is much easier. Be aware: use a real language instead of some template language. Why limit your power to do things?

This approach worked especially well during this migration. For example, I abstracted the storage layer for all my services, so that when I migrated from GlusterFS to CephFS, I just needed to change the storage class to define the CephFS volume, and the code for the services didn’t need to change much.
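To make the idea concrete, here is a minimal sketch of a deployment defined as code. It is not the code I actually use: the function names, the image, the mount path and the storage class name example-cephfs are all made up for illustration. It prints JSON rather than YAML only to stay within the Python standard library; kubectl accepts both.

```python
import json


def volume(name, storage_class, size):
    """Build a PersistentVolumeClaim; the storage class is the only backend-specific part."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def web_service(name, image, port, storage_class):
    """Hypothetical model of one deployment pattern: one container with one data volume."""
    labels = {"app": name}
    claim = volume(f"{name}-data", storage_class, "1Gi")
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                    }],
                    "volumes": [{
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": claim["metadata"]["name"]},
                    }],
                },
            },
        },
    }
    return [claim, deployment]


# Switching the storage backend is a one-argument change (names here are placeholders).
resources = web_service("example-app", "example/app:1.0", 8080, storage_class="example-cephfs")
manifest = json.dumps({"apiVersion": "v1", "kind": "List", "items": resources}, indent=2)
print(manifest)
```

The output can be piped to kubectl apply -f -. Every service built through web_service picks up a storage change automatically, which is the kind of abstraction that made the migration cheap.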

Hope you enjoy my experience of setting up a HA cluster. Happy hacking and have fun with your own cluster!

Update on RSS Brain to Find Related Articles with Machine Learning

2023-11-14T00:00:00-05:00

In the previous blog about RSS, How RSS Brain Shows Related Articles, I talked about how RSS Brain finds related articles. I’ve updated the algorithm recently, and this blog covers the details of the update. The basic idea is to replace the tf-idf algorithm with text embeddings to represent the articles as vectors, and use ElasticSearch to store and query those vectors.

The Disadvantages of Previous Algorithm

First let’s quickly revisit the algorithm before the update: it used tf-idf, which is basically an algorithm that represents each document as a vector by using the frequency of the words in it. It’s easy to understand, and works well enough in practice to have powered lots of search engines for a long time. However, it has a few shortcomings:

First, it doesn’t understand the meaning of a word. A word can mean different things based on context, order, combinations and so on. Different words can also have the same meaning. Word frequency alone doesn’t capture that.

Second, “word” needs to be defined. That is a relatively easy task for languages like English, which have a built-in word separator character (space). However, for languages like Chinese, there is no obvious way to separate the words. The performance of the tf-idf algorithm largely depends on the performance of the word segmentation algorithm, which itself is much more complex than tf-idf and often involves machine learning as well. Even for languages like English, in order to reduce the first disadvantage above, words are usually broken down so that similar words can be matched.

Last, as an extension of the first disadvantage: it’s hard to do multi-language matches. Word frequency alone doesn’t know that different words in different languages can mean the same thing. Of course you can translate the document to other languages and index the translated documents, but that doesn’t scale well when you need to support more and more languages. And translation algorithms are usually much more complex than tf-idf, and mostly use machine learning too.
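The first shortcoming is easy to see in a toy version of tf-idf. This is only a sketch with one common weighting (term frequency times the log of inverse document frequency) and a naive whitespace tokenizer. It is not what ElasticSearch does internally, and the sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Splitting on whitespace only works for languages that separate words with spaces.
    return text.lower().split()


def tf_idf_vectors(documents):
    tokenized = [tokenize(doc) for doc in documents]
    doc_count = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            word: (count / len(words)) * math.log(doc_count / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "kubernetes cluster storage setup",
    "storage options for a kubernetes cluster",
    "large automobile",
    "big car",
]
vectors = tf_idf_vectors(docs)
shared_words = cosine(vectors[0], vectors[1])  # positive: three words overlap
synonyms = cosine(vectors[2], vectors[3])      # zero: same meaning, no shared words
print(shared_words, synonyms)
```

The last two documents mean the same thing but score zero, because the vectors only have non-zero entries for the exact words each document contains.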

Word and Document Embeddings

With the advancement of machine learning, a new method to represent words as vectors was developed in the paper Efficient Estimation of Word Representations in Vector Space. The vector is called a word embedding. Then, based on that idea, Distributed Representations of Sentences and Documents explores representing paragraphs as vectors. Without going into the details, the basic idea is to take a layer from a neural network trained for an NLP task.

For example, if we have a neural network to predict the nth word given previous words, then we may have a neural network like this:

word[1]   --> vector[1]
word[2]   --> vector[2]    --> layer2 --> ... -> classifier -> output
...
word[n-1] --> vector[n-1]

Words are mapped to vectors at the first layer, with something like

\[v = w * W + b\]

Here $v$ is the vector, and $w$ is the one-hot encoded word. The matrix $W$ and the bias $b$ are the trained parameters. There are many other parameters in the later layers of the neural network but we don’t care about them. We only take $W$ and $b$ so that we can compute the vector for any word. With this method, the similarity between words can be measured by computing the similarity of their vectors. Also, surprisingly, quoting from the paper Efficient Estimation of Word Representations in Vector Space: “To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector $X = vector(biggest) − vector(big) + vector(small)$.” What a beautiful result!
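Here is a tiny sketch of that first layer and of the analogy trick. The values of $W$ and $b$ below are picked by hand for a four-word vocabulary so that the analogy holds by construction; in a real model they come out of training and have hundreds of dimensions.

```python
import math

vocabulary = ["big", "biggest", "small", "smallest"]

# Toy parameters chosen by hand for illustration; real W and b come from training.
W = [
    [1.0, 0.0],   # big
    [1.0, 1.0],   # biggest
    [-1.0, 0.0],  # small
    [-1.0, 1.0],  # smallest
]
b = [0.0, 0.0]


def one_hot(word):
    return [1.0 if word == entry else 0.0 for entry in vocabulary]


def embed(word):
    # v = w * W + b, where w is the one-hot row vector for the word.
    w = one_hot(word)
    return [sum(w[i] * W[i][j] for i in range(len(w))) + b[j] for j in range(len(b))]


def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y)))


# X = vector(biggest) - vector(big) + vector(small)
x = [p - q + r for p, q, r in zip(embed("biggest"), embed("big"), embed("small"))]
closest = max(vocabulary, key=lambda word: cosine(x, embed(word)))
print(closest)
```

Because $w$ is one-hot, the multiplication just selects one row of $W$, which is why the first layer can be treated as a lookup table from words to vectors.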

I was aware of this research not long after it came out, and I believe some commercial search engines have used it since then. But the ecosystem of models, tools and databases really picked up after GPT-3 came out. So recently, I decided to use it in RSS Brain because of how easy it is to do nowadays.

Select a Model to Use

The first step is to select a model to use. I think OpenAI may have the best model that is available to the public. You cannot access the model itself, but there are APIs you can call to use it. But I don’t like it for 2 reasons: first, I don’t like OpenAI as a company: it first presented itself as a non-profit organization with the goal of making AI accessible to everyone, then stopped publishing models or even the algorithm details. Second, I don’t want vendor lock-in.

There is also Llama. But it’s not really a multilingual model. I’ve seen some attempts to train it on other languages, but the results are not that good in my experience. The license of the model is not commercial friendly either. And there is no easy-to-use API to get the embeddings.

In the end I found SentenceTransformers. It has lots of pretrained models. I selected the model paraphrase-multilingual-mpnet-base-v2 since it’s a multilingual model. But it’s called “sentence” transformers for a reason: there is a limit on the length of the document that you can feed into the models. I ended up getting embeddings only for the article titles. I think that’s good enough for my use case.

Implementation Details for Model Server

The SentenceTransformers library is very easy to use. However, it’s implemented in Python so it needs a way to communicate with the RSS Brain server, which is written in Scala. Since this is a computation-heavy task, the first thought is to have a buffer queue in between so that the Python program can process the articles at a speed it can handle. Kafka is a good choice for an external task queue but I don’t think it’s worth the complexity of adding another component to the system. So I created a buffer queue at both ends to avoid creating too many requests while maintaining some parallelism. Here is what the whole architecture looks like:

The green parts in the diagram mean that the workers in them can work concurrently. On the Scala side, it follows the pattern I experimented with in Compare Task Processing Approaches in Scala. On the Python side, it’s trickier since Python’s async handling is far worse than Scala’s plain old Future, not to mention effect systems like Cats Effect. I may write another blog about it in the future.

The reason I go into such detail on this relatively simple problem is that it represents a category of problems: problems that need Python to do some async work because that is where the library support is. For example, in the future, the Python server may get more features like fetching YouTube transcripts. The architecture to integrate them into RSS Brain would be the same.
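For the Python side, the buffer queue idea can be sketched with a bounded asyncio.Queue and a fixed number of workers. This is a simplified illustration rather than the actual RSS Brain model server: compute_embedding is a stand-in for the real model call, and the queue size and worker count are arbitrary.

```python
import asyncio


def compute_embedding(title):
    # Hypothetical stand-in for the real model call, which is CPU heavy and blocking.
    return [float(len(title))]


async def worker(queue, results):
    while True:
        article_id, title = await queue.get()
        try:
            # Run the blocking computation in a thread so the event loop stays responsive.
            results[article_id] = await asyncio.to_thread(compute_embedding, title)
        finally:
            queue.task_done()


async def process(articles, queue_size=4, worker_count=2):
    queue = asyncio.Queue(maxsize=queue_size)
    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(worker_count)]
    for article in articles:
        # put() waits while the queue is full, which slows the producer down.
        await queue.put(article)
    await queue.join()
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


articles = [(1, "Hello"), (2, "RSS Brain"), (3, "Ceph")]
results = asyncio.run(process(articles))
print(results)
```

The bounded queue is what provides the back pressure: the producer cannot run ahead of the workers by more than queue_size items, while worker_count controls how much work happens in parallel.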

Database to Store and Query Embeddings

There are a few vector databases that can store vectors and, given a vector, query its nearest neighbors. ElasticSearch added vector field support in 7.0 and approximate nearest neighbor (ANN) search in 8.0. Since RSS Brain already uses ElasticSearch heavily for searching, I can just use it without adding another database as a dependency. It also supports machine learning models so that you don’t need to insert the embedding vectors from outside, but I find that not as flexible.

Once the vectors are inserted into ElasticSearch, it’s just an API call to get the most similar documents. The details of vector insert and query are in the ElasticSearch KNN search document. One tricky part is that even though ElasticSearch supports combining ANN search with other features like term searches (the tf-idf algorithm) by using a boost factor, it doesn’t work well unless you are willing to tune it. That’s because the embedding vector and the term vector mean different things, and their similarity scores are not really comparable. So I ended up enabling only vector search for finding related articles, instead of combining it with term searches.
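As a rough sketch, these are the shapes of the index mapping and the search body for the ElasticSearch 8 kNN search option. The field name title_embedding, the returned title field and the tiny 3-dimensional vector are placeholders for illustration; the real dims value must match the output size of the embedding model. The code only builds the JSON bodies and doesn’t talk to a cluster.

```python
import json


def vector_mapping(field, dims):
    # An indexed dense_vector field is what allows approximate kNN search.
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                }
            }
        }
    }


def related_articles_query(field, query_vector, k=10, num_candidates=100):
    # Body for a search request that uses the top-level knn option.
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["title"],
    }


mapping = vector_mapping("title_embedding", dims=3)
query = related_articles_query("title_embedding", [0.1, 0.2, 0.3], k=5)
print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

The mapping is sent when the index is created, and the query body goes to the index’s _search endpoint. num_candidates trades accuracy for speed: more candidates per shard means better recall and slower queries.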

Result

It’s actually hard to define metrics for the performance of finding related articles. I don’t trust metrics like click rate, since a click doesn’t necessarily show that the articles are related. I think the only way for me is to review the results manually and compute a score based on that. But I don’t think it has much value, since supporting multiple languages alone makes it much better than the previous algorithm. If you are using RSS Brain, you can see the results yourself and let me know what you think about the new algorithm!

Add Index Sidebar to My Blog

2023-11-10T00:00:00-05:00

In a previous blog, Add Index to My Blog, I talked about how I added an index page to my blog that puts all the articles into categories. I always wanted the index to be a sidebar instead of a single page, but I couldn’t wrap my head around how to implement it, so I gave up at the time. But recently, when I started to use Obsidian and checked some demos of Obsidian Publish, I found having a sidebar so useful and beautiful that I decided I should implement it.

You can see the result right now: if you are on a big screen device, the index is on the left side of the page. If you are on a small screen device like a mobile phone, it will show a menu button at the top left corner instead. Clicking it will take you to the index.

When I implemented it, I wanted to keep it simple and stupid. That means:

I want it to be as simple as possible as long as it does its job: show articles in nested categories.
I want to use as little Javascript as possible so people can still use it with Javascript disabled.

I find the design of Obsidian Publish very good, so I copied lots of details from it with some modifications: I didn’t implement showing/hiding sub-items when clicking on an index entry since I think it’s not necessary, and I like how it looks when all the articles are listed there: it feels like I’ve written lots of things. The categories are sorted in alphabetical order and the posts are ordered by publish date. I also added the publish year to each article entry: some articles can look outdated, but if people notice the publish year they can understand the context.

Since I’m using Jekyll, I can generate plain HTML where possible to avoid using Javascript. So the sidebar is generated for each page, instead of using Javascript to keep the sidebar and replace the article content on the fly. Javascript is only used for 2 features:

Remember the position of the sidebar when jumping between pages.
Scroll the sidebar to show the entry for the current page if it’s not in the viewport.

Neither of the features is that important, so the sidebar is still usable without Javascript. Even the menu button on small screens doesn’t pop up a dialog. It just jumps to a new static page that has the whole index, so no Javascript is needed.

The previous implementation of the index page uses recursive templates: since the nested index is a tree, rendering the content in a recursive manner is a natural thought. However, I made the mistake of putting complex logic into the template engine. So this time, I traverse the tree with Ruby code and generate a list for the template to render. It has all the information like the entry type, the depth of the entry and so on. It makes the template code much simpler, so it’s easier to implement other features on top of it.
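The flattening step looks roughly like the sketch below. The real plugin is written in Ruby; this is a Python illustration of the same idea, and the shape of the tree (the categories and posts keys) is made up rather than copied from the plugin.

```python
def flatten(node, depth=0):
    """Walk the category tree depth-first and return a flat list of entries."""
    entries = []
    # Categories in alphabetical order, each followed by its own content.
    for name in sorted(node.get("categories", {})):
        entries.append({"type": "category", "title": name, "depth": depth})
        entries.extend(flatten(node["categories"][name], depth + 1))
    # Posts of this level, ordered by publish date.
    for post in sorted(node.get("posts", []), key=lambda p: p["date"]):
        entries.append({"type": "post", "title": post["title"], "depth": depth})
    return entries


index = {
    "categories": {
        "Technology": {
            "categories": {
                "Kubernetes": {
                    "posts": [{"title": "Cleanup Ceph Filesystem", "date": "2023-11-04"}],
                },
            },
            "posts": [{"title": "Add Index Sidebar", "date": "2023-11-10"}],
        },
    },
}
entries = flatten(index)
for entry in entries:
    print("  " * entry["depth"] + entry["title"])
```

With a list like this, the template only needs a single loop that indents each entry by its depth, with no recursion in the template language.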

If you want to check out the detailed implementation, go to my GitHub repo for the blog and check jekyll/_plugins/Index.rb and jekyll/_includes/index_menu.html.

How to Cleanup Ceph Filesystem for Deleted Kubernetes Persistent Volume

2023-11-04T00:00:00-04:00

Ceph is a distributed file system. Rook is a project to deploy it with Kubernetes. I recently replaced GlusterFS in my Kubernetes cluster with Ceph. I will write a blog (or a series of blogs) for the migration. But in this article, I will just talk about a problem I encountered, just in case I forget it.

Once Rook is deployed in Kubernetes, you can create a Ceph Filesystem and use it for persistent volumes (PVs). Each PV’s data will be stored in a folder in the filesystem. If the PV’s reclaim policy is set to retain, the data will not be deleted after the persistent volume is manually deleted. It’s safer this way. But what can you do if you want to clean up the data? Normally you should change the PV’s reclaim policy before you delete the PV, then Rook’s operator will automatically reclaim the storage in Ceph. But what if you forgot or didn’t know that (like me), and want to clean up the data afterwards?

First, we need the folder/subvolume names in Ceph that store each PV’s data. We can get that by using kubectl describe pv and looking for the field subvolumeName. But since the PV is deleted, we need to find the mappings for the existing PVs and compare them with the folders/subvolumes in Ceph. This is the command to show all of the existing ones:

kubectl get pv -o yaml | grep subvolumeName  | sort

Then we need to find all the existing folders/subvolumes in Ceph’s filesystem: Start a Ceph toolbox pod based on the doc. Then go into the pod and find the filesystem’s name first:

ceph fs ls

After getting the filesystem’s name, list all the subvolumes in the csi subvolume group:

ceph fs subvolume ls <filesystem-name> csi | grep 'name' | sort

Compare this list with the list above, and you should be able to find a subvolume that exists in Ceph but is not shown in Kubernetes’ PV mapping. Use this command to check its info:

ceph fs subvolume info <filesystem-name> <subvolume-name> csi

If you are sure this is the folder you want to delete, use this command to delete it:

ceph fs subvolume rm <filesystem-name> <subvolume-name> csi
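If there are many volumes, comparing the two lists by eye gets tedious. Here is a small sketch that does the comparison, assuming you saved the output of the kubectl command and of the ceph fs subvolume ls command as text. The subvolume names in the sample are made up.

```python
import re


def pv_subvolumes(kubectl_output):
    # Lines look like "subvolumeName: csi-vol-..." in the PV YAML.
    return set(re.findall(r"subvolumeName:\s*(\S+)", kubectl_output))


def ceph_subvolumes(ceph_output):
    # Lines look like '"name": "csi-vol-..."' in the JSON printed by Ceph.
    return set(re.findall(r'"name":\s*"([^"]+)"', ceph_output))


# Made-up sample outputs for illustration.
kubectl_output = """
      subvolumeName: csi-vol-aaa
      subvolumeName: csi-vol-bbb
"""
ceph_output = """
        "name": "csi-vol-aaa"
        "name": "csi-vol-bbb"
        "name": "csi-vol-ccc"
"""

# Subvolumes that exist in Ceph but are not referenced by any existing PV.
orphans = sorted(ceph_subvolumes(ceph_output) - pv_subvolumes(kubectl_output))
print(orphans)
```

Anything in the resulting list is only a candidate for deletion. Check it with the info command first, since deleting a subvolume cannot be undone.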