<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://www.binwang.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.binwang.me/" rel="alternate" type="text/html" /><updated>2026-05-04T16:45:35-04:00</updated><id>https://www.binwang.me/feed.xml</id><title type="html">Bin Wang - My Personal Blog</title><subtitle>This is my personal blog about computer science, technology and my life.</subtitle><entry><title type="html">TCode: An AI Coding Agent Leverages Neovim and Tmux</title><link href="https://www.binwang.me/2026-04-12-TCode-An-AI-Coding-Agent-Leverages-Neovim-and-Tmux.html" rel="alternate" type="text/html" title="TCode: An AI Coding Agent Leverages Neovim and Tmux" /><published>2026-04-12T00:00:00-04:00</published><updated>2026-04-12T00:00:00-04:00</updated><id>https://www.binwang.me/TCode:-An-AI-Coding-Agent-Leverages-Neovim-and-Tmux</id><content type="html" xml:base="https://www.binwang.me/2026-04-12-TCode-An-AI-Coding-Agent-Leverages-Neovim-and-Tmux.html"><![CDATA[<blockquote>
  <p>If I have seen further, it is by standing on the shoulders of giants.</p>

  <p>– Isaac Newton</p>
</blockquote>

<p>In the last <a href="/2026-01-12-My-2025-in-Review.html">year end review post</a>, I mentioned I wanted to try coding agents other than Claude Code, since I want to avoid vendor lock-in for such an important tool in the age of AI coding. I said I would try some open source implementations like Open Code and migrate to them. The plan has been executed much better than I thought: right at the beginning of the new year, I implemented my own coding agent called <a href="https://github.com/wb14123/tcode">TCode</a>, and it’s much more powerful than tools like Claude Code and Open Code.</p>

<h2 id="problems-of-existing-tools">Problems of Existing Tools</h2>

<p>Other than the vendor lock-in problem, the existing coding agents also have other problems.</p>

<p>Since those tools are TUI based, I expect to use them in a familiar way with vim-like keybindings. In this aspect, Claude Code is actually much better than Open Code, since it doesn’t clean up the <strong>main</strong> conversation history in the output (main conversation means messages not including tool calls or subagent details), so that I can navigate the history with tmux. (But Claude Code sometimes seems to clean up the conversation history in the output too; I’m not sure if it’s a bug or a feature.) Open Code manages its own output and buffer, so I must use its keybindings for basic operations like navigating through the conversation. It doesn’t support vim-like keybindings and is very hard to use (at least for me).</p>

<p>Other than the keybindings, there is a trade-off about how much detail to show for things like thinking tokens, tool calls, subagents, permissions and so on: show too much and it’s noisy; show too little and the user doesn’t know what is actually happening. Claude Code has some shortcuts like ctrl + o to expand some of the details, but I have never found them easy to use. Even with that, the details are not enough, and Claude Code keeps showing less and less information on the grounds that it’s a distraction for the user, which many users disagree with.</p>

<p>The last problem is transparency. Closed source projects make it hard to understand what has been changed. Even a simple prompt change can make a big difference to the result. Recently there have been many reports of Claude Code performance degradation or cost increases, and many of them are caused not by the LLM itself but by the client changing its behavior. The quality of Claude Code also doesn’t seem to be very good: there are always small issues here and there, which are not deal breakers but are annoying, and they are hard to fix because it’s closed source.</p>

<p>Since these problems exist even in open source tools, I decided to write my own tool, so that it not only fits my workflow but also gives me a deeper understanding of how LLMs work. Luckily, I don’t think it’s too hard to implement one. In contrast, even though I think IDEs are important too, I wouldn’t write one like IntelliJ IDEA, because it’s just not worth the effort.</p>

<h2 id="how-tcode-resolves-the-problems">How TCode Resolves the Problems</h2>

<p>The last problem listed above can mostly be resolved by open sourcing, which I ended up doing for TCode. For the first two problems, TCode resolves them by leveraging existing tools: <a href="https://neovim.io/">Neovim</a> and <a href="https://github.com/tmux/tmux">tmux</a>.</p>

<p>Let me give a brief introduction to these two dependencies first, even though they are already very popular, for people who are not very familiar with them. Neovim is a drop-in vim replacement with better defaults and easier plugin creation (which is critical for using it as a dependency). I never used it much before. But because of writing and using TCode, I started to use it and find it very pleasant, especially with projects like <a href="https://www.lazyvim.org/">LazyVim</a> that give a good starting point for configuration.</p>

<p>Tmux is a “terminal multiplexer”, a term whose exact meaning I still don’t (bother to) know even after more than 10 years of using it. I’d like to think of it as a tiling window manager, but just for terminals: you can divide the terminal window vertically and horizontally, to as many levels as you want. You can also create new tabs and sessions. All of these can be navigated with convenient keybindings. It’s the core tool of my terminal workflow. I cannot live without it: for example, it’s one of the reasons I can tolerate macOS.</p>

<p>Before I start with TCode itself, let me show a screenshot of it, so it’s easier to understand what I’m talking about.</p>

<p><img src="https://raw.githubusercontent.com/wb14123/tcode/refs/heads/master/assets/demo.gif" alt="demo" /></p>

<h3 id="writing-and-showing-messages-with-neovim">Writing and Showing Messages with Neovim</h3>

<p>How does TCode leverage those two tools to resolve the problems above? Let’s start with keybindings, focusing first on writing and showing messages. The AI coding agent’s interface is very similar to a chat application, but with much more complexity. For showing messages, the user usually needs to navigate through the conversation, because the messages are often too long to fit in a single screen and often contain markdown styles and code. For editing messages, a good editor makes it easier to write better prompts, with things like code snippets, more structured thoughts and so on.</p>

<p>Both problems can be resolved very well by vim, or Neovim in this case: vim supports many highlight styles, has convenient keybindings in normal mode for navigation, and is itself one of the most powerful editors. Even better, it can reuse all of the user’s existing customization. So when adopting TCode, the user doesn’t need to get used to a new style, color scheme or set of keybindings; they can just use the existing ones.</p>

<p>So, TCode uses Neovim for showing and writing messages, which are the two panels on the left of the screenshot above.</p>

<p>In addition, Neovim is also used to preview diffs, file writes and bash commands.</p>

<h3 id="use-tmux-to-manage-details">Use Tmux to Manage Details</h3>

<p>From the screenshot, you can see there are a few panels, including the ones for writing and showing messages that we just talked about. All of them are in tmux, so you can just use tmux keybindings to navigate between them. You can also customize the layout.</p>

<p>But even more powerful is how it resolves the problem of balancing the level of detail: by default, the main conversation panel just shows an overview of the messages. For tool calls and subagents, it just shows a one-line description and the collapsed input. But if you want to see what is really happening, you can open the detailed view in a new tmux tab. For example, a subagent’s detailed view has everything the main conversation has: you can even interrupt it and send new messages to it. Those details are all available without making the main conversation’s UI noisy, all thanks to the powerful tmux.</p>
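
<p>To make the idea concrete, here is a minimal sketch of how a program could ask tmux to open a detail view in a new window (tab). It is not TCode’s actual code: the function names, the window name and the viewer command are all hypothetical. The only real interface it relies on is the <code class="language-plaintext highlighter-rouge">tmux new-window -n NAME COMMAND</code> command line.</p>

```rust
use std::process::Command;

/// Builds the argument list for `tmux new-window`.
/// `-n` sets the window name; the last argument is the shell command
/// that tmux runs inside the new window.
/// Both values are supplied by the caller and are hypothetical here.
fn detail_window_args(window_name: &str, viewer_cmd: &str) -> Vec<String> {
    vec![
        "new-window".to_string(),
        "-n".to_string(),
        window_name.to_string(),
        viewer_cmd.to_string(),
    ]
}

/// Asks the running tmux server to open the detail view.
/// This only works when called from inside a tmux session.
/// Returns whether the tmux command exited successfully.
fn open_detail_window(window_name: &str, viewer_cmd: &str) -> std::io::Result<bool> {
    let status = Command::new("tmux")
        .args(detail_window_args(window_name, viewer_cmd))
        .status()?;
    Ok(status.success())
}
```

<p>For example, calling <code class="language-plaintext highlighter-rouge">open_detail_window("subagent-1", "nvim -R /tmp/subagent-1.md")</code> would open a read-only Neovim on a (made up) log file in a new tab, leaving the main conversation panel untouched.</p>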

<h2 id="other-improvements">Other Improvements</h2>

<p>Other than the main features I talked about above involving Neovim and tmux, I also made lots of tweaks in TCode, because I can write it exactly as I want. Examples include improvements to the bash tool, shortcuts like <code class="language-plaintext highlighter-rouge">/plan</code> and <code class="language-plaintext highlighter-rouge">/review</code> that expand to detailed prompts in the Neovim editing window, using a local Chrome browser so that it’s able to use all the existing accounts, and so on. There are just so many of them that I cannot list them all here. Read the <a href="https://github.com/wb14123/tcode/tree/master/docs">user docs</a> if you are interested.</p>
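
<p>As an illustration of the shortcut idea only, the sketch below expands a leading <code class="language-plaintext highlighter-rouge">/plan</code> or <code class="language-plaintext highlighter-rouge">/review</code> into a longer prompt. The function name and the prompt texts are placeholders I made up for this example, not TCode’s real prompts or implementation.</p>

```rust
/// Expands a leading slash shortcut into a longer prompt template.
/// The template texts below are placeholders, not TCode's real prompts.
/// Input that does not start with a known shortcut is returned unchanged.
fn expand_shortcut(input: &str) -> String {
    // Split "/review some text" into the command and the rest of the line.
    let (command, rest) = match input.split_once(' ') {
        Some((c, r)) => (c, r.trim()),
        None => (input, ""),
    };
    let template = match command {
        "/plan" => "Write a step-by-step plan before changing any code.",
        "/review" => "Review the current changes and list concrete problems.",
        _ => return input.to_string(),
    };
    if rest.is_empty() {
        template.to_string()
    } else {
        // Keep whatever the user typed after the shortcut below the template.
        format!("{template}\n\n{rest}")
    }
}
```

<p>Because the expansion is plain text, the result can be dropped into the editor buffer and edited further before it is sent.</p>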

<h2 id="implementation-details">Implementation Details</h2>

<p>I also want to briefly talk about how I implemented it. Of course I implemented it using coding agents: Claude Code at the beginning, and TCode itself after the core features had been implemented. But for the core, I wrote the initial version of the llm-rs library by hand, which manages the conversation and events. This way I keep the core logic clean, and I can shape the whole architecture the way I want without resorting to complex prompting.</p>

<h2 id="conclusion">Conclusion</h2>

<p>So here it is, the coding agent I wrote, already one of my core tools that I use every day. I’m very happy I finished it to avoid vendor lock-in and be able to control the LLM behaviour more precisely. Hopefully other people will find it useful too. And hopefully there is a day in the near future that local LLMs can be powerful enough so that I can control the backend of it precisely as well.</p>]]></content><author><name></name></author><category term="tcode" /><category term="LLM" /><category term="vim" /><category term="tmux" /><category term="linux" /><category term="TUI" /><category term="machine learning" /><category term="AI" /><summary type="html"><![CDATA[If I have seen further, it is by standing on the shoulders of giants. – Isaac Newton]]></summary></entry><entry><title type="html">My 2025 in Review</title><link href="https://www.binwang.me/2026-01-12-My-2025-in-Review.html" rel="alternate" type="text/html" title="My 2025 in Review" /><published>2026-01-12T00:00:00-05:00</published><updated>2026-01-12T00:00:00-05:00</updated><id>https://www.binwang.me/My-2025-in-Review</id><content type="html" xml:base="https://www.binwang.me/2026-01-12-My-2025-in-Review.html"><![CDATA[<p>Here we are. The year of 2026!</p>

<h2 id="moved-to-a-new-place">Moved to A New Place</h2>

<p>Each year of my life seems to have a theme. For example, immigration, Covid, having a daughter and so on. For the year of 2025, it’s about moving. Not to another city (well, technically it is another city, but it still belongs to the Greater Toronto Area), but from downtown to the suburbs.</p>

<h3 id="the-urban-planning-problem-of-toronto">The Urban Planning Problem of Toronto</h3>

<p>The first sign of moving came when I was still <a href="/2025-02-28-Travel-Back-To-China-2025.html">travelling in China last year</a>. I received an email from the condo manager saying there was a new development proposal just in front of our building. The proposal is a 50+ storey high rise. It’s wedged between multiple surrounding buildings, with gaps of only 15 to 25 meters. Because it’s much taller, it will block all the surrounding buildings on its side. The area is already one of the highest density areas of Toronto. I thought the proposal was ridiculous and never thought it would be approved. However, I was wrong and it went through.</p>

<p>I was very disappointed by it. I don’t think of myself as a NIMBY, and I actually like high density urban areas. I grew up in high density Chinese cities and have always lived in apartments since around the age of 10. It’s convenient and efficient. I thought it was normal until I started to <a href="/books/the-death-and-life-of-great-american-cities">get interested in urban planning</a>, watched videos about North American zoning problems, and experienced them first hand after I moved here. Toronto, as the largest city of Canada, has two problems instead of the one problem that is typical in most American cities. The two problems can be summarized by a single phrase: the missing middle. Let’s first talk about the typical North American problem that Toronto shares: too many low density areas. The problem is especially obvious in Toronto: even in the downtown core, there are lots of single family homes just beyond popular retail streets like Queen West. It’s really a waste of land and cannot be sustainable. Just in 2025, Toronto failed to legalize sixplexes across the city, even with the most left leaning mayor in years, and even at the risk of losing federal housing funds because of that.</p>

<p>On the other hand, Toronto tries to resolve the problem by creating another problem: areas with density that is too high. Toronto is famous for building lots of high rise condos, even far outside of its downtown core. I really appreciate the desire for higher density, but sometimes the density is too high for the infrastructure to support it. This is especially true for education: during the house hunting, I saw lots of school catchments that exclude the condos just across the street. The floor plans of those buildings are also often very awkward: irregular room shapes, pillars in the middle of the room, and mostly small one bedrooms or very small two bedrooms without enough space for kids. It’s also very noisy. I’ve been to lots of big cities and Toronto is one of the loudest, maybe just behind New York. Instead of properly planning for high density, Toronto just throws as many homes as possible onto whatever land it can build on without much push back, like old factory and commercial areas. It’s like cleaning up a house: instead of properly cleaning it, you throw all the garbage into one room and call it a day. A coworker once visited Toronto and wondered: you have so many high rise buildings, why do you still have a housing shortage? Well, that is why.</p>

<p>As a result, most of these buildings don’t suit or even consider families with kids. The residents are mostly passers-by who don’t see the place as their long term destination. In this feedback loop, the residents of those buildings are mostly treated as second-class citizens: they are expected to put up with noise, worse schools and so on. When people complain about problems like noise in forums, other people say “well, it’s downtown, what did you expect”. Those people have just never lived in a properly built high density area that just works and is pleasant to live in.</p>

<p>Asian countries have lots of high density areas that are pleasant to live in. For example, I find an interesting difference in zoning patterns between Toronto and Chinese cities: in lots of Chinese cities, even large cities like Beijing and Shanghai, there are usually multiple high rise buildings at the center of the block, forming what is called a “小区”, which literally means “small area” but actually means a (sometimes gated) building compound that is built by the same developer and managed by the same company. They usually have amenities like green space, playgrounds, community centers and so on. It’s quiet because there is enough room between buildings, most of the buildings are not right next to a busy road, and almost all the traffic is local traffic, or there is even no traffic at all because of underground garages. You can trust small kids to run around there freely. Because it’s at the center of the block, it doesn’t block the sunlight and views on commercial streets, which makes walking and shopping on such streets pleasant. Whereas in Toronto, the high rise buildings are mostly on the busy streets, with single family homes at the center of the block. This makes life unpleasant for the majority of residents because of the noise, and makes walking unpleasant for pedestrians because of the blocked sunlight and the wind tunnels, especially in a city with long winters.</p>

<p>Together, the two problems leave Toronto with too few affordable quality homes. When the city brags about how many new homes it has built, it is building homes no one wants to live in. The new building proposed near my home is a good example of that: it adds lots of new homes on paper, but ruins many more homes near it. It just makes me sad because Toronto not only fails to develop to its full potential, but also actively destroys neighbourhoods I liked.</p>

<h3 id="house-hunting">House Hunting</h3>

<p>With the new building making the current place really unpleasant to live in, and with my daughter growing up, we decided to move to a new place. During the house hunting, I got disappointed again and again by the urban planning of the city (or the cities within the Greater Toronto Area). The two problems happen not only in downtown Toronto, but all over the place. There are lots of neighborhoods considered to be good that have no place to go within walking distance. There are lots of “new downtowns” in surrounding cities that built lots of high rise residential buildings, but without the commercial space that was planned, and some even tore down existing commercial buildings. Where some of them do have commercial space at ground level, there are lots of parking spaces and fences separating it from the street, so the street has little hope of becoming a vibrant shopping street.</p>

<p>The house hunting process is stressful because it’s a really big life decision: I don’t want to move as often as I did before having a kid. There are so many possibilities, and the economic uncertainty makes it hard to make such a big financial decision. It’s also time consuming: in addition to work and taking care of the kid, house hunting pretty much occupied all of my remaining free time.</p>

<p>As much as we still wanted to stay in the city, we were priced out of good neighborhoods with decent schools, unless we bought a really old house, which I really don’t have much confidence in maintaining. So after investigating lots of places in the Greater Toronto Area, we settled in the suburbs. I tried my best to still make it possible to not totally depend on a car, so while the place is remote, it’s within walking distance of schools, multiple parks, a <a href="https://en.wikipedia.org/wiki/GO_Transit">GO train station</a> and plazas that have grocery stores, a pharmacy, fast food and so on. The Chinese community is strong here. It’s less than a 10 minute drive to some walkable main streets that usually host events in the summer and on holidays. There is also a plan to re-develop a street nearby to make it higher density and more main street like, even though I have very little confidence in it based on past experience. At least I still have excuses to walk when I feel like it instead of sitting in front of a computer or in a car all day.</p>

<p>But even with all the planning and careful consideration, it’s still a major lifestyle shift. I’ve never lived in a suburb in my life. I’m not really sure if I will enjoy it or not. I still find downtown Toronto really exciting and enjoyable. In the late spring of 2025, when I picked up a food order on a Friday night, walking along the lively Queen Street and Chinatown, looking at the vibrant lights from the shops and restaurants, with well-dressed people walking on the street and cars and streetcars slowly driving by, I felt I would definitely miss the place I’ve lived in for 5 years. It’s the place where I owned my first home in a new country, and where my first kid was born. So I wrote a poem:</p>

<blockquote>
  <p>木兰花</p>

  <p>华灯初照长街晚，
衣袂翩翩车缓缓，
参天老树影婆娑，
春夜好风无限暖。</p>

  <p>明年许是长街远，
此梦劝君多缱绻，
关山遮月水茫茫，
青鸟去时难复返。</p>
</blockquote>

<p>It’s hard to translate it to English but I’ll give it a try:</p>

<blockquote>
  <p>Lanterns wake the long street as evening descends,</p>

  <p>Dresses flutter soft, and slow the traffic wends.</p>

  <p>Old trees cast shadows, trembling, swaying low,</p>

  <p>Spring night, sweet wind – such warmth without end.</p>

  <p>–</p>

  <p>Next year this street may lie far away,</p>

  <p>So hold this dream close, let it gently stay.</p>

  <p>Mountains veil the moon, and waters stretch to mist –</p>

  <p>The bluebird, once flown, returns not this way.</p>
</blockquote>

<h3 id="daycare">Daycare</h3>

<p>A milestone for my daughter also happened in the same spring: she started to go to daycare. I almost overlooked this when writing this year end review: with the move, the events that happened at the beginning of the year seem a few years away. Anyway, it’s such an important change to my life, since my wife and I don’t need to stay with her all the time, so we finally have some time of our own. It has been a huge benefit to both her and us, but it had a rocky start. As expected for every child, she needed to adapt to the new schedule, which was hard for me as well when I saw her being so sad to leave us at the beginning. But with the help of the really nice teachers there, she adapted pretty quickly and enjoyed the daycare. However, as perhaps happens to every child starting daycare, she caught all kinds of viruses and the whole family was sick on and off the whole spring.</p>

<p>The daycare created another challenge for the house hunting: the waiting lists are really long for daycares across the Greater Toronto Area (or maybe across Ontario). And we really didn’t want to change my daughter’s school as she had just fitted in. Luckily it turned out well after we moved to the new place, and she enjoys the new daycare a lot nowadays (I touched on that a little bit in <a href="/2025-10-06-Download-Message-Images-from-Seesaw.html">a previous blog post</a>). The larger new home also provides the necessary space for her growing needs.</p>

<h2 id="moved-in">Moved In</h2>

<p>Other than the daycare change, the move doesn’t seem to have had as big an impact on me as I thought. I work from home anyway, and didn’t really depend on downtown for things like social events. The only change is that I need to drive for 10-20 minutes instead of walking for 10-20 minutes for decent shopping, which is less pleasant but still acceptable. I also lost the lake view when I bike in the summer, but I moved in not long before winter so that disadvantage has yet to be felt, and being much closer to nature, I feel like some other outdoor activity will fill that gap. The disruption came mainly from the process of moving rather than the lifestyle change: when I moved, the team at work also had a small re-org, and I also needed to go on a business trip. Combined with the transition of my daughter’s daycare (she needed to wait for one month after we moved before starting the new daycare), everything seemed to happen at the same time.</p>

<p>But the end result is good now that we have moved in. I’m really glad about the layout of the new home and feel it meets our everyday needs very well: I have a very bright workspace with some impressive bookshelves. My wife has space for working at home if needed. My daughter has a much larger space to play. We have room for a second child if we want one in the future, or for family to visit. We finished the basement and now I have room to work out (hopefully something I pick up again in the new year). With the <a href="/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs.html">finishing of the network setup</a> so that I can continue to use my workstation, I finally felt settled. That is a perfect ending to the whole moving theme of the year.</p>

<h2 id="distributed-system">Distributed System</h2>

<p>Before I traveled back to China and envisioned the new year, I thought I would continue with the distributed PostgreSQL project that started from <a href="/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html">the Jepsen test of Patroni</a>. However, with the house hunting and moving, I had little time. But I’m glad I still explored the possibility of setting up highly available PostgreSQL with DRBD, which is the next direction I discussed in that article. While doing that, I found the approach had fundamental flaws, some of which I captured in the blog post <a href="/2025-08-13-Why-Consensus-Shortcuts-Fail.html">Why Consensus Shortcuts Fail in Distributed Systems</a>. I think it’s my best blog post of the past year even though the structure may not be very clear to other readers: I was mostly recording the different scenarios I explored and why they would fail. Still, some people emailed me about the article and we had some deep discussions, which rarely happens on the Internet and would not have happened if I hadn’t created the opportunity by writing the article.</p>

<p>So with the fundamental flaws of existing solutions, I started to explore my own solution. In order to do that, I need a lower level language than Scala, the language I like the most and use in most of my personal projects. I selected Rust since it’s gaining popularity in projects like the Linux kernel. With that, I ended the year with two Rust projects to get familiar with the language. The <a href="/2025-10-28-My-First-Rust-Project.html">first one</a> is mostly a toy and I didn’t really touch it much after it was finished. However, the <a href="/2025-12-10-A-Rust-CLI-Program.html">second one</a> resolved a long-standing problem of mine and really showed the advantages of Rust: speed and low resource requirements. I feel like those are good starting points for writing a more complex distributed system in Rust.</p>

<p>For the new distributed system project in Rust, I’m not aiming for something that can be used in production, but rather to learn things and provide a theoretical solution. For example, when writing a simple file lock so that no two instances of a process can run on the same machine at the same time, I explored the OS APIs that provide file locks and found that Rust actually provides such an API. There are lots of such little things that seem easy and trivial until you actually implement them. I hope that while writing the project, I can fill all such gaps.</p>

<h2 id="reading-and-writing">Reading and Writing</h2>

<p>Other than the distributed system project, this blog and my reading are the projects that continued. I feel like I did a decent job of writing down the explorations I did in the past year in the blog, even with little free time. I’m pretty proud of that.</p>

<p>On the reading side, the number of books is lower than I’d like. But I made a pretty big change to my blog to <a href="/2025-07-02-Improve-Books-Section-of-My-Blog.html">Improve Books Section of My Blog</a>. With that, I started to write notes after I finish a book. I feel like I get more out of a book this way instead of just “finishing” it.</p>

<p>At the end of the year, I started a reading project about human history, starting from the most ancient civilizations like those of the Near East and Egypt, to the classical world of Greece, Rome and Persia, to the Islamic empires, to ancient India. I already have a book list and finished <a href="/books/a-history-of-the-ancient-near-east-ca-3000-323-bc">the first one</a>. Hopefully I can finish most of them in the next year. I hope this shift to systematic reading can help me understand things better and more comprehensively.</p>

<p>In <a href="/2025-01-13-A-2-Year-Reflection.html">last year’s review</a>, I mentioned the project RSS Brain. Unfortunately, I had little time in the past year to work on it. Even though there were some small changes, I didn’t release any version in the past year. One reason is that the project is pretty mature: I use it every day as my main gateway to the Internet’s information, and find that few features are missing. On the other hand, the Rust projects gave me less reason to touch this Scala project. I’m not sure what the new year will look like for this project. Maybe I’ll add some features I wanted like 2FA login and some quality of life changes, but likely nothing big.</p>

<h2 id="a-turbulent-world-at-the-age-of-ai">A Turbulent World at the Age of AI</h2>

<p>The past year was surely an eventful year, with Trump becoming the new president of the US, slapping tariffs on all countries, and even talking about annexing Canada and Greenland. In last year’s <a href="/2025-02-28-Travel-Back-To-China-2025.html">travel back to China blog</a>, I talked about the possibility of a war between China and Taiwan, but even I thought it was a stretch. Who would have thought a war between the US and Canada would be on the table this year? With Trump testing the resilience of democracy in the US, it’s hard to take anything for granted. For example, what if there really is a war between the US and Canada?</p>

<p>Ever since I migrated to Canada, watching the local events and events in the US, I started to think democracy is flawed. It’s a paradox: average people may not have the expertise to make the best decisions for their own benefit in this complex world. But if they outsource the execution to experts, the experts may not represent the interests of average people. In China, because of the authoritarian government, democracy seems to be the potential answer to everything. But now, I’m really not sure what the best answer is. I feel lots of people in China have realized the same thing (maybe something to observe when I travel back to China this year). The only thing that can prevent that may be education that equips all, or at least most, voters with enough knowledge. But that needs desire from both the people and the politicians. I don’t see any trend like this in North America. China probably has a better chance once it adopts democracy because it values education a lot, even though the hope is still very dim.</p>

<p>The development of AI has only increased the uncertainty. My daughter was born in the same year that ChatGPT was released. I thought at the time, what a world the next generation is born into! In the past year, LLM applications, especially coding agents with tool use, have advanced so much. I went from using LLM tools as a search engine, to using them as a more advanced editor, to solving small tasks, to complex ones in large code bases and even creating whole projects with supervised planning. I feel AI coding tools can do a really meaningful part of software engineering now. Even if there is a bubble in AI, the revolution has surely started. That is far from the consensus though: even on tech heavy websites like Hacker News, I still see lots of denial. Some people say it’s just a static machine that outputs words. So what? As long as it can finish the job. Some people say it can make mistakes. So what? Lots of businesses and engineers also make mistakes, incidents happen all the time and they are doing fine. I know that from working on distributed systems: lots of companies are using half baked distributed systems that can lose data and make mistakes, but the business is still running as usual.</p>

<p>The AI tools still need experienced software engineers to use them well for now, and they still cannot solve really hard and complex problems. But it’s only the start. Give it a few years, and combined with remote work in tech jobs, I feel like tech jobs will be like manufacturing jobs, shifting from developed countries like the US to low cost areas like India and Eastern Europe. When that happens, what will happen to the current software engineers who get paid very well? In the past, I sometimes thought about the path to retiring, just to continue programming, but freely, without doing things I don’t like. But recently, I started to think seriously about how to survive salary cliffs, just to make sure I have enough to support the family even if the tech industry in Canada is destroyed.</p>

<p>Thinking more about the longer term, what will happen if AI replaces jobs and there are no higher tech jobs for the displaced workers to move to? In the past, when manufacturing jobs were replaced by automation, the next generation could go to the tech jobs that did the automation. But if AI can figure out better AI by itself one day, what’s left for humans to do? Is it the end of jobs? What does that mean? Does that mean everyone will have AI that produces enough things for them so that they don’t need to work anymore, and everyone can be like ancient Greek philosophers with real free time to think about deeper things like the meaning of life? Or does that mean capital has everything and labour has no leverage to bargain, and the rich will get richer and the poor will get poorer with almost zero class mobility? You may ask why the rich need to get richer if AI produces so much that it’s enough for everyone. Well, I think one thing AI cannot replace is the role of some humans as tools to make other humans feel superior and powerful. I know not every human is like that, but unfortunately it’s that kind of human who is motivated enough to stay in power.</p>

<p>So I don’t know what the future would look like. Even what would look like next year. Trump started the year to invade Venezuela. Would 2026 mid term election limit Trump or would Trump start something bigger to capture more power? How advance would the AI develop next year? Is there anything else than RAM (after GPU) that AI makes the average people not able to afford? With this much uncertainty, it can feel pointless to do things, like spend most time of the year to find a good long term home for the next decades. But the theme of life is uncertainty, always have been, starting from ancient times, which has been explored by lots of philosophers and religions. I cannot control all of those things but I’ll focus on the things I can control, which is taking care of my daughter and my family, continue my projects on distributed systems, reading and blogging. And I’ll explore to replace Claude Code with some open source tools like Open Code to not let a single company to control such an important tool in my workflow. Let’s see what future will give us.</p>]]></content><author><name></name></author><category term="life" /><summary type="html"><![CDATA[Here we are. The year of 2026!]]></summary></entry><entry><title type="html">Music Video Generation with AI</title><link href="https://www.binwang.me/2026-01-02-Music-Video-Generation-with-AI.html" rel="alternate" type="text/html" title="Music Video Generation with AI" /><published>2026-01-02T00:00:00-05:00</published><updated>2026-01-02T00:00:00-05:00</updated><id>https://www.binwang.me/Music-Video-Generation-with-AI</id><content type="html" xml:base="https://www.binwang.me/2026-01-02-Music-Video-Generation-with-AI.html"><![CDATA[<p>I did some music video generation experiments in November. Though lots of people are doing similar things nowadays and I’m not really doing anything unique, I got lots of joy and was kind of addicted to it for a few weeks. 
So I thought I’d record the experiments just to complete the missing piece of my blog posts in 2025, before I write the year end review blog.</p>

<h2 id="how-it-started">How It Started</h2>

<p>It started when some AI generated music videos became popular on the Chinese video platform Bilibili. The videos take characters from the popular classic Chinese novel <a href="https://en.wikipedia.org/wiki/Journey_to_the_West">Journey to the West (西游记)</a> and make them sing songs in a recording studio. The results are very impressive. <a href="https://www.bilibili.com/video/BV1tVsHznELh">Here</a> is an example using Wukong (孙悟空) as the character.</p>

<p>That made me think: I like to write things. I write traditional Chinese <a href="/poetry">poems</a>, but they are hard for average people to consume and can feel pedantic. I had always thought I could also write good lyrics, but they are useless without music. With AI now advanced enough to create really good music videos, I thought I’d give it a try. So this article is about the experiments I did, including the tools, models, GPU platforms and commercial products I explored. The end results are three music videos. I wrote the lyrics for two of them myself, and the third uses ChatGPT generated lyrics. None of them got many views though.</p>

<h2 id="music-generation-with-suno">Music Generation with Suno</h2>

<p><a href="https://suno.com/">Suno</a> is probably the most popular music generation platform at the moment. I tried it when it first came out a few years ago. It was already very impressive back then, but I found it was not so good at pronouncing Chinese words (even though its ability to generate Chinese songs surprised me). I played with the free credits and stopped after I spent all of them. With the new wave of music videos on the Internet, and judging from the comments under those videos, it seems its new V5 model is much better than the older models. So I gave it another try. The free credits ran out pretty quickly, so I subscribed for one month.</p>

<p>The model is much better at Chinese pronunciation, though still far from perfect. It often reads pretty common words wrong: about 90% of the songs it generated had at least one incorrect pronunciation. But it’s usable, and it seems this can be mitigated by substituting other words/characters with the same pronunciation, which I never bothered to try.</p>

<p>You can control the song by adding prompts in the lyrics, enclosed by <code class="language-plaintext highlighter-rouge">[]</code>, for things like the emotion, genre, instruments and so on. It needs a little bit of prompt engineering, very much like image generation models in the early days. What I found really helpful is to use LLMs to generate the prompts: I tell the LLM I want to generate music using Suno, and give it the lyrics and the goal I want. Then it can come up with prompts using professional music terminology. Then I try them in Suno, and if some parts are off, I go back and ask the LLM to change specific sections of the prompts. This feels like how people used image generation models back in the day. I believe that, like those models, there will be more advanced music generation models with more natural prompting, like the Nano Banana models nowadays.</p>

<p>Another thing I found helpful is to tune up the “Weirdness” in the advanced options, which can make the generated music less boring.</p>

<p>Even with all the tricks above, the quality of generated music is still like gambling. A slight change in prompts, lyrics or options can create very different results. I don’t need the music to be really good since what I care more about is the lyrics, but I still want it to express them in the way I imagined while writing the lyrics. Achieving that takes lots of attempts. It’s both frustrating and addictive at the same time, which is also very much like gambling I guess. It’s the most time consuming part (in terms of human involvement) of the process.</p>

<p>If it’s just for writing lyrics, music alone is good enough for me. However, it’s pretty boring to share just an audio file on the Internet. I also wanted to explore the capability of video generation models nowadays. So my journey continued.</p>

<h2 id="generate-longer-videos">Generate Longer Videos</h2>

<p>Nearly all the models nowadays can only generate short videos of less than 1 minute, and the quality tends to get worse the longer the video is. But a song is at least 2-3 minutes long. So the trick is to combine short videos into longer ones.</p>

<p>One way to do that is to first create some images for key frames, then prompt the model to start with one key frame and end at another, and then combine the videos through editing. The results from this method can be hit or miss: the content in the clips doesn’t always match, and it needs more capable models and more attempts to get a good result.</p>

<p>This is where I found the cleverness in the videos I shared in the first section: the author generated the videos in a mostly static environment, a recording studio, which avoids conflicting content between clips. The whole video can be generated from a single image, and the clips are then merged with some transition effects to smooth them out.</p>

<h2 id="video-generation-models">Video Generation Models</h2>

<p>From my research, <a href="https://wan.video/research-and-open-source">Wan</a> based models seem to be the most popular open source model family. 2.1 and 2.2 seem to be very mature and lots of tools support them, while 2.5 is the newest version and is supposed to give better results.</p>

<p>There is also a model called <a href="https://github.com/MeiGen-AI/InfiniteTalk">InfiniteTalk</a>. Judging from the models it depends on, I believe it’s based on Wan2.1. It takes an image and an audio clip as input, which is perfect for the music video use case where the character is mainly just singing in a recording studio.</p>

<h2 id="video-generation-tools">Video Generation Tools</h2>

<p>While we have the models, most of them are not user friendly, not to mention that some of them only ship model files. We need tools to run them.</p>

<p>The most popular one may be <a href="https://github.com/comfyanonymous/ComfyUI">ComfyUI</a>. Surprisingly, there is no official Docker image for it. There are a few third party ones but I don’t really trust those, so I wrote <a href="https://github.com/wb14123/comfy-docker">a wrapper</a> myself. The UI is very similar to the node editors in 3D software, which let you edit the workflow pipeline by dragging nodes and connecting them with arrows. It may look familiar to people who work more closely with graphics, 3D models or video editing, but I find this approach really hard to use. I’d rather write a few lines of Python code than drag nodes around to build if-else branches and for-loops. More importantly, it doesn’t have a good dependency system: after importing a workflow and trying to run it, there is no good way to download all the models the workflow depends on, so it’s hard to reproduce the workflows other people share. So I just tried an official workflow with the Wan 2.2 model, which uses the Comfy Cloud service to run the model. The result is fine but the price is too high. Because of my dislike of the UI, I gave up on it in the end.</p>

<p>The next tool I found is <a href="https://github.com/deepbeepmeep/Wan2GP">Wan2GP</a>. It has lots of built in models and workflows including Wan 2.2 and InfiniteTalk. It’s much easier to use and requires fewer resources to run. The InfiniteTalk model it uses can theoretically generate videos of unlimited length with the method I talked about in “Generate Longer Videos” above: it automatically uses the last frames of the previous clip to generate the next clip. But in order to generate longer videos, you need to tune the config in a JSON file so that the UI allows you to generate longer clips.</p>

<h2 id="gpu-platforms">GPU Platforms</h2>

<p>I have a GPU locally but the whole story is kind of sad: I built the current machine <a href="/2016-06-19-Build-A-Computer-for-Deep-Learning.html">back in 2016</a> in the hope of doing some machine learning projects. And I did <a href="https://github.com/wb14123/seq2seq-couplet">train a model</a> to play Chinese couplets which got some attention at the time. When I came to Canada, because of the incompatible power plugs, I bought a cheap adapter from Amazon. Unfortunately, the moment I plugged in my machine, sparks came from the adapter and the machine was dead. After some debugging, I found that the most expensive part of the machine, the GPU, was dead. Since there were lots of other things to do after moving to Canada, I didn’t feel the need to buy a replacement with an Nvidia 1050 at hand. That lasted until the pandemic hit: I wanted to do some machine learning projects again, and found GPU prices had skyrocketed. After waiting without seeing any hope of prices dropping, I bought a 3080Ti at a very high price in 2021, only to wait for ChatGPT and the open source models that followed to be released, and then find out that the 3080Ti doesn’t have enough memory to run a model large enough to be useful.</p>

<p>On the image generation side, it can run Stable Diffusion models at lower resolutions. However, it’s far from enough for video generation. I tried it with Wan2GP, which already uses fewer resources than other tools, but it made my whole machine freeze. So I needed to find a cloud GPU platform.</p>

<p><a href="https://www.runpod.io/">Runpod</a> is a popular platform from my research, so I gave it a try. I did get Wan2GP to run on it but the product has lots of rough edges. Sometimes the remote SSH ports don’t work. Sometimes a pod with a custom image fails to start because it gets stuck downloading the Docker image, and worse, it still charges money when this happens. It’s also not that cheap: ~$0.59/hour for an RTX 4090 pod. Wan2GP needs 1-2 hours to generate a 10-20 second video. So a music video can cost a few dollars.</p>
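
<p>To make that estimate a bit more concrete, here is a rough back-of-the-envelope sketch in Python. The 2 minute song length and the throughput of about 10 seconds of video per GPU hour are assumptions picked from the ranges above, not measurements, and <code class="language-plaintext highlighter-rouge">estimate_cost</code> is just a name made up for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def estimate_cost(song_seconds, video_seconds_per_gpu_hour, usd_per_gpu_hour):
    """Return (gpu_hours, cost_usd) for rendering a whole song on a rented GPU."""
    gpu_hours = song_seconds / video_seconds_per_gpu_hour
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# Assumed inputs: a 2 minute song, about 10 seconds of video per GPU hour
# (1 hour for a 10 second clip, 2 hours for a 20 second clip), $0.59 per hour.
hours, cost = estimate_cost(120, 10, 0.59)
print(f"{hours:.0f} GPU hours, about ${cost:.2f}")
</code></pre></div></div>

<p>With these assumptions it comes out at 12 GPU hours and about $7.08 for one pass over the song, before counting failed attempts or the time a pod spends downloading images and models. A longer song or a slower run scales the number up linearly.</p>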

<p>I found some Chinese GPU platforms to be cheaper in comparison. For example, <a href="https://www.xiangongyun.com/">xiangongyun.com</a> provides RTX 4090 and RTX 4090D GPUs. Yes, you heard that right: the 4090D is an Nvidia GPU targeted only at the Chinese market because of the US government’s ban on GPUs like the 4090 in China. Ironically, the 4090D seems to be a better GPU for AI related tasks: it’s less powerful but also cheaper and draws less power with the same amount of memory. There are even modded versions which double the memory from 24G to 48G. The prices on Xiangongyun look like this:</p>

<ul>
  <li>¥1.89/h (~$0.27) for 4090 24G</li>
  <li>¥1.59/h (~$0.22) for 4090D 24G</li>
  <li>¥2.59/h (~$0.37) for 4090D 48G</li>
</ul>

<p>It’s much cheaper than Runpod. But it doesn’t support Docker images. Instead, it seems to target a wider audience without a software engineering background: it has a desktop environment like GUI, which lets you operate the pod and create custom images through it. Because it’s located in China, you also need ways to work around the problems caused by the <a href="https://en.wikipedia.org/wiki/Great_Firewall">GFW</a>. For example, setting up a HuggingFace proxy with something like <code class="language-plaintext highlighter-rouge">export HF_ENDPOINT=http://hf.x-gpu.com</code>. Unfortunately, I failed to install Wan2GP on it because of a shared Nvidia library issue: it seems to mount some Nvidia libraries through Docker and I cannot change them or install another version. There are lots of third party images, including ComfyUI with many models pre-installed, but I didn’t feel they were trustworthy enough to run, so I gave up in the end.</p>

<p>Speaking of GPU platforms, I actually built such a platform for a freelance project around 2017. It uses Kubernetes under the hood, can create pods and allocate GPUs to them, supports custom Docker images and mounting file systems for datasets, and lets you view logs. It can also expose service ports so that you can use things like Jupyter Notebook. From the experience of using the services above, I think the product I built back then was pretty advanced and I’m proud of it. Hopefully they can make good use of it.</p>

<h2 id="commercial-video-generation-products">Commercial Video Generation Products</h2>

<p>As noted above, self hosted video generation with a rented GPU is pretty expensive: a few dollars per song. So I also looked into some commercial providers. Lots of the providers say they have a free trial, but almost none of them can generate a video successfully without paying, including the official one from Wan. At last I found a Chinese provider called <a href="https://jimeng.jianying.com/">Jimeng (即梦)</a>, created by ByteDance, the company behind TikTok. Without the new user promotion, the price is not attractive: basically comparable to renting a GPU. With the promotion, the first month of subscription costs less than one dollar. It comes with some credits when you first subscribe, then free credits every day for a month. The initial credits are enough to create videos for about 2/3 of a song, and each day’s free credits cover about 1/3 of a song. That’s at least a price at which I can actually try to finish a music video, so I settled on it in the end. It seems ironic that I settled on a commercial provider after so much effort, but I’m glad I explored the possibilities.</p>

<h2 id="overall-workflow">Overall Workflow</h2>

<p>To summarize, the overall workflow to generate a music video with AI is:</p>

<ul>
  <li>Write lyrics.</li>
  <li>Put them into Suno to try. Optionally use an LLM to write prompts for Suno. Try until satisfied.</li>
  <li>Break the song into clips whose length is supported by the video generation model. Make sure each cut point is natural, for example, not in the middle of the singing.</li>
  <li>Create an image as the input for video generation. I use ChatGPT or Nano Banana for this.</li>
  <li>Feed the audio clips and the image to the video generation model to generate multiple video clips.</li>
  <li>Merge the generated videos and add transition effects in between. I use Kdenlive for editing the videos.</li>
  <li>Sometimes a generated video is longer than the input audio, with some blank parts at the beginning and end. Use video editing software to sync the audio and video.</li>
</ul>
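
<p>The clip splitting step in the list above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the pause timestamps are found by hand (for example by looking at the waveform in the video editor), <code class="language-plaintext highlighter-rouge">plan_clips</code> is a made up helper name, and the 20 second limit is just an example value for what the model can handle:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect


def plan_clips(song_length, pauses, max_clip):
    """Split a song at natural pauses so that no clip exceeds max_clip seconds.

    pauses are timestamps in seconds (inside the song) where nobody is singing.
    Returns a list of (start, end) tuples covering the whole song.
    """
    end = float(song_length)
    # Cut candidates: the start, every pause, and the end of the song.
    points = sorted({0.0, end, *(float(p) for p in pauses)})
    clips = []
    i = 0  # index of the current clip start in points
    while points[i] != end:
        # Greedy choice: the last candidate at most max_clip seconds after the start.
        j = bisect.bisect_right(points, points[i] + max_clip) - 1
        if j == i:
            raise ValueError(f"no natural cut point within {max_clip}s after {points[i]}s")
        clips.append((points[i], points[j]))
        i = j
    return clips


# Hypothetical 60 second song with pauses between the lyric lines.
for start, stop in plan_clips(60, [9, 18, 26, 41, 52], 20):
    print(f"{start:5.1f}s to {stop:5.1f}s")
</code></pre></div></div>

<p>Always taking the latest pause that still fits keeps the number of clips small, which matters when every clip costs GPU time or credits. If the function raises an error, that stretch of singing is longer than the model can handle in one clip and the cut has to land mid-phrase.</p>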

<h2 id="results">Results</h2>

<p>As stated above, the results are 3 Chinese music videos: <a href="https://www.bilibili.com/video/BV1xp2uBHEP8">1</a>, <a href="https://www.bilibili.com/video/BV1DekSBhEAU">2</a> and <a href="https://www.bilibili.com/video/BV1XBCtBGEgG">3</a>. I think the results are good in general, other than some small things which I didn’t bother to fix because of the cost:</p>

<ul>
  <li>The model still cannot handle hands properly. Sometimes there are 3 hands, sometimes there are 6 fingers.</li>
  <li>The model still tries to sync the lips even when there is only background music playing.</li>
</ul>

<p>I had lots of fun creating them. I don’t know why but it was really addictive like video games. However, I couldn’t justify the costs after the new user promotion was over so I gave up after that.</p>

<p>During my research, I also find the video and image generation communities are really similar to the early day video gaming and modding communities. Lots of people may not be professional programmers, but learn to use the models and tools with passion, and maybe learned some level of programming on the way. There is no things like Git for proper sharing and version tracking, just binary files and model files everywhere, shared through some sketchy cloud drive providers. The tutorials are everywhere: on video platforms like Youtube, in forums like Reddit and so on. It’s a mess but so much fun.</p>]]></content><author><name></name></author><category term="AI" /><category term="GenAI" /><category term="Suno" /><category term="ComfyUI" /><category term="Wan" /><summary type="html"><![CDATA[I did some music video generation experiments in November. Though lots of people are doing similar things nowadays and I’m not really doing anything unique, I got lots of joy and was kind of addicted to it for a few weeks. So I thought I’d record the experiments just to complete the missing piece of my blog posts in 2025, before I write the year end review blog.]]></summary></entry><entry><title type="html">Home Network Setup with OpenWrt and VLANs</title><link href="https://www.binwang.me/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs.html" rel="alternate" type="text/html" title="Home Network Setup with OpenWrt and VLANs" /><published>2025-12-27T00:00:00-05:00</published><updated>2025-12-27T00:00:00-05:00</updated><id>https://www.binwang.me/Home-Network-Setup-with-OpenWrt-and-VLANs</id><content type="html" xml:base="https://www.binwang.me/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs.html"><![CDATA[<p>I moved to a new place recently. Using this as an opportunity, I re-configured the whole home network which I wanted to do long time ago. I put lots of thoughts into it. Even though I’m a software engineer for many years, I rarely touched the network layers below L3. 
So when researching and setting it up, I learned a lot on the way. This article shares the journey and the final setup I have.</p>

<h2 id="1-terminal-devices">1. Terminal Devices</h2>

<p>I basically have three categories of terminal devices in the network:</p>

<p>The first group is a cluster for self-hosted services. I talked about it in a previous blog post, <a href="/2023-03-13-Infrastructure-Setup-for-High-Availability.html">Infrastructure Setup for High Availability</a>. The devices in this cluster are trusted since they mostly run open source software with regular updates. It’s also hard to set up a firewall inside this cluster because of the network complexity of Kubernetes. So I want the devices in this cluster to be able to talk with each other without any limits. They should also all have Internet access. I also want users outside of my home network to be able to access the self-hosted services. Some of them go through Cloudflare tunnels so I don’t need to expose them to the Internet. But there are some more private services where I don’t even want Cloudflare to see the traffic, and services with large traffic volume like file and photo sharing, so I still need to expose some ports to the Internet directly.</p>

<p>The second group is the typical home user devices, like laptops, mobile phones, TV boxes, game consoles and so on. I want them to be able to connect to the Internet, but preferably not be able to talk with or discover each other. They should also have access to the self-hosted services in the first group.</p>

<p>The third group is the devices that hold sensitive data and that I don’t trust at all, for example security cameras. I want devices in the other groups to be able to talk with them, but these devices shouldn’t have any access to other devices at all, or to the Internet.</p>

<p>Here is a diagram showing the three groups of the devices: the green arrow means access without limitation, and yellow arrow means limited access with firewall rules:</p>

<p><img src="/static/images/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs/home-network-requirement.png" alt="home-network-requirement" /></p>

<h2 id="2-physical-space">2. Physical Space</h2>

<p>This section is not really about the technical side of network setup, but about home renovation instead. So feel free to skip it if you are not interested. I’m just recording it for my own benefit since it’s something I spent lots of time thinking about and learned a lesson from.</p>

<p>I planned the network setup while finishing the basement. I needed to figure out where everything goes so that I would know how to run the network cables. It’s easy to run cables through the basement before the drywall is installed, and relatively easy to go from the basement to the first floor. Luckily, most of my devices that need a wired connection will be in the basement or on the first floor. The second floor is mostly bedrooms which can be served through wifi from the first floor. So the arrangement is pretty flexible.</p>

<p>However, I noticed the flexibility pretty late. The physical location of devices and the network topology kept intertwined in my mind, and the physical location of devices also depends on the furniture arrangement. So there were too many things in my brain at the same time. It’s hard to work things out that way until I realized I should figure out the network topology and the physical locations separately, then figure out how to run the cables based on that.</p>

<p>So here is the lesson, maybe obvious but important: plan every detail carefully before you start the renovation project. For privacy reasons, I’ll leave the physical location of the devices out of this article.</p>

<h2 id="3-network-hardware-and-software">3. Network Hardware and Software</h2>

<h3 id="31-multiple-nics-vs-switch">3.1 Multiple NICs vs Switch</h3>

<p>Routers, switches and some PCs all have multiple Ethernet ports. But they are different. Most switches and consumer level routers have their ports connected to a single SoC. In this case, the ports are not physically separated: devices connected to different ports can talk to each other at layer 2 using MAC addresses. A typical firewall like iptables works at layer 3 and above, so it cannot filter traffic at this layer (even though tools like ebtables can). Managed switches and some routers can configure VLANs, which force packets that cross between VLANs to go through the router so that the firewall can be enforced. We will explore that later.</p>

<p>On the other hand, for some enterprise level routers and PCs with multiple network interface cards (NICs), every port is separate: the router has to route the traffic between NICs. It’s more secure but also needs more compute power. So devices like this are much more expensive than a consumer level router.</p>
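
<p>For readers who, like me, rarely look below layer 3: the VLANs mentioned above work by adding a 4 byte 802.1Q tag to Ethernet frames on tagged links, and the switch uses the VLAN ID in that tag to decide which ports may see the frame. The sketch below is only an illustration of the tag layout written with Python’s standard <code class="language-plaintext highlighter-rouge">struct</code> module. The function names are made up, and none of this is needed to configure OpenWrt:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tagged frame


def vlan_tag(vid, pcp=0, dei=0):
    """Build the 4 byte 802.1Q tag that sits after the source MAC address."""
    if vid not in range(1, 4095):
        raise ValueError("usable VLAN IDs are 1 to 4094")
    if pcp not in range(8) or dei not in (0, 1):
        raise ValueError("PCP is 3 bits and DEI is 1 bit")
    # TCI layout, high to low bits: PCP (3 bits), DEI (1 bit), VLAN ID (12 bits).
    tci = pcp * 2**13 + dei * 2**12 + vid
    return struct.pack("!HH", TPID_8021Q, tci)


def vlan_id(tag):
    """Read the VLAN ID back out of a 4 byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci % 2**12  # the low 12 bits


print(vlan_tag(20).hex())  # tag for VLAN 20 with default priority
</code></pre></div></div>

<p>An “untagged” port strips this tag before handing the frame to the device, so the terminal devices never need to know which VLAN they are in.</p>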

<h3 id="32-hardware-selection-approach">3.2 Hardware Selection Approach</h3>

<p>Currently all my terminal devices have Ethernet ports of 1Gbps at most. The incoming Internet is also 1Gbps. So I don’t feel the need to go crazy with 10Gbps routers since they are so much more expensive. Routers with 1Gbps ports are good enough and 2.5Gbps ports are nice to have. VLANs are also pretty mature so I don’t feel the need to buy devices with multiple NICs. Lastly, I want all the devices to be able to run an open source operating system, so I prefer routers with OpenWrt over managed switches.</p>

<p>I also want to re-use my existing network devices as much as possible. I already have a wireless OpenWrt router and an unmanaged switch. I feel they are powerful enough for my use cases so it would be great if I can still use them in the new network setup.</p>

<p>Even though I only have 1Gbps ports, I still want Cat 6 cables since it’s pretty hard to change the cables in the wall in the future. And the price difference is also small enough to be accepted.</p>

<h3 id="33-opnsensepfsense-vs-openwrt">3.3 OPNsense/pfSense vs OpenWrt</h3>

<p>Lots of self-hosting communities mention pfSense and OPNsense a lot but I’ve never tried them. I’m interested, but the supported devices are pretty expensive. And with my 1Gbps devices, I don’t think I can justify the price.</p>

<p>On the other hand, I’ve been using OpenWrt since I owned my first router 14 years ago. I’m pretty familiar with it even though I don’t use most of the advanced features. Even though it’s more popular for wireless routers, I find the feature set enough for my use case. Maybe I’ll try OPNsense in the future but I’m using OpenWrt this time.</p>

<h2 id="4-network-topology-design">4. Network Topology Design</h2>

<p>So based on the device group in section 1 and the discussion in section 3, here are the network devices needed for my setup:</p>

<ul>
  <li>TP-Link ER605 for the main router which acts as a gateway to the Internet. This is the only new device I need to buy.</li>
  <li>TP-Link TL-SG105 unmanaged switch.</li>
  <li>TP-Link Archer C7 v5 wireless router.</li>
</ul>

<p>Here is the network topology:</p>

<p><img src="/static/images/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs/network-topology.png" alt="network-topology" /></p>

<p>There is still an unused port on the main router which can be used for another Wifi access point, but I don’t really need it for now.</p>

<h2 id="5-implement-the-network-topology-with-openwrt">5. Implement the Network Topology with OpenWrt</h2>

<p>So we have the topology and all the hardware connected. How do we implement the isolation as discussed in section 1? In this section, I will document how to do it in OpenWrt.</p>

<h3 id="51-vlan-configuration-for-wired-devices">5.1 VLAN Configuration for Wired Devices</h3>

<p><em>Note: The OpenWrt version I’m using for this configuration is 24.10.4. I believe the interface for configuring VLAN has been changed a little bit: in v23.x, it’s in Network -&gt; Switch instead of the steps I’ll share below. But the logic should be very similar.</em></p>

<p>In order to configure a different VLAN for a router’s port, go to Network -&gt; Interfaces -&gt; Devices. There should be a default bridge device called something like <code class="language-plaintext highlighter-rouge">br-lan</code>. Click “Configure” and enable VLAN filtering. If you want to create a separate VLAN for a port, add a row with a different “VLAN ID”, make the target port “Untagged” and leave the other ports blank. For example, in my configuration, I created a separate VLAN for each of the ports:</p>

<p><img src="/static/images/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs/openwrt-vlan-config.png" alt="openwrt-vlan-config" /></p>

<p>After saving this, the devices list should show something like <code class="language-plaintext highlighter-rouge">br-lan.20</code>, <code class="language-plaintext highlighter-rouge">br-lan.30</code> and so on. And the type for those devices would be “VLAN (802.1q)”.</p>
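
<p>For reference, this kind of VLAN setup is stored as <code class="language-plaintext highlighter-rouge">bridge-vlan</code> sections in <code class="language-plaintext highlighter-rouge">/etc/config/network</code>. The small Python sketch below renders what I would expect those sections to look like. The port names <code class="language-plaintext highlighter-rouge">lan1</code> and <code class="language-plaintext highlighter-rouge">lan2</code> are assumptions (they depend on the device), the helper name is made up, and the exact text LuCI writes may differ slightly:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bridge_vlan_sections(bridge, port_vlans):
    """Render one bridge-vlan section per port, each port untagged in its own VLAN."""
    sections = []
    for port, vid in port_vlans.items():
        # The ":u*" suffix marks the port as untagged and this VLAN as its primary VLAN.
        sections.append(
            "config bridge-vlan\n"
            f"\toption device '{bridge}'\n"
            f"\toption vlan '{vid}'\n"
            f"\tlist ports '{port}:u*'\n"
        )
    return "\n".join(sections)


# Hypothetical port names; VLAN IDs 20 and 30 match the br-lan.20 and br-lan.30 devices.
print(bridge_vlan_sections("br-lan", {"lan1": 20, "lan2": 30}))
</code></pre></div></div>

<p>Comparing the generated text with the real file on the router is a quick way to check that the UI did what you meant, and having the config as text makes it easy to keep in version control.</p>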

<p>Now it’s time to create interfaces from the newly created VLAN devices. Go to Network -&gt; Interfaces, click on “Add new interface”, and select one of the newly created VLAN devices as the device. Configure the other fields, like the protocol, as you want. Make sure to use a different subnet for each interface and to assign each interface a (new) firewall zone, which we will talk more about in the next section.</p>

<p>At last, after verifying all the things are working, you can disable and optionally delete the <code class="language-plaintext highlighter-rouge">lan</code> interface (not the device) so that you can make sure the traffic is only going through the new VLAN interfaces.</p>

<h3 id="52-firewall-configuration">5.2 Firewall Configuration</h3>

<p>In order to create more firewall zones, go to Network -&gt; Firewall and create them there. I created a zone for each of the VLANs created above. For the untrusted devices that shouldn’t have Internet access, don’t assign any zones in the forwards. For the devices that shouldn’t access other zones, just assign <code class="language-plaintext highlighter-rouge">wan</code> in the forwards. Note that you can add more rules in the “Traffic Rules” tab to override the default configuration here. The screenshot below is my configuration based on the access rules in section 1 and the topology in section 4:</p>

<p><img src="/static/images/2025-12-27-Home-Network-Setup-with-OpenWrt-and-VLANs/openwrt-firewall-zones.png" alt="openwrt-firewall-zones" /></p>

<p>The wifi -&gt; wan forward is disabled by default with traffic rules added to access self host services.</p>
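
<p>The zone forwards are easier to reason about when written down as data. Here is a minimal Python sketch of the requirements from section 1. The zone names <code class="language-plaintext highlighter-rouge">selfhost</code>, <code class="language-plaintext highlighter-rouge">home</code> and <code class="language-plaintext highlighter-rouge">cameras</code> are hypothetical, it does not model the extra traffic rules, and it is not a copy of my actual configuration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical zone names. Each zone maps to the zones it may forward to.
FORWARDS = {
    "selfhost": {"wan", "cameras"},
    "home": {"wan", "selfhost", "cameras"},
    "cameras": set(),  # untrusted: no Internet and no access to other zones
}


def can_forward(src, dest):
    """True if the zone policy allows new connections from src to dest."""
    return dest in FORWARDS.get(src, set())


def forwarding_sections():
    """Render matching "config forwarding" sections for /etc/config/firewall."""
    sections = []
    for src in sorted(FORWARDS):
        for dest in sorted(FORWARDS[src]):
            sections.append(
                f"config forwarding\n\toption src '{src}'\n\toption dest '{dest}'\n"
            )
    return "\n".join(sections)


print(can_forward("home", "cameras"), can_forward("cameras", "wan"))
print(forwarding_sections())
</code></pre></div></div>

<p>Forwards are one directional: the cameras can still answer connections opened from the other zones because the firewall tracks established connections, but they cannot open new ones themselves.</p>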

<p>Be aware of “Intra zone forward”: based on our discussion in section 3.1, even if you reject the intra zone forward, the devices can still talk to each other if they are in the same VLAN, or connected to the same Wifi. So far, we’ve explored how to isolate the devices with wired connections. In the following sections, we will explore how to isolate the devices connected wirelessly.</p>

<h3 id="53-wireless-clients-isolation">5.3 Wireless Clients Isolation</h3>

<p><em>Note: from now on, all the OpenWrt configurations are based on version 23.05.0 since that’s what I’m using for the Wifi router.</em></p>

<p>It’s easy to isolate the clients connected to the same Wifi. Go to Network -&gt; Wireless, edit the specific wireless network, then click on “Advanced Settings” in the “Interface Configuration” section and enable “Isolate Clients”. The clients connected to this Wifi shouldn’t be able to see each other now.</p>

<p>However, if there are also devices connected to the same router by wire, the wireless clients and the wired clients can still see each other. In the following section, we’ll see how to prevent that.</p>

<h3 id="54-isolation-between-wired-and-wireless-clients">5.4 Isolation Between Wired and Wireless Clients</h3>

<p>In order to isolate wired and wireless clients, we need to create a new interface with a separate firewall zone and assign it to the Wifi. Here is how to do it:</p>

<p>First, go to Network -&gt; Interfaces -&gt; Devices, click “Add device configuration…”, and create a device with type “Bridge device”. There is no need to assign any bridge ports to it since we are going to use it for Wifi. Name it something like <code class="language-plaintext highlighter-rouge">br-wlan</code>.</p>

<p>Then in Network -&gt; Interfaces -&gt; Interfaces, create a new interface with the device we just created. Configure the protocol as desired and assign it to a new firewall zone. Next, in Network -&gt; Wireless, edit the Wifi and select this new interface as its network, so that wireless clients land on the new bridge instead of <code class="language-plaintext highlighter-rouge">br-lan</code>. Then you can configure the firewall zones like in section 5.2 to isolate the clients in different zones.</p>

<h2 id="6-conclusion">6. Conclusion</h2>

<p>For a long time, I didn’t feel good enough about my network setup since it is more open than I wanted, e.g. the mobile devices, TV box and self-host devices can see each other. I needed to configure complex firewall rule to protect self-host services and it’s not even enough with that. Now with the new setup, I feel much better. I know there is no 100% secure setup, especially messing them up with myself, there must be some places that’s not optimal, but I’m glad I can explore the areas I normally don’t have much opportunity otherwise. I believe it makes me understand the software and OS better, and more importantly, it is fun to set things like this in a home lab!</p>]]></content><author><name></name></author><category term="network" /><category term="OpenWrt" /><category term="Wifi" /><category term="VLAN" /><category term="security" /><summary type="html"><![CDATA[I moved to a new place recently. Using this as an opportunity, I re-configured the whole home network which I wanted to do long time ago. I put lots of thoughts into it. Even though I’m a software engineer for many years, I rarely touched the network layers below L3. So when researching and setting it up, I learned a lot on the way. This article shares the journey and the final setup I have.]]></summary></entry><entry><title type="html">Fix ZFS Linux Kernel Dependency on Arch Linux</title><link href="https://www.binwang.me/2025-12-17-Fix-ZFS-Linux-Kernel-Dependency-on-Arch-Linux.html" rel="alternate" type="text/html" title="Fix ZFS Linux Kernel Dependency on Arch Linux" /><published>2025-12-17T00:00:00-05:00</published><updated>2025-12-17T00:00:00-05:00</updated><id>https://www.binwang.me/Fix-ZFS-Linux-Kernel-Dependency-on-Arch-Linux</id><content type="html" xml:base="https://www.binwang.me/2025-12-17-Fix-ZFS-Linux-Kernel-Dependency-on-Arch-Linux.html"><![CDATA[<p><em>Update 2025-12-20: add section about using local repo.</em></p>

<h2 id="problem">Problem</h2>

<p>When updating an Arch Linux system, if you have some third-party repos added, the packages in them sometimes depend on an older package that’s not available in the official repos anymore. In such cases, pacman cannot upgrade the system unless you exclude the impacted package.</p>

<p>ZFS is an example of this. Since Arch Linux doesn’t ship ZFS packages in official repos, I added a third-party one <a href="https://github.com/archzfs/archzfs/wiki">archzfs</a>. However, ZFS support doesn’t always catch up with the newest kernel. When this happens, the upgrade will break.</p>

<p>There is another repo that’s supposed to host the matching kernel version. It can be added by the following section in <code class="language-plaintext highlighter-rouge">/etc/pacman.conf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[zfs-linux]
Server = http://kernels.archzfs.com/$repo/
</code></pre></div></div>

<p>However, it also often lags behind. See <a href="https://github.com/endreszabo/kernels.archzfs.com/issues/12">this GitHub issue</a> for the most recent example.</p>

<h2 id="manual-force-downgrade-packages">Manual Force Downgrade Packages</h2>

<p>Arch Linux has archives for old packages. There are command line tools like <a href="https://github.com/archlinux-downgrade/downgrade">downgrade</a> to install the packages from archives instead of from the repo. So we can install the desired version of dependencies with <code class="language-plaintext highlighter-rouge">downgrade</code>.</p>

<p>When installing a specific version with downgrade, if it breaks other packages, it will refuse to continue installing. You can resolve it by installing multiple packages at once in the dependency chain, for example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>downgrade linux linux-headers
</code></pre></div></div>

<p>It will ask you to choose a version for each package.</p>

<p>However, it will fail to install since it will break the dependency of the currently installed <code class="language-plaintext highlighter-rouge">zfs-linux</code> package. Even if you add <code class="language-plaintext highlighter-rouge">zfs-linux</code> to the downgrade list, the dependency check doesn’t take into account the version that will be installed.</p>

<p>One way to resolve it is to remove <code class="language-plaintext highlighter-rouge">zfs-linux</code> first, then run the command above to install the desired version of the linux packages, and then install the newer version of <code class="language-plaintext highlighter-rouge">zfs-linux</code> back.</p>

<p>When you run <code class="language-plaintext highlighter-rouge">pacman -Syu</code> again, you will still get an error like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error: failed to prepare transaction (could not satisfy dependencies)
:: installing linux (6.18.1.arch1-2) breaks dependency 'linux=6.17.9.arch1-1' required by zfs-linux
</code></pre></div></div>

<p>But it’s safe to ignore the kernel related packages now by using the <code class="language-plaintext highlighter-rouge">--ignore</code> flag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>pacman <span class="nt">-Syu</span> <span class="nt">--ignore</span> linux <span class="nt">--ignore</span> linux-headers
</code></pre></div></div>

<p><strong>Important</strong>: you may have noticed that we include the package <code class="language-plaintext highlighter-rouge">linux-headers</code> in the commands above, even though pacman doesn’t complain if we don’t do that. That’s because in Arch Linux, <code class="language-plaintext highlighter-rouge">linux-headers</code> doesn’t depend on a specific version of <code class="language-plaintext highlighter-rouge">linux</code>. However, if you have a version mismatch, it may break some dkms modules. So it’s better to always keep them in sync.</p>

<h2 id="use-local-pacman-repo">Use Local Pacman Repo</h2>

<p>The approach described above can resolve the dependency conflicts. However, it always feels very risky to manually remove an important package like <code class="language-plaintext highlighter-rouge">zfs-linux</code>, even if only temporarily. If we want to rely on pacman to resolve the dependencies without error, we can use a local pacman repo. Here is how to do that:</p>

<p>First, still run the downgrade command above to select the desired versions of the packages:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>downgrade linux linux-headers
</code></pre></div></div>

<p>It will download the packages to <code class="language-plaintext highlighter-rouge">/var/cache/pacman/pkg</code> even though it will not install them because of the failed dependency check.</p>

<p>Then we can copy the downloaded packages to another folder that will be used as a local pacman repo:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo mkdir</span> /var/pacman-local-repo
<span class="nb">sudo cp</span> /var/cache/pacman/pkg/&lt;packages&gt; /var/pacman-local-repo
</code></pre></div></div>

<p>Then make it a valid repo:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /var/pacman-local-repo
<span class="nb">sudo </span>repo-add local-repo.db.tar.gz <span class="k">*</span>.pkg.tar.zst
</code></pre></div></div>

<p>Finally, add the local repo to <code class="language-plaintext highlighter-rouge">/etc/pacman.conf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[local-repo]
SigLevel = Optional TrustAll
Server = file:///var/pacman-local-repo
</code></pre></div></div>

<p>Then just run <code class="language-plaintext highlighter-rouge">pacman -Syu</code> and it will find the matching dependency versions in <code class="language-plaintext highlighter-rouge">local-repo</code>.</p>

<p>Remember to comment out the section after the upgrade is done, since you don’t want the old packages always in the config.</p>]]></content><author><name></name></author><category term="Linux" /><category term="ZFS" /><category term="dependency" /><summary type="html"><![CDATA[Update 2025-12-20: add section about using local repo.]]></summary></entry><entry><title type="html">A Rust CLI Program, Use It with LLM and Convert It to Web UI</title><link href="https://www.binwang.me/2025-12-10-A-Rust-CLI-Program.html" rel="alternate" type="text/html" title="A Rust CLI Program, Use It with LLM and Convert It to Web UI" /><published>2025-12-10T00:00:00-05:00</published><updated>2025-12-10T00:00:00-05:00</updated><id>https://www.binwang.me/A-Rust-CLI-Program</id><content type="html" xml:base="https://www.binwang.me/2025-12-10-A-Rust-CLI-Program.html"><![CDATA[<p>In the <a href="/2025-10-28-My-First-Rust-Project.html">last blog post</a>, I talked about my first Rust project. Recently, I finished another Rust CLI project called <a href="https://github.com/wb14123/rhyme-checker">rhyme-checker</a> that I wanted to build a long time ago. I tried it as a Claude Skill and got really good results. I also built a project <a href="https://github.com/wb14123/clap-web-gen">clap-web-gen</a> to convert it (and be able to convert many other Rust CLI projects) to a web page running in the browser without any backend. This blog post is about the journey of both projects.</p>

<h2 id="a-cli-program-to-check-rhymes-for-chinese-poetry">A CLI Program to Check Rhymes for Chinese Poetry</h2>

<p>There is a special kind of traditional Chinese poetry called “词”. It literally means “word” or “lyrics”. It’s a kind of poetry that must follow strict rules. There are many titles. Each one has a strict meter, rhyme scheme and tonal pattern. The rules are strict because the poems originally came with music and could be sung in ancient times. But unfortunately the music has been lost and all that’s left are the lyrics.</p>

<p>I like to write such poetry because it’s really fun, and with the strict meters, it sounds beautiful by default. There are many of them on <a href="https://www.binwang.me/poetry">the poetry page</a>. But it’s hard to write because of the strict rules. I need to refer to the meters every time I write one. Sometimes there are just sentences coming out of my mind but I don’t know which title fits them the best. So many years ago, I tried to write a program to match the best title based on the text you give it. I wrote it in JavaScript. Partly because of that, and partly because I didn’t understand the meter rules very well at the time, the project quickly became a disaster and I eventually gave up. When I picked up Rust recently, I thought I could re-implement it with Rust as a CLI program. It’s almost pure algorithm without much IO, which could benefit from the performance of Rust.</p>

<p>The algorithm of this program is quite challenging, since the rules of the poetry titles are complex. For example, here are some constraints:</p>

<ul>
  <li>A title has a fixed number of sentences and each sentence has a fixed number of characters.</li>
  <li>Each Chinese character has a tone that can fall into two categories: 平 (Ping) or 仄 (Ze). In the meter rule, the character at each position must meet the requirement: it must be Ping, Ze, or either of them.</li>
  <li>Some characters at the end of sentences have requirements about the rhyme:
    <ul>
      <li>The rhyme also has tones of Ping and Ze. Characters with different tones are in different rhymes but can be in the same “rhyme group”. And the rules have requirements about whether the rhymes need to be in the same rhyme group, or must not be in the same rhyme group.</li>
      <li>There is not only one rhyme or rhyme group for a title: the rhyme and rhyme group can change from sentence to sentence.</li>
    </ul>
  </li>
</ul>
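<p>To make the tone and rhyme constraints above more concrete, here is a minimal sketch of how they could be modelled. It is not the actual code in rhyme-checker: the names <code class="language-plaintext highlighter-rouge">Tone</code>, <code class="language-plaintext highlighter-rouge">Slot</code>, <code class="language-plaintext highlighter-rouge">Rhyme</code>, <code class="language-plaintext highlighter-rouge">match_sentence</code> and <code class="language-plaintext highlighter-rouge">same_group</code> are made up for illustration, and the rhyme group is assumed to be a plain number.</p>

```rust
/// Tone category of a single character.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tone {
    Ping,
    Ze,
}

/// What a meter allows at one position (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Slot {
    Ping,
    Ze,
    Either,
}

impl Slot {
    /// Whether a character with the given tone is allowed at this position.
    fn accepts(self, tone: Tone) -> bool {
        match (self, tone) {
            (Slot::Either, _) => true,
            (Slot::Ping, Tone::Ping) => true,
            (Slot::Ze, Tone::Ze) => true,
            _ => false,
        }
    }
}

/// A rhyme is a rhyme group plus a tone (assumption: groups are numbered).
/// Two rhymes are the same rhyme only if both fields are equal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Rhyme {
    group: u32,
    tone: Tone,
}

/// Rhymes with different tones can still be in the same rhyme group.
fn same_group(a: Rhyme, b: Rhyme) -> bool {
    a.group == b.group
}

/// Counts the positions where the sentence satisfies the meter.
/// Returns `None` when the number of characters doesn't match.
fn match_sentence(meter: &[Slot], tones: &[Tone]) -> Option<usize> {
    if meter.len() != tones.len() {
        return None;
    }
    Some(
        meter
            .iter()
            .zip(tones)
            .filter(|(slot, tone)| slot.accepts(**tone))
            .count(),
    )
}
```

<p>With a meter of Ping, Ze, Either, a sentence with tones Ping, Ze, Ze matches all three positions, while Ze, Ze, Ping only matches two.</p>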

<p>So when searching for the best matching poetry title, it needs to loop over all the possible rhymes based on the input text, and try to put each line into different positions in different titles and see which one matches the best. I ended up using algorithms like depth-first search and dynamic programming for the searching. It may be the first time I have used dynamic programming since the programming contests in university. It feels great to use a language where you know the cost of each operation, like whether to create a new object for a variable or just use a reference to avoid copying and memory allocation, and so on.</p>

<p>The finished code is on <a href="https://github.com/wb14123/rhyme-checker">GitHub</a>. For a search with reasonably long input text, the time spent is consistently about 80ms, and the memory usage peaks at about 5MB. I’m very happy with the result.</p>
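<p>As an illustration of the dynamic programming part, here is a small sketch of placing input lines into the positions of one title while keeping their order. It is only a guess at the general shape of such a search, not the algorithm used in rhyme-checker: the function name <code class="language-plaintext highlighter-rouge">best_alignment</code> and the precomputed <code class="language-plaintext highlighter-rouge">score</code> table are assumptions.</p>

```rust
/// Best total score when every input line is placed at its own position
/// of a title, keeping the lines in order (hypothetical helper).
///
/// `score[i][j]` is the score of putting input line `i` at position `j`;
/// every row must have at least `positions` entries.
/// Returns `None` when there are more lines than positions.
fn best_alignment(score: &[Vec<u32>], positions: usize) -> Option<u32> {
    let lines = score.len();
    // dp[i][j]: best total for the first i lines within the first j positions
    let mut dp = vec![vec![None; positions + 1]; lines + 1];
    for j in 0..=positions {
        dp[0][j] = Some(0);
    }
    for i in 1..=lines {
        for j in i..=positions {
            // Either leave position j-1 empty...
            let skip = dp[i][j - 1];
            // ...or put line i-1 there.
            let place = dp[i - 1][j - 1].map(|s| s + score[i - 1][j - 1]);
            // `None` compares lower than any `Some`, so valid totals win.
            dp[i][j] = skip.max(place);
        }
    }
    dp[lines][positions]
}
```

<p>The table has one cell per (line, position) pair, so the cost is proportional to the number of lines times the number of positions, instead of trying every possible placement.</p>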

<h2 id="use-the-cli-program-as-a-claude-skill">Use the CLI Program as a Claude Skill</h2>

<p>I tried to use AI for traditional Chinese literature a long time ago. I trained an RNN model about 10 years ago to <a href="https://ai.binwang.me/">write Chinese couplets</a> and it gained a lot of interest at the time. With every new language model starting from GPT-3, I tried to use them to write traditional Chinese poems. They were very bad at first, but have been getting better and better, especially since DeepSeek was released. But they still make lots of mistakes, especially about meters and rhymes, which play a very important part in traditional Chinese poetry. Even when letting the LLM review a poem, such mistakes happen all the time.</p>

<p>I tried to train some LoRA with my own dataset. I also thought about using reinforcement learning with the meter matching as the score/cost. But because of the cost of training, I gave up both in the end.</p>

<p>However, there is a much cheaper option: let the existing LLM use external tools and use its reasoning ability to figure everything out. With the CLI written, which can query tones, rhymes, title rules and matching scores, we have such a tool. Claude Skill makes such tool usage very simple: just a markdown file to describe the tool and how to use it, along with the binary and that’s all it needs. So I created <a href="https://github.com/wb14123/rhyme-checker/blob/master/.claude/skills/rhyme-checker/SKILL.md">a markdown file</a> and tried it with Claude Code, and the result was really good! It can check the poem it wrote and fix it until it passes the checker. Sometimes it needs more time and can consume lots of tokens, but most of the time it can finish with a perfect matching score, while maintaining reasonably good content. Later I found out the Claude web UI also lets you upload a zip to create a new skill, so I did that and now I can use it anywhere.</p>

<p>The result makes me both excited and surprised. Traditional Chinese poetry is something that very few Chinese people can write nowadays. But maybe it’s just a lack of interest which is not a problem for LLMs. After I created the skill, when I finish a conversation with Claude, I sometimes let it write a poem to summarize the conversation, and it’s really entertaining.</p>

<h2 id="convert-rust-cli-to-web-ui">Convert Rust CLI to Web UI</h2>

<p>I really like the CLI program I wrote. But sometimes I need to run it on mobile phones. So I wrote a program to convert any Rust CLI program to a web UI, provided the program can be compiled to WASM and uses <a href="https://docs.rs/clap/latest/clap/">clap</a> for CLI args parsing. The code is also on <a href="https://github.com/wb14123/clap-web-gen">GitHub</a>. The usage is very simple: you just add the dependencies, put the main logic into a function that takes a clap structure as input, and add a macro onto it. Oh, you also need to replace all the <code class="language-plaintext highlighter-rouge">print!</code> and <code class="language-plaintext highlighter-rouge">println!</code> since WASM cannot handle the stdout.</p>

<p>This is a fairly complex program as well since it uses macros and also needs to parse the structure and map them into HTML elements. But the complexity is different from the CLI tool I wrote: for the CLI tool, it’s mostly the algorithm, but for this one, it’s just more tedious once the approach was figured out. So I used Claude Code heavily in this project. I can check the result very easily and it runs purely on the client side, so there is less concern about not writing all the code by myself. It saved me lots of time both in the prototype and the implementation.</p>

<p>With my interest in Rust, I believe I will write more CLI programs and this tool will make it easier for me to run them everywhere and share them with other people. I’m looking forward to it!</p>]]></content><author><name></name></author><category term="Rust" /><category term="web" /><category term="CLI" /><category term="WASM" /><category term="LLM" /><category term="AI" /><category term="Claude" /><summary type="html"><![CDATA[In the last blog post, I talked about my first Rust project. Recently, I finished another Rust CLI project called rhyme-checker that I wanted to build a long time ago. I tried it as a Claude Skill and got really good results. I also built a project clap-web-gen to convert it (and be able to convert many other Rust CLI projects) to a web page running in the browser without any backend. This blog post is about the journey of both projects.]]></summary></entry><entry><title type="html">My First Rust Project</title><link href="https://www.binwang.me/2025-10-28-My-First-Rust-Project.html" rel="alternate" type="text/html" title="My First Rust Project" /><published>2025-10-28T00:00:00-04:00</published><updated>2025-10-28T00:00:00-04:00</updated><id>https://www.binwang.me/My-First-Rust-Project</id><content type="html" xml:base="https://www.binwang.me/2025-10-28-My-First-Rust-Project.html"><![CDATA[<h2 id="less-active-scala-community">Less Active Scala Community</h2>

<p>Rust is getting more and more popular. It targets low-level, high-performance programming with its zero-cost abstractions. For example, it has almost no runtime and no GC. I have learned C/C++ and used C for programming contests back in university, but in most of my professional career, I use higher level languages like Java, Scala, Python, Go, and so on. All of them have GC built in, so I rarely need to think about memory allocation. For side projects, I have a few criteria to consider when selecting a language:</p>

<ul>
  <li>The language itself: powerful and flexible syntax. Better to have a type system so more errors can be caught at compile time.</li>
  <li>Good tools to debug and profile the program.</li>
  <li>Mature libraries.</li>
  <li>Active communities.</li>
</ul>

<p>JVM is a good platform that has wonderful tools and lots of libraries thanks to the popularity of Java. But I really don’t like Java the language itself. Thankfully, there are lots of other languages available on JVM, including the one I like most: Scala. Scala was popular when Spark was popular. However, it has been trending down for many years.</p>

<p>With fewer and fewer users, the community is less and less active. It’s a vicious cycle. Many widely used libraries are only maintained by a few people and are not very active anymore. The popular paradigm changed from the actor model with libraries like Akka to functional programming frameworks like Cats and ZIO. I like Cats a lot, but I must admit not everyone can catch the paradigm change. Even for Cats Effect itself, the migration from 2.x to 3.x changed lots of things, and even the most basic IO type doesn’t have documentation yet. Projects written in Scala can feel like completely different languages based on the paradigm and libraries they use. Divisions like this make the already small community even smaller, not to mention some drama between Cats and ZIO.</p>

<p>So even though I still like the language, and I’ve written lots of my personal projects with it, I started to think twice when starting new projects. Sometimes I find the smaller community starts to hurt productivity when I run into something strange, and the complexity of the language makes it hard to read the library code in order to understand what’s going on internally. With the <a href="https://wiki.openjdk.org/display/loom/Main">Loom project</a>, I feel like I may write my own light wrappers around virtual threads with functional programming instead of using Cats Effect. But that’s another topic.</p>

<p>The development of LLM makes the problem worse. With a less active community and fewer documents, LLM has less data to train on the language, which makes it less good at writing code with Scala compared to other languages. Even though I don’t think LLM can write code all by itself yet, it’s still a good tool, and it’s a disadvantage if it doesn’t work well.</p>

<p>Nevertheless, I’ll still use Scala in my projects for the foreseeable future. But I want to learn and see something new.</p>

<h2 id="the-failed-attempt-at-learning-rust">The Failed Attempt at Learning Rust</h2>

<p>I was interested in Rust in its early days. I started to read the Rust book around 2016 and tried to write some Rust code in the following 2 years, especially after I joined a database company. I wanted to write some database related things by myself. I wanted the performance to be as good as possible, so I wanted a language without GC. Rust felt like a natural choice at the time. So I attempted to implement an <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">LSM tree</a> and started with a <a href="https://en.wikipedia.org/wiki/Skip_list">skip list</a>, which is a popular data structure to implement the memtable part of an LSM tree. It’s like a linked list but more complex. Oh boy, was I wrong to try that for my first Rust project. Rust’s pointers are really bad at self-referential structures if you don’t want to use <code class="language-plaintext highlighter-rouge">unsafe</code>. Writing a linked list with Rust remains a hard problem nowadays. It has almost become a meme for new Rust learners, very much like the joke about how to exit Vim. So it’s not a surprise that I failed miserably and eventually gave up. My interest in databases also shifted from high performance data structures to the correctness of distributed systems, so I didn’t look back at Rust for a long time. Then recently, I wanted to write some database related things again. With more concerns about Scala and with Rust being more and more popular, I decided I must learn it.</p>

<h2 id="the-first-project">The First Project</h2>

<p>Learning from the first failed try, my goal this time is to implement something easy enough that doesn’t need to handle low level memory structures. I just want to get a sense of how Rust feels when just writing some “regular” programs. In order to keep my interest, the project needs to be fun. After thinking through a few candidates, I ended up writing a program to talk with multiple customizable LLM bots at the same time: you can create the bot profile and add them to a chat room. When chatting, the LLM decides which bot replies next and then uses that bot’s profile to reply. The code is in my GitHub repo <a href="https://github.com/wb14123/v-world-cli">v-world-cli</a>.</p>

<p>This project meets all the requirements above. While being lots of fun, it lets me explore things like reading files, network requests, async programming, streaming and so on. If I want, I can continue to explore things like database integration and web server. I just implemented the minimal feature set. Lots of the features like saving the conversation history, room profile, agent memory and so on are not implemented, and probably never will be.</p>

<h2 id="learn-with-the-help-of-llm">Learn with the Help of LLM</h2>

<p>LLM coding agents make the learning process much faster, in mainly two ways. First, you can implement some non-important components of the projects with LLM to see the results faster and refine them later manually. This makes the feedback loop much shorter. It’s great to see some results when implementing something instead of waiting until every component fits together. For example, in my case, I don’t really care about the UI part, so I just let the LLM create the module based on the interface I’ve already defined. Similarly, after I defined the interface, I also let it write the code for integrating with OpenAI compatible APIs so that I don’t need to create all the API structures and look for the documentation. This way I can chat with the bots much earlier and be motivated by the result.</p>

<p>The second is the more traditional usage of the LLM coding tool: letting it fix the problems in the code. Rust is notorious for the difficulty of ownership reasoning and its many pointer types. With LLM’s help, I can come up with the fix much faster. It doesn’t always come up with the best answer, but with it pointing in a direction, I have a starting point to research instead of not knowing where to look. For example, there is a <code class="language-plaintext highlighter-rouge">Pin</code> type sometimes needed with async functions. Without LLM’s help, I think I would have spent much longer realizing I need to use it to fix some compile errors.</p>

<h2 id="the-beauty-and-ugly-of-rust">The Beauty and Ugly of Rust</h2>

<p>The beauty of Rust is that with the help of its type system, the language can feel like a GC language since you don’t need to explicitly free the memory. It binds the lifetime of the variables to a scope and provides different types to manage it when you want it to outlive the enclosing scope.</p>
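<p>A small sketch can show what binding the lifetime to a scope looks like in practice. The <code class="language-plaintext highlighter-rouge">Tracked</code> type and the <code class="language-plaintext highlighter-rouge">drop_order</code> function are made up for this illustration: each value writes its name into a shared log when Rust frees it, so we can observe when that happens without calling any free function ourselves.</p>

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// A value that records its name in a shared log when it is dropped.
struct Tracked {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Takes ownership of the value; it is freed when this function returns.
fn consume(_value: Tracked) {}

/// Returns the names in the order in which the values were freed
/// before the end of this function.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let _outer = Tracked { name: "outer", log: Rc::clone(&log) };
    {
        let _inner = Tracked { name: "inner", log: Rc::clone(&log) };
        // `_inner` is freed here, at the end of its block.
    }
    let moved = Tracked { name: "moved", log: Rc::clone(&log) };
    consume(moved); // Ownership moves, so the value is freed inside `consume`.
    let snapshot = log.borrow().clone();
    snapshot
    // `_outer` is freed only now, when the function scope ends.
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">drop_order()</code> gives "inner" followed by "moved"; "outer" is still alive at the time the snapshot is taken.</p>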

<p>The ugly of Rust also usually comes from it. Sometimes you just feel the types are just in your way of implementing something. You need to know too many implementation details in the libraries in order to fight the compile errors. The abstraction of the library is leaking. For example, sometimes you need to use the <code class="language-plaintext highlighter-rouge">Pin</code> type because of some implementation details of <code class="language-plaintext highlighter-rouge">async</code>.</p>

<p>Other than the ownership system, some syntax of Rust feels ugly compared to languages like Scala. For example, in Scala, in operations like <code class="language-plaintext highlighter-rouge">map</code> and <code class="language-plaintext highlighter-rouge">filter</code>, if you only need the variables once, you can use <code class="language-plaintext highlighter-rouge">_</code> instead of defining the parameter list in the anonymous function, e.g.,</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">arr</span><span class="o">.</span><span class="py">map</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="c1">// add 1 to all the elements</span>
</code></pre></div></div>

<p>With Rust, you need to write the parameters explicitly:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">arr</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="n">x</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>
<p>Or if the type doesn’t implement <code class="language-plaintext highlighter-rouge">map</code> and you need to use <code class="language-plaintext highlighter-rouge">iter</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">arr</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="n">x</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="nf">.collect</span><span class="p">::</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">_</span><span class="o">&gt;&gt;</span><span class="p">();</span>
</code></pre></div></div>

<p>This is more obvious with <code class="language-plaintext highlighter-rouge">match</code>. In Scala, you don’t need to use the <code class="language-plaintext highlighter-rouge">match</code> keyword all the time:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">results</span><span class="o">.</span><span class="py">map</span> <span class="o">{</span>
  <span class="k">case</span> <span class="nc">Some</span><span class="o">(</span><span class="n">r</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="s">"I have a result"</span>
  <span class="k">case</span> <span class="nc">None</span> <span class="k">=&gt;</span> <span class="s">"empty result"</span>
<span class="o">}</span>
</code></pre></div></div>

<p>But with Rust:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="k">match</span> <span class="n">x</span> <span class="p">{</span>
  <span class="nf">Some</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="s">"I have a result"</span><span class="p">,</span>
  <span class="nb">None</span> <span class="k">=&gt;</span> <span class="s">"empty result"</span><span class="p">,</span>
<span class="p">})</span>
</code></pre></div></div>

<p>With that said, the syntax of Rust is already much better than lots of other languages, so it’s really not a big deal.</p>

<h2 id="different-pointer-types">Different Pointer Types</h2>

<p>I’m still learning the best practice of using different pointer types at the right time, but here are the ones I used in my first project:</p>

<p><code class="language-plaintext highlighter-rouge">&amp;</code> for passing a reference without changing ownership. Use this when possible since it has the least overhead and is the simplest. <code class="language-plaintext highlighter-rouge">&amp;mut</code> is its mutable version.</p>

<p>But sometimes the value needs to live beyond the current scope, especially when using <code class="language-plaintext highlighter-rouge">async</code> with <code class="language-plaintext highlighter-rouge">spawn</code>, which doesn’t guarantee that the new task finishes before the current scope ends. In this case, <code class="language-plaintext highlighter-rouge">Arc</code> is very useful. This type tracks the reference count of the value and frees it when the count drops to zero, very much like GC in other languages, but the overhead is much lower because the memory is freed as soon as the last reference goes out of scope, instead of having a separate process to scan all the variables. The value behind an <code class="language-plaintext highlighter-rouge">Arc</code> is read only. If you need a writable structure, then use it with <code class="language-plaintext highlighter-rouge">RwLock</code> like <code class="language-plaintext highlighter-rouge">Arc&lt;RwLock&lt;T&gt;&gt;</code>.</p>

<p>Finally, <code class="language-plaintext highlighter-rouge">Box</code> is for values whose size isn’t known at compile time. For example, the most common use case is the error part in <code class="language-plaintext highlighter-rouge">Result</code>: <code class="language-plaintext highlighter-rouge">Result&lt;T, Box&lt;dyn Error&gt;&gt;</code>.</p>
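<p>The pointer types above can be put side by side in a short sketch. The function names are made up for illustration, and it uses <code class="language-plaintext highlighter-rouge">std::thread::spawn</code> as a stand-in for an async <code class="language-plaintext highlighter-rouge">spawn</code> so that it only needs the standard library. The ownership question is the same in both cases: the spawned work may outlive the scope that created it.</p>

```rust
use std::error::Error;
use std::sync::{Arc, RwLock};
use std::thread;

/// `&` borrows the values without taking ownership.
fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

/// `&mut` borrows with permission to modify.
fn push_twice(values: &mut Vec<i32>, x: i32) {
    values.push(x);
    values.push(x);
}

/// `Arc<RwLock<T>>` shares writable state with threads that are not tied
/// to the scope that created them.
fn count_in_threads(workers: usize) -> usize {
    let counter = Arc::new(RwLock::new(0usize));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own handle; cloning only bumps the count.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.write().unwrap() += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *counter.read().unwrap();
    result
}

/// `Box<dyn Error>` lets one function return different error types.
fn parse_positive(text: &str) -> Result<i32, Box<dyn Error>> {
    let n: i32 = text.trim().parse()?; // `?` boxes the ParseIntError
    if n <= 0 {
        return Err("expected a positive number".into());
    }
    Ok(n)
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">count_in_threads(4)</code> returns 4 once all threads are joined, and <code class="language-plaintext highlighter-rouge">parse_positive</code> returns an error for both "abc" and "-1" even though the two errors have different concrete types.</p>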

<h2 id="whats-next">What’s Next</h2>

<p>I want to explore macros in Rust to see if I can implement an easy to use Raft library. Then maybe implement some toy distributed file system based on it.</p>]]></content><author><name></name></author><category term="Rust" /><category term="LLM" /><category term="async" /><summary type="html"><![CDATA[Less Active Scala Community]]></summary></entry><entry><title type="html">Download Message Images from Seesaw</title><link href="https://www.binwang.me/2025-10-06-Download-Message-Images-from-Seesaw.html" rel="alternate" type="text/html" title="Download Message Images from Seesaw" /><published>2025-10-06T00:00:00-04:00</published><updated>2025-10-06T00:00:00-04:00</updated><id>https://www.binwang.me/Download-Message-Images-from-Seesaw</id><content type="html" xml:base="https://www.binwang.me/2025-10-06-Download-Message-Images-from-Seesaw.html"><![CDATA[<p><a href="https://seesaw.com/">Seesaw</a> is a learning experience platform for elementary education. My daughter’s daycare uses it for communication with us. They also took pictures of her and send them through Seesaw every day. When my daughter transfers to a new daycare, I want to download all the photos before my account is deactivated. However, with its message history download tool, it can only download the messages as a PDF, and only has the first image’s thumbnail if a message has multiple images. So I wrote a tool to download all the raw images. If anyone else needs it, <a href="https://github.com/wb14123/seesaw-message-images-downloader">here</a> is the repo, which has all the source code and documentation.</p>

<p>The tool I wrote is just some JavaScript code to run in the browser instead of an actual tool that takes some parameters and automatically downloads all the images. It’s supposed to be thrown away after this one-time usage so I didn’t put too much energy into making it clean and robust. Claude Code is good for writing things that don’t need much maintenance. But it still needs a lot of guidance in this case because of the messy HTML it needs to parse. It took several iterations before it could find the URLs of the raw images.</p>

<p>When running the code, it clicks on each message to open the image viewer, then clicks the next button to loop over the images to get the URL. I let it sleep 5 seconds after the clicks to wait for the elements to be fully loaded. So all the past photos are shown like slides when I run the code. It’s such a special feeling to watch it running. From the early photos where she always has watery eyes because of crying, to eating lunch by herself, playing on the playground, and making all kinds of art. It’s hard to believe how much a toddler can grow in just over half a year. And it’s not only the kid who is growing. It feels like I also re-experienced the feelings when I was a kid, and learned to conquer those feelings again, and grow with her in the process.</p>

<p>Moving to a new place, my daughter goes to a new daycare, which has a pretty different principle and practice than the old one. Both she and I are adapting again. But I have more confidence because of the experience, and I believe she does, too.</p>]]></content><author><name></name></author><category term="life" /><category term="seesaw" /><category term="javascript" /><summary type="html"><![CDATA[Seesaw is a learning experience platform for elementary education. My daughter’s daycare uses it for communication with us. They also took pictures of her and send them through Seesaw every day. When my daughter transfers to a new daycare, I want to download all the photos before my account is deactivated. However, with its message history download tool, it can only download the messages as a PDF, and only has the first image’s thumbnail if a message has multiple images. So I wrote a tool to download all the raw images. If anyone else needs it, here is the repo, which has all the source code and documentation.]]></summary></entry><entry><title type="html">My Workflow to Review Articles with LLMs</title><link href="https://www.binwang.me/2025-08-15-My-Workflow-to-Review-Articles-with-LLMs.html" rel="alternate" type="text/html" title="My Workflow to Review Articles with LLMs" /><published>2025-08-15T00:00:00-04:00</published><updated>2025-08-15T00:00:00-04:00</updated><id>https://www.binwang.me/My-Workflow-to-Review-Articles-with-LLMs</id><content type="html" xml:base="https://www.binwang.me/2025-08-15-My-Workflow-to-Review-Articles-with-LLMs.html"><![CDATA[<p>Since LLMs started to get popular, I find one of their strengths is to review things like code and writings. The strength is related to its weakness: it can hallucinate and just make up things but sound confident. However, when it reviews your own writings, you know the subject best, so if the LLMs are making up things, you’ll catch that. 
After trying Claude Code at work, I found it useful and subscribed to the Pro version for personal use. With its CLI interface, it’s really easy to write scripts to incorporate it into my workflow. In this article, I’ll talk about how I use it to review my blog posts. It’s very straightforward to use LLMs since you just interact with them using natural language, but I think there are some details worth sharing.</p>

<p><em>As a side note, I created a new category “Machine Learning Application” and moved some of my old posts into this category, since I think just using LLMs with prompts wouldn’t count as “Machine Learning.”</em></p>

<h2 id="correctness-review">Correctness Review</h2>

<p>I use a multi-round review process.</p>

<p>First, check correctness. If there is something technically wrong, a large part of the article will likely need to be reworked. Therefore, it’s a waste of work to focus on minor details before addressing the fundamental issues.</p>

<p>The correctness check covers areas such as mathematics, algorithms, system design, and facts. I only want the LLM to tell me what is wrong so I can fix it myself instead of letting it do the fix for me. The prompt I use is on <a href="https://github.com/wb14123/blog/blob/master/review/correctness-review.md">my GitHub</a>. It may continue to evolve, but I’ll copy what I have for now:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Guide for Correctness Check</span>
<span class="p">
*</span> Check the technical correctness thoroughly. Think about all the scenarios.
<span class="p">*</span> The technical correctness includes but not limited to these things:
<span class="p">  *</span> Code example and syntax.
<span class="p">  *</span> Algorithm descriptions.
<span class="p">  *</span> Technical concepts and definitions.
<span class="p">  *</span> System architecture designs and claims.
<span class="p">  *</span> Mathematical formulas.
<span class="p">*</span> Check facts in the articles. Search the internet if possible. Use reliable sources like reputable news organizations, official websites of the software mentioned and so on.
<span class="p">*</span> Check the terminologies used in the articles are correct and used in a correct way.
<span class="p">*</span> Check if there is confusing, not so clear description in the article.
<span class="p">*</span> Break down into multiple sections if needed, and discuss them one by one with the user.
<span class="p">*</span> All the links in the article without host name are paths under the domain <span class="sb">`https://binwang.me`</span>. You can fetch remote web pages if needed.
<span class="p">*</span> Ignore grammar errors and the checks mentioned in @grammar-review.md.
<span class="p">*</span> If there are correctness problems, respond with the problems in the chat instead of modifying the article files themselves.
</code></pre></div></div>

<p>I save the prompt into a file. One nice thing about the Claude Code CLI is that you can use <code class="language-plaintext highlighter-rouge">@</code> to refer to local files. So I wrote a script to invoke it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">set</span> <span class="nt">-x</span>

<span class="nv">file</span><span class="o">=</span><span class="nv">$1</span>

claude <span class="s2">"Review @</span><span class="nv">$file</span><span class="s2"> based on the review guide @correctness-review.md"</span><span class="nb">.</span>
</code></pre></div></div>

<p>It may seem excessive to create a script for just one line of code. But it saves me from typing the same prompt repeatedly. It also makes it easier to swap <code class="language-plaintext highlighter-rouge">claude</code> for some other tool in the future if needed.</p>

<h2 id="phrasing-review">Phrasing Review</h2>

<p>The next stage reviews phrasing and sentence flow. It looks for issues such as awkward phrasing and unnatural sentences. This stage focuses on the text and wording itself instead of the logic or ideas. I tried to let it review the structure as well, but its suggestions were so extensive that accepting them would make the article feel like it was written by someone else. I find reviewing the text itself strikes a good balance. Again, in this stage, I only let LLMs give me suggestions instead of editing the article by themselves. The prompt is on <a href="https://github.com/wb14123/blog/blob/master/review/phrasing-review.md">my GitHub</a> too:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Guide for Phrasing and Sentence Structure Review</span>
<span class="p">
*</span> Review the article for awkward phrasing, unnatural sentence flow, and syntax that sounds weird or unsmooth to native speakers.
<span class="p">*</span> Focus on sentence structure, word choice, and natural language flow rather than grammar rules (checked in @grammar-review.md) or technical correctness (checked in @correctness-review.md).
<span class="p">*</span> Make sure phrases and expressions sound natural and idiomatic in English.
<span class="p">*</span> DO NOT change the file.
<span class="p">*</span> Discuss every found issue with the user:
<span class="p">  *</span> Describe the issue with reference to the text.
<span class="p">  *</span> Give suggestions for the fix.
<span class="p">  *</span> Discuss 3-5 issues at a time if there are too many issues.
</code></pre></div></div>

<p>And the prompt is wrapped in a similar script as before.</p>

<p>Sometimes it still flags lots of grammar errors even though I tell it not to. If that happens, it may help to run the grammar review stage first and then come back.</p>

<h2 id="grammar-review">Grammar Review</h2>

<p>Finally, I let LLMs review the grammar, spelling and other typos. In this stage, I trust the LLM enough to let it actually fix the issues. But instead of editing the article in place, I let it create a new file. In theory, I could <code class="language-plaintext highlighter-rouge">git add</code> or <code class="language-plaintext highlighter-rouge">git commit</code> the files first so that I can review the diff afterwards, but that step is easy to forget, so creating a new file is the safest way.</p>

<p>The prompt is also on <a href="https://github.com/wb14123/blog/blob/master/review/grammar-review.md">my GitHub</a>, but I find something like “Fix the typos in [article]” is good enough most of the time.</p>

<p>Again, a similar script is created to wrap the prompt. The difference is I use vimdiff to open the diff at the end of the script.</p>

<h2 id="diff-tool">Diff Tool</h2>

<p>I use diff to review the changes the LLM made for the grammar review. A few tweaks make that easier.</p>

<p>The first thing is to make it highlight the diff at word level. There is a plugin <a href="https://github.com/rickhowe/diffchar.vim">diffchar</a> to do that.</p>

<p>Then I changed some default colors for vim highlight:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>highlight DiffAdd    ctermbg=22  ctermfg=white guibg=#003300 guifg=white
highlight DiffChange ctermbg=16 ctermfg=white guibg=#001c65 guifg=white
highlight DiffDelete ctermbg=52  ctermfg=red   guibg=#330000 guifg=#ff6666
highlight DiffText   ctermbg=53  ctermfg=white guibg=#330033 guifg=white
</code></pre></div></div>

<p>Also turn on line wrap and turn off read-only mode when Vim starts in diff mode (useful for <code class="language-plaintext highlighter-rouge">git difftool</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>autocmd VimEnter * if &amp;diff
autocmd VimEnter *   windo set wrap
autocmd VimEnter *   windo set noreadonly
autocmd VimEnter * endif
</code></pre></div></div>

<h2 id="fix-all-the-typos-in-the-past">Fix All the Typos in the Past</h2>

<p>Since I find Claude does a good job fixing the grammar in the articles, I wrote a script to fix all the issues in my past blog posts:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">set</span> <span class="nt">-e</span>

<span class="nv">START_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>

<span class="k">for </span>f <span class="k">in</span> <span class="sb">`</span><span class="nb">ls</span> <span class="nt">-r</span> jekyll/_posts<span class="sb">`</span> <span class="p">;</span> <span class="k">do
	if</span> <span class="o">[</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$START_FILE</span><span class="s2">"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="o">[[</span> <span class="s2">"</span><span class="nv">$f</span><span class="s2">"</span> <span class="o">&gt;</span> <span class="s2">"</span><span class="nv">$START_FILE</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
		</span><span class="nb">echo</span> <span class="s2">"Skip </span><span class="nv">$f</span><span class="s2"> since it's not after </span><span class="nv">$START_FILE</span><span class="s2">"</span>
		<span class="k">continue
	fi

	</span><span class="nv">p</span><span class="o">=</span><span class="s2">"jekyll/_posts/</span><span class="nv">$f</span><span class="s2">"</span>
	<span class="nb">echo</span> <span class="s2">"Fixing </span><span class="nv">$f</span><span class="s2"> ..."</span>
	claude <span class="nt">-p</span> <span class="s2">"Fix typos in @</span><span class="nv">$p</span><span class="s2"> . Never change the file name, links in the article. Skip and exit if it's not an English article."</span>
<span class="k">done</span>
</code></pre></div></div>

<p>You may notice that I don’t ask Claude Code to figure out all the typos in one session, but create a separate session for each article. That’s because Claude will take shortcuts and not actually do a good job if I ask it to fix everything at once. Be aware that fixing the articles one by one uses lots of tokens. I just leave it running overnight for a few days so that the quota recovers during the daytime.</p>

<p>If there are many articles, the quota will be used up before all of them are fixed. So the script takes a parameter that tells it where to resume.</p>

<p>This is a one-liner to review the changed files with <code class="language-plaintext highlighter-rouge">git difftool</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">f</span><span class="o">=</span><span class="sb">`</span>git diff <span class="nt">--name-only</span> | <span class="nb">tail</span> <span class="nt">-1</span><span class="sb">`</span> <span class="p">;</span> git difftool <span class="nv">$f</span> <span class="o">&amp;&amp;</span> git add <span class="nv">$f</span>
</code></pre></div></div>

<p>It will open a file that’s not staged yet with the configured git difftool (vimdiff in my case). Once you review it and close the difftool, it will use <code class="language-plaintext highlighter-rouge">git add</code> to stage it so the next file opened will be a different one.</p>

<p>I think it really did a good job fixing my articles. <a href="https://github.com/wb14123/blog/commit/05af92e6efb310f5600d6704bb8be87ea9d0d893">Here</a> is one example of the running batch.</p>

<h2 id="mental-model">Mental Model</h2>

<p>I want to emphasize the mental model I use when having LLMs review my articles: I treat them as the last line of defense. I only have the LLM review an article once I think it’s good myself and want an additional pair of eyes on it. I only let it fix grammar by itself, and even then I review the result so that I can learn from the errors.</p>]]></content><author><name></name></author><category term="AI" /><category term="Machine Learning" /><category term="LLM" /><category term="blog" /><summary type="html"><![CDATA[Since LLMs started to get popular, I find one of their strengths is to review things like code and writings. The strength is related to its weakness: it can hallucinate and just make up things but sound confident. However, when it reviews your own writings, you know the subject best, so if the LLMs are making up things, you’ll catch that. After trying Claude Code at work, I found it useful and subscribed to the Pro version for personal use. With its CLI interface, it’s really easy to write scripts to incorporate it into my workflow. In this article, I’ll talk about how I use it to review my blog posts. 
It’s very straightforward to use LLMs since you just interact with them using natural language, but I think there are some details worth sharing.]]></summary></entry><entry><title type="html">Why Consensus Shortcuts Fail in Distributed Systems</title><link href="https://www.binwang.me/2025-08-13-Why-Consensus-Shortcuts-Fail.html" rel="alternate" type="text/html" title="Why Consensus Shortcuts Fail in Distributed Systems" /><published>2025-08-13T00:00:00-04:00</published><updated>2025-08-13T00:00:00-04:00</updated><id>https://www.binwang.me/Why-Consensus-Shortcuts-Fail</id><content type="html" xml:base="https://www.binwang.me/2025-08-13-Why-Consensus-Shortcuts-Fail.html"><![CDATA[<p>There are some battle-tested consensus algorithms like Paxos and Raft to ensure a consistent view will be reached in a distributed system, even with scenarios like node failure, network partition, clock skew, and so on. There are existing systems like etcd and Zookeeper that implement these algorithms so you can use them as external systems. But sometimes, you may need to embed the algorithm into your own system instead of relying on a third-party one (it may be more common than you think—more on that in my next blog). Sometimes, because of the limitations of existing systems or other reasons, it may be hard to fully implement the algorithm, so people may tend to cut some corners and think it will be okay. While Raft and Paxos are not the only consensus algorithms, creating new ones is error-prone. This article explores how partial implementations fail, even when leveraging existing Raft systems incorrectly. We will use Raft as an example instead of other algorithms like Paxos, since it’s easier to understand and has better descriptions of real-world systems.</p>

<h2 id="goal-of-the-system">Goal of the System</h2>

<p>Before going to the details, let’s define what the goal of the system or algorithm we are going to implement is, because a consensus system can mean many things.</p>

<p>I’ll quote from Raft’s paper <a href="https://raft.github.io/raft.pdf">In Search of an Understandable Consensus Algorithm</a> for the properties of the system:</p>

<blockquote>
  <p>Consensus algorithms for practical systems typically have the following properties:</p>

  <ul>
    <li>They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.</li>
    <li>They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.</li>
    <li>They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.</li>
    <li>In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.</li>
  </ul>
</blockquote>

<p>To be clearer, by “ensure safety”, it means to guarantee <a href="https://jepsen.io/consistency/models/linearizable">linearizability</a>. And by non-Byzantine conditions, I’ll quote from the Paxos paper <em>Paxos Made Simple</em>:</p>

<blockquote>
  <p>We use the customary asynchronous, non-Byzantine model, in which:</p>

  <ul>
    <li>Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.</li>
    <li>Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.</li>
  </ul>
</blockquote>

<p>When we explore what happens if we don’t fully implement Raft, we will refer to these failure modes very often since that’s when interesting things happen.</p>

<h2 id="what-does-the-system-do">What Does the System Do</h2>

<p>Paxos only discusses the consensus of a single value in a distributed system. But as the Raft paper noted, it’s not very practical in real-world systems. There are algorithms that leverage Paxos, like Multi-Paxos, but they are hard to understand. So here we will follow Raft’s direction to describe a system that guarantees each node has a consistent view of logs. The logs are like write-ahead logs of databases like PostgreSQL and MySQL. Once each node has them and they are guaranteed to be consistent, they can just apply the logs to have a consistent state.</p>

<p>Raft elects a leader to answer all client requests to simplify the algorithm. Log replication is from the leader to other nodes.</p>

<p>So in the following section, we will discuss how it will affect the consistency of logs across different nodes if we miss some parts of the Raft algorithm. We will use PostgreSQL as an example when discussing it, since its write-ahead log (WAL) is like Raft logs but missing some information. We will see how this difference will make it hard to implement a linearizable, highly available PostgreSQL cluster.</p>

<h2 id="no-guarantee-of-a-single-leader">No Guarantee of A Single Leader</h2>

<p>An important thing to notice before we go further is that there is no guarantee that only a single node is the leader at any given time. More specifically, there is no guarantee some operations are only done by a single node at a given time. Let’s consider the following code for operations the leader does:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (is_leader()) {
  do_something()
}
</code></pre></div></div>

<p>As described above for the non-Byzantine model, a process may pause for arbitrarily long. So if the process is paused after <code class="language-plaintext highlighter-rouge">is_leader()</code> returns true, in order for the system to still be available, another leader should be elected. But when the old leader resumes and comes back, it thinks it’s still the leader since from its perspective, it just checked that. You can add more checks before <code class="language-plaintext highlighter-rouge">do_something()</code>, like checking the time passed since <code class="language-plaintext highlighter-rouge">is_leader()</code> and so on, but the pause can always happen after all the checks and just before <code class="language-plaintext highlighter-rouge">do_something()</code>.</p>

<p>Not only can process pauses cause issues. If you are doing any network requests in <code class="language-plaintext highlighter-rouge">do_something</code>, because of network latency, when the requests reach the other node, a new node may be elected as a leader by then.</p>

<p>This is not only something that can happen in theory. Some language runtimes have built-in garbage collectors that are known to cause long pauses in some scenarios. The operating system can also be too busy to schedule the process back onto the CPU. If the process is running in a virtual machine, the host can also pause the virtual machine. To try to resolve it, some systems delay the new leader election when the old one loses contact, in the hope that the process pause or the network delay would be over by then. However, there is really no guarantee on the length of a process pause. So delaying the leader election only makes the problem less likely, not impossible.</p>
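
<p>To make the check-then-act gap concrete, here is a minimal Python sketch. Everything in it is hypothetical: <code class="language-plaintext highlighter-rouge">Cluster</code> is a toy stand-in for whatever decides leadership, and the pause is simulated with a callback rather than a real GC or VM freeze.</p>

```python
class Cluster:
    """Toy stand-in for whatever decides leadership (hypothetical)."""

    def __init__(self, leader):
        self.leader = leader
        self.writes = []


def leader_write(cluster, node, value, pause=None):
    # Step 1: the check passes while `node` is still the leader.
    if cluster.leader != node:
        return False
    # Step 2: an arbitrarily long pause (GC, VM freeze, scheduling) happens here.
    if pause is not None:
        pause()
    # Step 3: the node acts on a check result that may now be stale.
    cluster.writes.append((node, value))
    return True


cluster = Cluster(leader="A")


def failover_to_b():
    # While A is paused, the cluster elects B, and B starts writing.
    cluster.leader = "B"
    leader_write(cluster, "B", "from-new-leader")


leader_write(cluster, "A", "from-old-leader", pause=failover_to_b)
print(cluster.writes)
```

<p>Both writes end up in the list, and the old leader’s write lands after the new leader’s, even though each node passed its own leadership check.</p>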

<h2 id="a-naive-implementation-with-existing-raft-system">A Naive Implementation with Existing Raft System</h2>

<p>The most naive implementation may be just leveraging an existing consensus system like etcd, which implements the Raft algorithm. However, with this approach, we get the same problem as described in the last section. Let’s say you use etcd to do an <code class="language-plaintext highlighter-rouge">is_leader</code> check every time you want to do something that only the leader can do: the process can pause after the check, so there is really no guarantee of a single leader, even when leveraging an existing system that implemented the Raft algorithm.</p>

<h2 id="leader-terms">Leader Terms</h2>

<p>Raft and similar systems resolve the problem by having a leader term for each new election. The leader terms are consecutive integers. And when the leader sends a request to replicate logs, it sends its current term with the request. On the other node, it can compare the highest term it knows. If the term it knows is higher than the one in the request, it means the request is sent by an old leader, thus it will reject the request.</p>
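
<p>The term comparison on the receiving node can be sketched in a few lines of Python. This is only the stale-term guard under simplified assumptions; the class and method names are made up for illustration, and real Raft performs more checks before appending an entry.</p>

```python
class Replica:
    def __init__(self):
        self.highest_term = 0  # highest leader term this replica has seen
        self.log = []

    def handle_replicate(self, term, entry):
        # A request carrying an older term must come from a deposed leader.
        if term < self.highest_term:
            return False
        self.highest_term = term
        self.log.append(entry)
        return True


replica = Replica()
accepted_new = replica.handle_replicate(term=2, entry="from leader of term 2")
accepted_old = replica.handle_replicate(term=1, entry="from leader of term 1")
print(accepted_new, accepted_old, replica.log)
```

<p>The request from the leader of term 1 arrives late and is rejected, no matter how long that leader was paused or how slow the network was.</p>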

<p>In terms of using an external system like etcd as described in the last section, we can use the leader term to guard operations. In etcd, each key has a revision, which plays basically the same role as a leader term. For example, if <code class="language-plaintext highlighter-rouge">do_something()</code> above is updating some rows in a database, you can check that the current term is not older than the term stored in the database. Here is a code example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>key = get_etcd_leader_key()
if (key != null) {
  revision = key.revision
  begin_db_transaction()
  if (get_revision_in_db() &lt;= revision) {
    do_something();
    update_revision_in_db(revision);
  }
  end_db_transaction();
}
</code></pre></div></div>

<p>Note how <code class="language-plaintext highlighter-rouge">get_revision_in_db</code>, <code class="language-plaintext highlighter-rouge">do_something</code> and <code class="language-plaintext highlighter-rouge">update_revision_in_db</code> happen in the same database transaction. Essentially, we are using another consistent system, the database, combined with the leader terms to make sure there is a single leader doing the operation.</p>
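
<p>As a runnable illustration of the same idea, here is a sketch that uses SQLite from the Python standard library as the consistent store. The table names and the <code class="language-plaintext highlighter-rouge">guarded_update</code> helper are assumptions made for this example, and the revision is passed in directly instead of being read from etcd.</p>

```python
import sqlite3


def guarded_update(conn, revision, new_value):
    """Apply the write only if `revision` is not older than the stored one."""
    conn.execute("BEGIN IMMEDIATE")  # read, check, and write in one transaction
    try:
        (stored,) = conn.execute("SELECT revision FROM fence").fetchone()
        if stored > revision:
            conn.execute("ROLLBACK")
            return False  # a newer leader has already written
        conn.execute("UPDATE data SET value = ?", (new_value,))
        conn.execute("UPDATE fence SET revision = ?", (revision,))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE fence (revision INTEGER NOT NULL)")
conn.execute("CREATE TABLE data (value TEXT)")
conn.execute("INSERT INTO fence VALUES (0)")
conn.execute("INSERT INTO data VALUES ('initial')")

fresh = guarded_update(conn, 7, "from leader at revision 7")
stale = guarded_update(conn, 5, "from leader at revision 5")
value = conn.execute("SELECT value FROM data").fetchone()[0]
print(fresh, stale, value)
```

<p>The second call simulates an old leader that wakes up after a pause: its revision is lower than the one already stored, so its write is dropped.</p>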

<h2 id="highly-available-postgresql-cluster">Highly Available PostgreSQL Cluster</h2>

<p>But what if we are implementing a distributed database system? There is no external database for us to depend on. For example, let’s say we are implementing a highly available cluster for PostgreSQL. Is there a way to use an external Raft system like etcd to implement a PostgreSQL cluster so that it meets the properties we discussed in the section “Goal of the System”?</p>

<p>PostgreSQL has tools to set up replication and to guarantee that a transaction is only considered committed once it has been replicated to a specific number of replicas. (The implementation also has problems, but we will ignore them for now. See the section “A Known Issue of PostgreSQL Replication” in <a href="/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html">my previous article</a> for details.) But it doesn’t have an auto-failover mechanism. So one needs to implement that part, and it’s actually very tricky.</p>

<p>One may think that with an existing Raft system like etcd, each node can have a process to monitor if it’s still the leader. And the nodes can reset the replication configuration based on who etcd says the leader is.</p>

<p>However, as we’ve already seen in the sections above, we cannot simply check who the current leader is. Otherwise, two nodes may very well think they are both the leader at the same time. So how should we resolve the problem?</p>

<h2 id="quorum">Quorum</h2>

<p>A way to reach consensus is to write a log entry to a majority of the nodes. Let’s say the log has an index along with its content, where the index is the position of the log and they are consecutive integers (no holes in the log entries). When a primary wants to replicate a log entry at index N, the replicas only accept it if index N is their next local index entry. With this, no matter who is the primary, only one log entry will be persisted on a majority of nodes for the same index N. And when we read data, we can also read from a quorum to make sure the data is on a majority of nodes when we return it to the client.</p>

<p>However, it has problems when we think about more than one log entry. Consider the following timeline of events:</p>

<ol>
  <li>We have 5 nodes A, B, C, D, E. Both A and C think they are the leader.</li>
  <li>A tries to replicate (index=1, content=X). At the same time, C tries to replicate (index=1, content=Y).</li>
  <li>A wins A, B while C wins C, D, E. So the current state is A, B has [X] while C, D, E has [Y]. But since the majority of nodes have [Y], we can ignore the version history of [X].</li>
  <li>A then tries to replicate the next log entry (index=2, content=X) even though it doesn’t get a majority from the last replication (more on this later). C tries to replicate the next log entry too with (index=2, content=Y).</li>
  <li>This time A wins a majority with A, B, D while C wins C, E. So the state becomes:
    <ul>
      <li>A: [X, X]</li>
      <li>B: [X, X]</li>
      <li>C: [Y, Y]</li>
      <li>D: [Y, X]</li>
      <li>E: [Y, Y]</li>
    </ul>
  </li>
</ol>

<p>We can see at step 5, there is no log history that has a majority.</p>
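
<p>The timeline above can be replayed with a small Python simulation. It is a sketch under the stated acceptance rule only (a node accepts an entry if its index is the next free position); the <code class="language-plaintext highlighter-rouge">replicate</code> helper and the choice of which nodes each message reaches are taken from the steps above, not from any real system.</p>

```python
from collections import Counter


def replicate(logs, index, content, reached):
    """Append `content` at `index` on every reached node that accepts it.

    A node accepts only if `index` is its next free position (1-based).
    """
    for node in reached:
        if len(logs[node]) == index - 1:
            logs[node].append(content)


logs = {node: [] for node in "ABCDE"}
replicate(logs, 1, "X", "AB")   # A's entry reaches A, B
replicate(logs, 1, "Y", "CDE")  # C's entry reaches C, D, E
replicate(logs, 2, "X", "ABD")  # A's next entry reaches A, B, D
replicate(logs, 2, "Y", "CE")   # C's next entry reaches C, E

histories = Counter(tuple(log) for log in logs.values())
majority = len(logs) // 2 + 1
has_majority = any(count >= majority for count in histories.values())
print(histories)
print(has_majority)
```

<p>Each index on its own has a value held by three nodes, but no complete history is shared by three nodes, so <code class="language-plaintext highlighter-rouge">has_majority</code> is false.</p>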

<p>You may have some questions about step 4: why does A try to replicate the next log entry even though the previous one was not acked by a majority? This is a normal performance optimization. Otherwise, each transaction needs to wait for the previous one to finish, which is very slow.</p>

<p>But let’s say for the sake of correctness, a node only replicates the next log entry when a majority acked its previous replications, problems can still happen:</p>

<ol>
  <li>We have 5 nodes A, B, C, D, E. Both A and C think they are the leader.</li>
  <li>A tries to replicate (index=1, content=X). At the same time, C tries to replicate (index=1, content=Y).</li>
  <li>A wins A, B while C wins C, D, E. So the current state is A, B has [X] while C, D, E has [Y]. But since the majority of nodes have [Y], we can ignore the version history of [X].</li>
  <li>However, let’s say C then starts to replicate (index=2, content=Z). A, B have no reason to reject it since index 2 is after their latest index 1. So A, B has [X, Z] while C, D, E has [Y, Z].</li>
</ol>

<p>From a log entries point of view, we still maintain a log history within a majority of the nodes, which is [Y, Z]. But the problem is that we need to apply the logs to the state. When this happens, the internal state of A, B is different from the state that [Y, Z] produces on C, D, E. And when we say “check the majority when reading data”, we cannot compare the whole log history or the whole state of each node, otherwise the performance would be very bad.</p>
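
<p>Here is a tiny Python sketch of why agreeing on the latest entry is not enough. The operations behind X, Y and Z are invented for illustration: X and Y set a value, and Z adds to it.</p>

```python
def apply(log):
    """Apply log entries, in order, to an integer state starting at 0."""
    state = 0
    for op, amount in log:
        if op == "set":
            state = amount
        elif op == "add":
            state += amount
    return state


X, Y, Z = ("set", 1), ("set", 2), ("add", 10)
logs = {"A": [X, Z], "B": [X, Z], "C": [Y, Z], "D": [Y, Z], "E": [Y, Z]}

last_entries = {log[-1] for log in logs.values()}
states = {node: apply(log) for node, log in logs.items()}
print(last_entries)  # a single distinct entry: every node agrees on index 2
print(states)
```

<p>A read that only compares the latest entry sees all five nodes agree on Z, yet A and B end up with a different state than C, D and E.</p>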

<p>But let’s ignore the performance again and compare the whole log history or state when reading—problems can still happen during a failover:</p>

<ol>
  <li>We have 5 nodes A, B, C, D, E. Both A and C think they are the leader.</li>
  <li>A tries to replicate (index=1, content=X). At the same time, C tries to replicate (index=1, content=Y).</li>
  <li>A wins A, B while C wins C, D, E. So the current state is A, B has [X] while C, D, E has [Y]. But since the majority of nodes have [Y], we can ignore the version history of [X].</li>
  <li>Node C failed and a new election needs to happen. The current state is:
    <ul>
      <li>A: [X]</li>
      <li>B: [X]</li>
      <li>C: failed so unknown.</li>
      <li>D: [Y]</li>
      <li>E: [Y]</li>
    </ul>
  </li>
</ol>

<p>So none of the log histories has a majority. It’s impossible to know which one to pick in this scenario. My previous blog post about <a href="/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html">Jepsen Test on Patroni</a> shows this flaw at some level: in the section “Failed to Recover the Cluster When Only 1 Out of 3 Nodes is Lost”, we can see it failed to recover even when only 1 node is lost. The implementation of Patroni itself has some problems, in that it doesn’t really enforce writes to a real majority. But even with the majority rule enforced, we can see it still cannot be sure which log history to go forward with under this scenario.</p>

<h2 id="save-the-last-log-entry-to-external-raft-system">Save the Last Log Entry to External Raft System</h2>

<p>What if we save the last log entry, including the log index and (the hash of) its content, to an external Raft system like etcd, and only consider it to be committed if it can successfully do so? So that when a new leader needs to be elected, it checks its eligibility by checking whether it has the latest committed log entry saved in etcd. Here are some more details to guarantee the correctness:</p>

<ul>
  <li>Only the leader can accept queries. It replicates log entries to a majority of nodes, then saves it to etcd. The log entry is only considered committed if those two operations succeed.</li>
  <li>When saving the latest log entry to etcd, it needs to check it’s still the leader in the same transaction, so that an older save request wouldn’t override a newer one.</li>
  <li>The leader should also tell the replicas the log entry is committed, so that the replicas can check with etcd and also commit it locally. If the entry index in etcd is already higher than the latest one a replica has locally, the replica should ask the leader to sync the log entries.</li>
  <li>When a replica receives a replication request, it only appends the entry on top of its committed local entries, and the index of the new entry should immediately follow its last committed entry. If the new index is lower than that, the replica rejects the request. If the new index is ahead by more than 1, the replica needs to ask the leader to sync its log entries.</li>
  <li>When a node wants to become a leader, it needs to check it has the latest log entry in etcd and acquire the leader key in the same etcd transaction.</li>
  <li>When reading data, it needs to confirm with etcd that the node has the latest log entry locally.</li>
  <li>The new leader should try to force sync its logs to other replicas to resolve conflicts. The replica needs to guard the sync request with the leader term: only accept requests with leader terms larger than the one stored locally. Otherwise, an old sync request may override newer committed log entries, which makes the committed log entries not replicated to a majority of nodes.</li>
</ul>
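
<p>Two of the rules above can be sketched as pure functions. This is only an in-memory model under my own assumptions, not a real etcd client: <code class="language-plaintext highlighter-rouge">EtcdState</code>, <code class="language-plaintext highlighter-rouge">onReplicate</code> and <code class="language-plaintext highlighter-rouge">saveLastEntry</code> are hypothetical names, and treating an index equal to the last committed one as rejected is also an assumption.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sealed trait AppendResult
case object Appended extends AppendResult
case object Rejected extends AppendResult // the index is not ahead of what is committed locally
case object NeedSync extends AppendResult // there is a gap: ask the leader to sync

// Replica-side check: the new entry must directly follow the last committed local entry.
def onReplicate(lastCommittedIndex: Long, newIndex: Long): AppendResult =
  if (newIndex &lt;= lastCommittedIndex) Rejected
  else if (newIndex == lastCommittedIndex + 1) Appended
  else NeedSync

// Stand-in for what is kept in etcd: the leader key and the last committed entry.
case class EtcdState(leader: String, lastIndex: Long, lastHash: String)

// Models the guarded etcd transaction: the entry is saved only if `node` still
// holds the leader key. Returns the new state on success, None otherwise.
def saveLastEntry(state: EtcdState, node: String, index: Long, hash: String): Option[EtcdState] =
  if (state.leader == node) Some(state.copy(lastIndex = index, lastHash = hash))
  else None
</code></pre></div></div>

<p>In a real system the second function would have to be a single etcd transaction, so that the leader check and the write cannot be interleaved with a leader change.</p>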

<p>You can see the algorithm is already complex. The complexity is even comparable to Raft, if not more so. Even with this complexity, I don’t have proof it is correct. At least I cannot find any problem for now. However, this approach comes with a performance penalty. Every log entry is essentially replicated to 2 systems: one for the system itself and one for the external Raft system. It will double the latency.</p>

<p>For systems like PostgreSQL, you cannot force it to replicate the write-ahead logs one by one without modifying its internals. The same applies to guarding the replication with an etcd check. So it’s really hard to implement this for a highly available PostgreSQL cluster, even without considering the performance issues.</p>

<h2 id="leader-terms-in-logs">Leader Terms in Logs</h2>

<p>The Raft algorithm resolves these problems by also including leader terms in the log entries and guarding on them when replicating log entries and when failing over. The paper already covers this in detail, so I won’t repeat it here. The point is, without the leader terms in the log structure, it’s really tricky to build such a consensus system. We’ll talk more about it in the last section.</p>

<h2 id="linearizable-guarantee-for-reads">Linearizable Guarantee for Reads</h2>

<p>One thing that can be overlooked is the leader’s behavior for guaranteeing linearizable reads, maybe because it’s discussed in a later section (Client interaction) of the Raft paper. There are 2 ways the leader can have stale data:</p>

<p>First, the Raft algorithm sends the last committed log ID in a separate request after replicating the log entries. So when a new leader is elected, there may be log entries that are already committed, while the new leader doesn’t yet know that they are. So the new leader needs to replicate a no-op log entry. During that replication, based on the Raft algorithm’s logic, the last committed log ID will be synced and caught up.</p>

<p>Then, when reading data from a Raft system, a node may think it’s still the leader but in fact there is a newer leader already elected. So before processing the read request, it needs to confirm it’s still the leader. The Raft algorithm does this by checking heartbeats between the leader and replicas.</p>
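
<p>The leadership check before a read can be sketched like this. It is a simplified model with hypothetical names (<code class="language-plaintext highlighter-rouge">stillLeader</code>, <code class="language-plaintext highlighter-rouge">serveRead</code>), and it assumes the acknowledgements counted are for heartbeats sent after the read request arrived.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The leader counts itself plus the followers that acknowledged its heartbeat,
// and only trusts its role if that reaches a majority of the cluster.
def stillLeader(followerAcks: Int, clusterSize: Int): Boolean =
  followerAcks + 1 &gt;= clusterSize / 2 + 1

// Run the read only when leadership is confirmed; otherwise report the failure
// so the client can retry against the current leader.
def serveRead[A](followerAcks: Int, clusterSize: Int)(read: =&gt; A): Either[String, A] =
  if (stillLeader(followerAcks, clusterSize)) Right(read)
  else Left("leadership not confirmed by a majority")
</code></pre></div></div>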

<p>Even etcd didn’t implement the linearizable reads at first. You can check more details from <a href="https://aphyr.com/posts/316-jepsen-etcd-and-consul">this Jepsen test</a>.</p>

<p>Using PostgreSQL as an example again, let’s say we somehow make log replication consistent across the nodes. We would still need to check the leader’s role on every read. This is not possible without a layer in front of the actual PostgreSQL transaction, either through PostgreSQL extensions or through a proxy.</p>

<h2 id="conclusion">Conclusion</h2>

<p>We talked about some different implementations of a consensus system and how things can go wrong with them. We can see how hard it can be to implement such a system without considering the design as a whole. Using PostgreSQL as an example, even though it has a built-in replication mechanism, without an overall design that considers distributed consistency and failover, it’s hard to implement a linearizable system without changes to its internals.</p>

<p>But a linearizable system may not always be needed. Sometimes we may want a less strict model, like only guaranteeing read-your-own-writes. It still needs significant considerations tho. Maybe that can be a topic for future discussion.</p>]]></content><author><name></name></author><category term="distributed system" /><category term="raft" /><category term="paxos" /><category term="etcd" /><category term="postgresql" /><summary type="html"><![CDATA[There are some battle-tested consensus algorithms like Paxos and Raft to ensure a consistent view will be reached in a distributed system, even with scenarios like node failure, network partition, clock skew, and so on. There are existing systems like etcd and Zookeeper that implement these algorithms so you can use them as external systems. But sometimes, you may need to embed the algorithm into your own system instead of relying on a third-party one (it may be more common than you think—more on that in my next blog). Sometimes, because of the limitations of existing systems or other reasons, it may be hard to fully implement the algorithm, so people may tend to cut some corners and think it will be okay. While Raft and Paxos are not the only consensus algorithms, creating new ones is error-prone. This article explores how partial implementations fail, even when leveraging existing Raft systems incorrectly. 
We will use Raft as an example instead of other algorithms like Paxos, since it’s easier to understand and has better descriptions of real-world systems.]]></summary></entry><entry><title type="html">Improve Books Section of My Blog</title><link href="https://www.binwang.me/2025-07-02-Improve-Books-Section-of-My-Blog.html" rel="alternate" type="text/html" title="Improve Books Section of My Blog" /><published>2025-07-02T00:00:00-04:00</published><updated>2025-07-02T00:00:00-04:00</updated><id>https://www.binwang.me/Improve-Books-Section-of-My-Blog</id><content type="html" xml:base="https://www.binwang.me/2025-07-02-Improve-Books-Section-of-My-Blog.html"><![CDATA[<p>In my blog, I have a page called <a href="/read.html">Read</a> to keep the books I’ve read. At first, I used an API from <a href="https://book.douban.com">Douban</a> (a Chinese social media that has reviews and metadata of books, movies, TV shows). I was really happy about it since it’s easy to find and add books into a list on Douban, while still being able to render them in a custom layout. But a few years ago (cannot remember the exact time), the API became unavailable without an API key, and it’s impossible to find a way to get one. So I moved my reading list to Goodreads.</p>

<p>Instead of using an API to render the books with a custom layout, I linked the page directly to the Goodreads book collection I created. It was the fastest way to recover the page. Goodreads has a few pain points, including the lack of info for some Chinese books, but it was acceptable to me. I thought I could do the custom rendering with its API in the future, only to see Goodreads close its public API as well. Since then, I have always wanted to migrate the page to something else, but didn’t really have the time until recently.</p>

<p>You should already be able to see the finished work when clicking the “Read” link at the top of the page. In this article, I’ll talk about how it’s implemented and the thoughts behind it.</p>

<h2 id="script-to-download-book-information">Script to Download Book Information</h2>

<p>Instead of migrating to another platform and using APIs to render the page, I think it’s better to take control of the data this time. So I wrote a Scala script to download the book information from websites like Goodreads and Douban. It saves the information into the YAML header of a Markdown file, which is a format Jekyll can read and use in a template. Here is an example of the output:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">layout</span><span class="pi">:</span> <span class="s">book</span>
<span class="na">title</span><span class="pi">:</span> <span class="s2">"</span><span class="s">When</span><span class="nv"> </span><span class="s">America</span><span class="nv"> </span><span class="s">First</span><span class="nv"> </span><span class="s">Met</span><span class="nv"> </span><span class="s">China:</span><span class="nv"> </span><span class="s">An</span><span class="nv"> </span><span class="s">Exotic</span><span class="nv"> </span><span class="s">History</span><span class="nv"> </span><span class="s">of</span><span class="nv"> </span><span class="s">Tea,</span><span class="nv"> </span><span class="s">Drugs,</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">Money</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">Age</span><span class="nv"> </span><span class="s">of</span><span class="nv"> </span><span class="s">Sail"</span>
<span class="na">authors</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">Eric</span><span class="nv"> </span><span class="s">Jay</span><span class="nv"> </span><span class="s">Dolin"</span><span class="pi">]</span>
<span class="na">isbn</span><span class="pi">:</span> <span class="s2">"</span><span class="s">9780871404336"</span>
<span class="na">cover</span><span class="pi">:</span> <span class="s2">"</span><span class="s">https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1337692965i/13812169.jpg"</span>
<span class="na">year</span><span class="pi">:</span> <span class="m">2012</span>
<span class="na">external_links</span><span class="pi">:</span>
  <span class="na">goodreads</span><span class="pi">:</span> <span class="s">https://www.goodreads.com/book/show/13812169-when-america-first-met-china</span>
  <span class="na">date_read</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2025-06-22"</span>
<span class="nn">---</span>
</code></pre></div></div>

<p>The script is available in <a href="https://github.com/wb14123/blog/blob/master/add-book/add-book.sc">my blog’s repo</a>. With a command like</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./add-book/add-book.sc &lt;book_url&gt; jekyll/_books
</code></pre></div></div>

<p>It can download the book info files into the target directory <code class="language-plaintext highlighter-rouge">jekyll/_books</code>. Currently it can take a URL of a single book from Goodreads, or a book collection link from Goodreads, or a single book link from Douban.</p>

<h2 id="render-book-list-with-jekyll">Render Book List with Jekyll</h2>

<p>Once the book information is put into the blog’s folder, it’s easy to let Jekyll recognize it and render it.</p>

<p>First let’s add the folder as a collection in <code class="language-plaintext highlighter-rouge">_config.yaml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">collections</span><span class="pi">:</span>
  <span class="na">books</span><span class="pi">:</span>
    <span class="na">output</span><span class="pi">:</span> <span class="kc">true</span>
</code></pre></div></div>

<p>After this you can use the variable <code class="language-plaintext highlighter-rouge">site.books</code> to get all the files under the <code class="language-plaintext highlighter-rouge">_books</code> directory. And you can use each field in the file like <code class="language-plaintext highlighter-rouge">title</code> and <code class="language-plaintext highlighter-rouge">authors</code> as well. The template for the “Read” page is in the GitHub repo’s <a href="https://github.com/wb14123/blog/blob/master/jekyll/read.html">read.html</a>.</p>

<p>With <code class="language-plaintext highlighter-rouge">output</code> set to <code class="language-plaintext highlighter-rouge">true</code>, it will also create a page for each of the books. If only showing the list of books is needed, <code class="language-plaintext highlighter-rouge">output</code> above can be set to <code class="language-plaintext highlighter-rouge">false</code>. But this option gives me a new idea: why not also write my reading notes in each book’s page?</p>

<h2 id="book-reading-notes">Book Reading Notes</h2>

<p>I’ve long wanted to write some notes on the books I’ve read on my blog. But most of the time it’s just not rich enough to become an article. I was also really hesitant to write it on third-party websites like Goodreads or Douban. So a lot of the thoughts just got lost.</p>

<p>With this new script and folder format, it gives me a new idea to write the book notes in a separate section of my blog, separated from my main blog posts, much like what I have done for the <a href="/snippets">Snippets</a> page which are shorter random posts powered by Mastodon.</p>

<p>It’s very easy to do that technically: just write the notes in the same file after the <code class="language-plaintext highlighter-rouge">---</code> separator of metadata. With <code class="language-plaintext highlighter-rouge">output</code> set to <code class="language-plaintext highlighter-rouge">true</code> as in <code class="language-plaintext highlighter-rouge">_config.yaml</code> above, it’s just a matter of adding a new layout template like <a href="https://github.com/wb14123/blog/blob/master/jekyll/_layouts/book.html">_layouts/book.html</a> to render the notes pages.</p>

<p>I don’t think it’s worth writing notes for every book. So in the read list, I only add a link to the note when there is one. The individual book pages will still be created by Jekyll even without note content, but I’m okay with that since the users don’t see them. I also show how long the note is in the book list, so that it can save the visitors some time if they only care about in-depth notes.</p>

<p>I don’t have the interest to backfill the notes for all the books I’ve read, so I only did it for the recent ones. I was thinking of just writing a few sentences for each book, but it surprised me how much and how in-depth I ended up writing. It’s also very fun to write since it’s very casual. It makes me realize I truly lost a lot of thoughts by not having a place to write book notes. Hopefully I can read and write more in the future and provide more insights to other readers.</p>

<h2 id="other-considerations">Other Considerations</h2>

<p>I thought about how the visitors of my blog can find updates to the book notes. I don’t think they deserve to be listed on the index page (which currently has all the posts in order of publish time), since that would make the writing less casual. Probably the best option is to have a separate RSS feed for them.</p>

<p>I’m also considering how to best organize the books I’m reading, and books I’ve read a little bit but abandoned. It’s valuable to keep notes on a book while I’m still reading it, but I’m not sure whether they are worth showing here on the website. Maybe that’s a place for my private notes, and I can publish them once I’ve finished reading. For abandoned books, maybe I can keep a special mark in the book info and have some short comments in the notes.</p>]]></content><author><name></name></author><category term="blog" /><category term="book" /><category term="reading" /><category term="life" /><summary type="html"><![CDATA[In my blog, I have a page called Read to keep the books I’ve read. At first, I used an API from Douban (a Chinese social media that has reviews and metadata of books, movies, TV shows). I was really happy about it since it’s easy to find and add books into a list on Douban, while still being able to render them in a custom layout. But a few years ago (cannot remember the exact time), the API became unavailable without an API key, and it’s impossible to find a way to get one. So I moved my reading list to Goodreads.]]></summary></entry><entry><title type="html">Use OpenAPI Instead of MCP for LLM Tools</title><link href="https://www.binwang.me/2025-04-27-Use-OpenAPI-Instead-of-MCP-for-LLM-Tools.html" rel="alternate" type="text/html" title="Use OpenAPI Instead of MCP for LLM Tools" /><published>2025-04-27T00:00:00-04:00</published><updated>2025-04-27T00:00:00-04:00</updated><id>https://www.binwang.me/Use-OpenAPI-Instead-of-MCP-for-LLM-Tools</id><content type="html" xml:base="https://www.binwang.me/2025-04-27-Use-OpenAPI-Instead-of-MCP-for-LLM-Tools.html"><![CDATA[<p><a href="https://modelcontextprotocol.io/introduction">Model Context Protocol (MCP)</a> is being adopted by more and more people as a way to integrate tools to LLMs easily. However, I find it unintuitive and unnecessarily complex. 
So in this article, I’ll explore how to use existing <a href="https://www.openapis.org/">OpenAPI</a> servers as tools of LLMs instead of writing the functions in a completely new protocol. This can potentially become a simpler standard that only needs implementing an additional authentication flow (if authentication is needed).</p>

<p>For people who are not familiar with OpenAPI, it’s a formal way to describe HTTP APIs. You may have heard of Swagger, which is basically the same thing. Many HTTP frameworks support it, so you can generate a structured document in JSON or YAML format and view it in tools like <a href="https://editor.swagger.io/">Swagger Editor</a>. Because the document is structured, it’s well suited to be fed into an LLM as tool definitions.</p>

<p>The final result is in the repo <a href="https://github.com/wb14123/ai-tool-proto-experiment">ai-tool-proto-experiment</a>. It’s a single file Scala script with less than 300 lines of code.
It doesn’t use any LLM SDK, just simple HTTP calls to the LLM providers. It doesn’t use any advanced API either. Only chat completion API with structured output is needed.</p>

<h2 id="goals-and-non-goals">Goals and Non-Goals</h2>

<p>Tool server is only part of MCP. It has more features like prompts and resources. Personally I don’t see much benefit to include so many use cases in a single protocol.</p>

<p>For example, prompts are just a server API through which you can get all the pre-defined prompts. That’s very easy to implement with any protocol and there is really no need to combine it with the LLM tool protocol.</p>

<p>So in this article, we will only explore how to integrate other services as tools of LLM, without caring about the other parts of MCP like prompts and resources.</p>

<p>There are also some other things that MCP doesn’t resolve, like security. <a href="https://elenacross7.medium.com/%EF%B8%8F-the-s-in-mcp-stands-for-security-91407b33ed6b">This post</a> summarizes lots of security problems in MCP, and I don’t think there is an easy way around them even if we use existing protocols like OpenAPI. So the goal of this experiment is only to use trusted OpenAPI servers, without worrying about attacks like tool shadowing. With that said, authentication to the OpenAPI server is still necessary, which protects the server rather than the client. MCP only added authentication to the spec recently. As you can see later, the authentication workflow I tried here is much simpler and more generic.</p>

<p>Lastly, using as few LLM API features as possible is also a goal, so that it’s easier to port the implementation to other LLM providers.</p>

<h2 id="the-implementation">The Implementation</h2>

<p>Using something like OpenAPI is not a new idea. I’ve seen multiple people mention it in places like Hacker News. And during my implementation, I also found that <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, a tool I self-host and use daily, has added support for using OpenAPI servers as tools. Nevertheless, I still wanted to experiment with my own implementation, because I want to keep it as simple as possible and also learn more about the capabilities of such an approach.</p>

<p>In the experiment, I tried both a simple <a href="https://github.com/open-webui/openapi-servers/tree/main/servers/weather">open source weather OpenAPI server</a>, and my own project <a href="https://www.rssbrain.com/">RSS Brain</a>. I’ll try to explain how it is implemented and talk about an experiment result at the end.</p>

<h2 id="define-the-tool-calling-structure">Define the Tool Calling Structure</h2>

<p>Lots of LLM providers support tool calling APIs. We will avoid using those APIs to keep things simpler and to make the approach work with other LLMs, including self-hosted ones. Instead, we define our own JSON schema that we want the LLM to follow and feed it in as part of the system prompt, and also use the structured output API to force the LLM response to follow the JSON schema. I said in the beginning that I want to use as few features as possible, but I think structured output is important enough to use in addition to basic chat completion. Fortunately, lots of other LLM runtimes, including local ones like Ollama, also support this feature.</p>

<p>Here is the response structure we want, in the form of Scala class definitions:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">case</span> <span class="k">class</span> <span class="nc">ToolParam</span><span class="o">(</span>
    <span class="n">httpRequestEndpoint</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
    <span class="n">httpRequestPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
    <span class="n">httpRequestHeaders</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]],</span>
    <span class="n">httpRequestMethod</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
    <span class="n">httpPostBody</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">String</span><span class="o">],</span>
<span class="o">)</span>

<span class="k">case</span> <span class="k">class</span> <span class="nc">ChatResponse</span><span class="o">(</span>
    <span class="n">callTool</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">ToolParam</span><span class="o">]</span> <span class="k">=</span> <span class="nc">None</span><span class="o">,</span>
    <span class="n">toUser</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">None</span><span class="o">,</span>
<span class="o">)</span>
</code></pre></div></div>

<p>The LLM should either respond to the user directly using the <code class="language-plaintext highlighter-rouge">toUser</code> field, or ask the agent to call an HTTP API with the <code class="language-plaintext highlighter-rouge">callTool</code> field. You can see the <code class="language-plaintext highlighter-rouge">ToolParam</code> definition is pretty generic: it can basically do any HTTP call.</p>
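
<p>Since the two fields are both optional, the agent has to decide what to do with a decoded response. The following is a minimal sketch of that decision, not necessarily how the script in the repo does it: the <code class="language-plaintext highlighter-rouge">Action</code> types and <code class="language-plaintext highlighter-rouge">toAction</code> are hypothetical names, and the case classes are repeated only to keep the snippet self-contained.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case class ToolParam(
    httpRequestEndpoint: String,
    httpRequestPath: String,
    httpRequestHeaders: Option[Map[String, String]],
    httpRequestMethod: String,
    httpPostBody: Option[String],
)

case class ChatResponse(
    callTool: Option[ToolParam] = None,
    toUser: Option[String] = None,
)

// Hypothetical types: the single thing the agent should do next.
sealed trait Action
case class CallTool(param: ToolParam) extends Action
case class ReplyToUser(text: String) extends Action

// Turn a decoded response into exactly one action, or an error message
// that can be fed back to the LLM so it can correct itself.
def toAction(resp: ChatResponse): Either[String, Action] =
  (resp.callTool, resp.toUser) match {
    case (Some(param), None) =&gt; Right(CallTool(param))
    case (None, Some(text))  =&gt; Right(ReplyToUser(text))
    case (Some(_), Some(_))  =&gt; Left("Only one of `callTool` and `toUser` should be filled.")
    case (None, None)        =&gt; Left("One of `callTool` and `toUser` must be filled.")
  }
</code></pre></div></div>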

<p>For OpenAI, the structured output API only accepts a subset of the JSON schema definition. So instead of converting the structure to a JSON schema with a single line of Scala code, I need to manually write an OpenAI-compatible one.</p>

<p>I also find that OpenAI models, gpt-4o-latest at least, often fail to generate responses that meet the structure requirement even when structured output is enabled. You still need to include the JSON schema in the system prompt for the best chance of a conforming response.</p>

<p>Overall, here is the system prompt to let the system use the tools:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">systemPrompt</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="o">{</span>
  <span class="k">val</span> <span class="nv">timeStr</span> <span class="k">=</span> <span class="nv">ZonedDateTime</span><span class="o">.</span><span class="py">now</span><span class="o">().</span><span class="py">format</span><span class="o">(</span><span class="nv">DateTimeFormatter</span><span class="o">.</span><span class="py">ISO_ZONED_DATE_TIME</span><span class="o">)</span>
  <span class="n">s</span><span class="s">"""You are a helpful assistant.
     |
     |The current time is $timeStr.
     |
     |You have many tools to use by sending a http request to some API servers. Your response must be Json that
     |follows the Json schema definition:
     |
     |$chatResponseSchemaStr
     |
     |Either request a call to one of the APIs with `callTool` field, or
     |response to user directly with `toUser` field if there is no need to request to any tool or you need more
     |information from the user.
     |
     |Each tool has an optional authUrl that you can ask the user to open in the browser. If you get authentication
     | related errors when calling a tool, ask the user to open the authUrl in browser and copy the instruction back,
     | then use the instruction to try authentication again.
     |
     |Important:
     |
     |* Respond only the JSON body. Never quote the response in something like json```...```.
     |* Never respond to user directly without using the `toUser` field with a JSON response.
     |* Only one of `callTool` and `toUser` field should be filled.
     |* Always include the `http` or `https` part for the `httpRequestEndpoint` field.
     |
     |"""</span><span class="o">.</span><span class="py">stripMargin</span>
<span class="o">}</span>
</code></pre></div></div>

<p>You can see there are some extra points at the end, which cover the cases where I find the model stumbles most often.</p>
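
<p>One of those cases, the response being wrapped in a Markdown code fence, could also be guarded against on the client side before decoding the JSON. This is only a defensive sketch and an assumption on my part rather than something the script is known to do; <code class="language-plaintext highlighter-rouge">stripCodeFence</code> is a hypothetical helper.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Remove a Markdown code fence wrapped around the JSON body, if there is one.
def stripCodeFence(raw: String): String = {
  val trimmed = raw.trim
  if (trimmed.length &gt;= 6 &amp;&amp; trimmed.startsWith("```") &amp;&amp; trimmed.endsWith("```")) {
    val inner = trimmed.drop(3).dropRight(3)
    val firstLineEnd = inner.indexOf('\n')
    // Drop the opening fence line only when it is empty or a language tag such as "json".
    val body =
      if (firstLineEnd &gt;= 0 &amp;&amp; inner.take(firstLineEnd).trim.forall(_.isLetterOrDigit))
        inner.drop(firstLineEnd + 1)
      else inner
    body.trim
  } else trimmed
}
</code></pre></div></div>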

<h2 id="feed-the-tool-information-into-llm">Feed the Tool Information Into LLM</h2>

<p>Since OpenAPI can generate a structured document for the API server, either in JSON or YAML, we can feed the document directly to the LLM. In addition to the doc endpoint, we also need to provide the endpoint of the API server, and an optional <code class="language-plaintext highlighter-rouge">authUrl</code> that we will talk about later. Here is the definition of the tool as a Scala class, with the prompt:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">case</span> <span class="k">class</span> <span class="nc">ToolDef</span><span class="o">(</span>
    <span class="n">httpEndpoint</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
    <span class="n">openAPIPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
    <span class="n">authUrl</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">None</span><span class="o">,</span>
<span class="o">)</span> <span class="o">{</span>
  <span class="k">def</span> <span class="nf">prompt</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="o">{</span>
    <span class="k">val</span> <span class="nv">authUrlPrompt</span> <span class="k">=</span> <span class="nv">authUrl</span><span class="o">.</span><span class="py">map</span><span class="o">(</span><span class="n">url</span> <span class="k">=&gt;</span> <span class="n">s</span><span class="s">"Tool login URL: $url\n"</span><span class="o">).</span><span class="py">getOrElse</span><span class="o">(</span><span class="s">""</span><span class="o">)</span>
    <span class="n">s</span><span class="s">"""----
       |Tool server endpoint: $httpEndpoint
       |
       |$authUrlPrompt
       |Tool's OpenAPI definition:
       |$openAPIDef
       |
       |----
       |
       |"""</span><span class="o">.</span><span class="py">stripMargin</span>
  <span class="o">}</span>

  <span class="k">private</span> <span class="k">def</span> <span class="nf">openAPIDef</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="o">{</span>
    <span class="nv">requests</span><span class="o">.</span><span class="py">get</span><span class="o">(</span><span class="n">httpEndpoint</span> <span class="o">+</span> <span class="n">openAPIPath</span><span class="o">).</span><span class="py">text</span><span class="o">()</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>After the system prompt, the tools prompt is sent as the first chat message to the LLM with the role <code class="language-plaintext highlighter-rouge">developer</code>. I find this works better than putting it into the system prompt, maybe because the tool definitions can sometimes be very long:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nv">tools</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">(</span>
  <span class="nc">ToolDef</span><span class="o">(</span><span class="n">httpEndpoint</span> <span class="k">=</span> <span class="s">"https://grpc-gateway.rssbrain.com"</span><span class="o">,</span> <span class="n">openAPIPath</span> <span class="k">=</span> <span class="s">"/swagger.json"</span><span class="o">,</span>
    <span class="n">authUrl</span> <span class="k">=</span> <span class="nc">Some</span><span class="o">(</span><span class="s">"http://app.rssbrain.com/login?redirect_url=/llm_auth"</span><span class="o">)),</span>
<span class="o">)</span>
<span class="k">val</span> <span class="nv">toolsPrompt</span> <span class="k">=</span> <span class="nv">tools</span><span class="o">.</span><span class="py">map</span><span class="o">(</span><span class="nv">_</span><span class="o">.</span><span class="py">prompt</span><span class="o">).</span><span class="py">mkString</span><span class="o">(</span><span class="s">"\n"</span><span class="o">)</span>

<span class="k">val</span> <span class="nv">req</span> <span class="k">=</span> <span class="nc">ChatRequest</span><span class="o">(</span>
  <span class="n">messages</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">(</span>
    <span class="nc">ChatMessage</span><span class="o">(</span><span class="n">role</span> <span class="k">=</span> <span class="s">"system"</span><span class="o">,</span> <span class="n">content</span> <span class="k">=</span> <span class="n">systemPrompt</span><span class="o">),</span>
    <span class="nc">ChatMessage</span><span class="o">(</span><span class="n">role</span> <span class="k">=</span> <span class="s">"developer"</span><span class="o">,</span> <span class="n">content</span> <span class="k">=</span>
      <span class="n">s</span><span class="s">"""
        |
        |Here are the OpenAPI definition of the tools:
        |
        |$toolsPrompt
        |
        |"""</span><span class="o">.</span><span class="py">stripMargin</span><span class="o">),</span>
  <span class="o">),</span>
<span class="o">)</span>
<span class="nf">loop</span><span class="o">(</span><span class="n">req</span><span class="o">,</span> <span class="nc">None</span><span class="o">,</span> <span class="n">waitForUser</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span>
</code></pre></div></div>
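
<p>The body of the loop isn’t shown here, so the following is only a rough sketch of what such an agent loop could look like, not the one in the repo. All names (<code class="language-plaintext highlighter-rouge">runUntilReply</code>, <code class="language-plaintext highlighter-rouge">askLlm</code>, <code class="language-plaintext highlighter-rouge">callTool</code>) are hypothetical, the LLM call and the HTTP call are passed in as functions so the snippet stays self-contained, and feeding tool results back with the <code class="language-plaintext highlighter-rouge">user</code> role is an assumption.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>case class ChatMessage(role: String, content: String)

sealed trait Action
case class CallTool(url: String) extends Action
case class ReplyToUser(text: String) extends Action

// Keep asking the LLM until it answers the user, feeding every tool result
// (including error bodies such as authentication failures) back as a new message.
def runUntilReply(
    history: Seq[ChatMessage],
    askLlm: Seq[ChatMessage] =&gt; Action,
    callTool: String =&gt; String,
    maxToolCalls: Int = 5,
): (Seq[ChatMessage], String) =
  askLlm(history) match {
    case ReplyToUser(text) =&gt;
      (history :+ ChatMessage("assistant", text), text)
    case CallTool(_) if maxToolCalls &lt;= 0 =&gt;
      (history, "Stopped: too many tool calls.") // guard against endless tool calling
    case CallTool(url) =&gt;
      val result = callTool(url)
      val next = history :+ ChatMessage("user", s"Tool result: $result")
      runUntilReply(next, askLlm, callTool, maxToolCalls - 1)
  }
</code></pre></div></div>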

<h2 id="authentication">Authentication</h2>

<p>As you can see from the system prompt above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Each tool has an optional authUrl that you can ask the user to open in the browser. If you get authentication
related errors when calling a tool, ask the user to open the authUrl in browser and copy the instruction back,
then use the instruction to try authentication again.
</code></pre></div></div>

<p>We actually take advantage of the flexibility of LLMs for our authentication flow: we define an <code class="language-plaintext highlighter-rouge">authUrl</code> for a tool server, which the user can open in the browser. The URL runs the necessary authentication flow, then returns a natural language description of the credentials and how to use them to authenticate with the APIs.</p>

<p>Ideally, the natural language instruction should be passed to the client in a secure way, for example, through a callback to a local URL that the client serves. But to keep the experiment simple, I just ask the user to copy the instruction back into the conversation.</p>
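
<p>As a rough illustration of the callback idea, here is a minimal sketch of a client that serves a local URL and waits for the instruction. It is not part of the experiment: the <code class="language-plaintext highlighter-rouge">AuthCallback</code> object, the <code class="language-plaintext highlighter-rouge">/llm_auth_callback</code> path, and the assumption that the auth page would POST the instruction as the request body are all made up for this example. It only uses the HTTP server built into the JDK and needs Java 9 or later for <code class="language-plaintext highlighter-rouge">readAllBytes</code>.</p>

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets.UTF_8
import java.util.concurrent.CompletableFuture

object AuthCallback {
  // Hypothetical path that the auth page would POST the instruction to.
  val CallbackPath = "/llm_auth_callback"

  // Starts an HTTP server bound to the loopback interface only.
  // Pass port = 0 to let the OS choose a free port.
  // The returned future completes with the first request body received.
  def start(port: Int): (HttpServer, CompletableFuture[String]) = {
    val instruction = new CompletableFuture[String]()
    val server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0)
    server.createContext(CallbackPath, new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Read the natural language instruction sent by the auth page.
        val body = new String(exchange.getRequestBody.readAllBytes(), UTF_8)
        val reply = "ok".getBytes(UTF_8)
        exchange.sendResponseHeaders(200, reply.length.toLong)
        exchange.getResponseBody.write(reply)
        exchange.close()
        instruction.complete(body) // later requests do not change the value
        ()
      }
    })
    server.start()
    (server, instruction)
  }
}
```

<p>The agent could call <code class="language-plaintext highlighter-rouge">AuthCallback.start(0)</code>, read the chosen port from <code class="language-plaintext highlighter-rouge">server.getAddress.getPort</code>, wait on the future, add the received text to the conversation as if the user had pasted it, and then call <code class="language-plaintext highlighter-rouge">server.stop(0)</code>.</p>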

<p>So here is what it looks like in the example below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User input: Get all my RSS folders
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
</code></pre></div></div>

<p>The user asks for the RSS folders, so the LLM responds with a <code class="language-plaintext highlighter-rouge">callTool</code> action. When we try to call the HTTP API, it returns an authentication error. We feed the result back to the LLM, and it then responds to the user:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Assistant: It seems like your request for fetching RSS folders requires authentication. Please log in to your RSS Brain account and provide the token to proceed. You can open [this login page](http://app.rssbrain.com/login?redirect_url=/llm_auth) to login and obtain the necessary token.
</code></pre></div></div>

<p>You can see the LLM is asking the user to open a URL in the browser. When the user opens this URL, the service prompts the user to log in and then redirects to a page with a natural language instruction for the LLM. The user copies the instruction back to the chat:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User input: Use `token` param in the APIs to do authentication. Your current token is `XXXXX`.
</code></pre></div></div>

<p>After the user input is added to the conversation, the LLM knows how to fill in the <code class="language-plaintext highlighter-rouge">callTool</code> params with the necessary authentication info, and the call finally succeeds:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: Here are your RSS folders:
...
</code></pre></div></div>

<p>This authentication flow is very flexible: the tool server can implement basically any kind of authentication method, as long as it provides a URL that returns the authentication instruction along with the credentials. You can even create a third-party auth server if the original OpenAPI server doesn’t provide this auth workflow, so that you can integrate any OpenAPI service that requires authentication.</p>

<h2 id="results">Results</h2>

<p>Okay, everything is ready now. Let’s try to use it. Follow the <a href="https://github.com/wb14123/ai-tool-proto-experiment">readme</a> to install the dependencies and run the script.</p>

<p>Here is a real-world example of using RSS Brain as one of the tool servers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-&gt; ./ai-tool.sc
Compiling /home/wangbin/hobbi_source/ai-tool-proto-experiment/ai-tool.sc
User input: Hi
Assistant: Hello! How can I assist you today?
User input: Get all my RSS folders
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: It seems like your request for fetching RSS folders requires authentication. Please log in to your RSS Brain account and provide the token to proceed. You can open [this login page](http://app.rssbrain.com/login?redirect_url=/llm_auth) to login and obtain the necessary token.
User input: Use `token` param in the APIs to do authentication. Your current token is `XXXXX`.
Calling tool https://grpc-gateway.rssbrain.com//rss.FolderAPI/GetMyFolders ...
Assistant: Here are your RSS folders:

1. **Default Folder:**
   - Name: Default
   - Description: user default folder
   - Layout: LIST
   - Order: TIME

2. **news-local:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

3. **local-forum:**
   - Description: N/A
   - Layout: LIST
   - Order: SCORE

4. **news-cn:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

5. **cn-podcast:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

6. **cn-media:**
   - Description: N/A
   - Layout: GRID
   - Order: TIME

7. **news:**
   - Description: News around world
   - Layout: LIST
   - Order: TIME

8. **news-Ukraine:**
   - Description: N/A
   - Search Term: "Ukraine" AND "Russia"
   - Layout: LIST
   - Order: TIME

9. **blogs:**
   - Description: N/A
   - Layout: LIST
   - Order: TIME

10. **tech-blogs:**
    - Description: N/A
    - Layout: LIST
    - Order: TIME

11. **tech-news:**
    - Description: N/A
    - Layout: LIST
    - Order: TIME

... (and more)

For a complete list or details about any specific folder, please let me know!
User input: What happened today based on articles in local-news?    
Calling tool https://grpc-gateway.rssbrain.com//rss.ArticleAPI/GetArticlesByFolderWithUserMarking ...
Assistant: Here are some notable articles from today based on the 'local-news' folder:

1. **A short history of airport gold heists (because Toronto's isn't the first)**
   - **Source:** Canada - CBC News
   - **Description:** Reflection on Toronto's Pearson International Airport's gold heist and similar historical events.
   - [Read more](https://www.cbc.ca/news/world/airport-gold-heists-history-1.6817567)

2. **Family of Nunavut woman says she was used to get Ontario twins Inuit status**
   - **Source:** Canada - CBC News
   - **Description:** A story about a family's reaction after a woman is allegedly used in a fraud to gain Inuit status for twins.
   - [Read more](https://www.cbc.ca/news/canada/north/ontaro-twins-inuit-status-claims-1.6814782)

3. **CBC partners with Markham Library to offer youth Radio Camp**
   - **Source:** Toronto - CBC News
   - **Description:** A creative initiative by CBC for youth in Markham, involving sound exploration and radio production.
   - [Read more](https://www.cbc.ca/news/canada/toronto/community/cbc-partners-with-markham-library-to-offer-youth-radio-camp-1.6806360)

4. **Federal government says it will restore funding to Hockey Canada — with conditions**
   - **Source:** Canada - CBC News
   - **Description:** Funding for Hockey Canada is restored with stipulations after controversy over handling of claims.
   - [Read more](https://www.cbc.ca/news/politics/hockey-canada-federal-government-funding-restored-1.6812492)

5. **Man dies after being hit by tow truck in York Region industrial accident, police say**
   - **Source:** Toronto - CBC News
   - **Description:** Tragic incident of a man being fatally struck by a tow truck, with ongoing investigation.
   - [Read more](https://www.cbc.ca/news/canada/toronto/industrial-accident-york-region-tow-truck-death-1.6780204)

For more details, you can click on the links provided with each summary.
User input: Summarize the first article
Assistant: The first article is titled **"A short history of airport gold heists (because Toronto's isn't the first)"** from "Canada - CBC News". It discusses a recent heist that took place
 at Toronto's Pearson International Airport, where $20 million worth of gold and other valuables were stolen. This event is compared to similar high-profile historical heists at airports, highlighting
 recurring vulnerabilities in airport security regarding valuable shipments. The article provides context and details on how such heists have occurred before in different parts of the world.

</code></pre></div></div>

<p>Since each request is pretty expensive when the OpenAPI doc is included, I didn’t try it much. But I learned a few things:</p>

<ul>
  <li>Good documentation is really important. The LLM doesn’t include some parameters if you don’t clarify why they’re needed. I think that’s actually a good thing, since good documentation is not only for machines, but also for other people.</li>
  <li>Structured output is not always enforced. The JSON schema also needs to be in the system prompt to make it more reliable.</li>
  <li>APIs need to be designed around the potential use cases. For example, the API that gets articles from a feed doesn’t take a time range, since the UI of RSS Brain uses infinite scroll. So it’s hard for the LLM to get the articles in a specific time range, for example, the last day.</li>
</ul>

<p>But overall, I think it’s safe to say that, with an additional authentication flow that can even be added by a third party, it’s easy to integrate any OpenAPI server as a tool for an LLM agent without needing any new protocol.</p>

<h2 id="more-about-protocol">More About Protocol</h2>

<p>I want to add more details about choosing OpenAPI as the protocol. The obvious reason is that it has been around for a long time and lots of tools and frameworks support it. But for RSS Brain, I only have a gRPC server. I implemented the gRPC server with my own library <a href="https://github.com/wb14123/scala2grpc">scala2grpc</a> and really like it. However, a downside of gRPC is that the serialized messages are not human readable, so it’s hard to define a generic call tool structure. It’s also more difficult to call the API without generated gRPC code. So in the end, I used <a href="https://github.com/grpc-ecosystem/grpc-gateway/tree/main">grpc-gateway</a> to proxy the gRPC server as an OpenAPI server. The code is released on <a href="https://github.com/wb14123/rss_brain_grpc_gateway">GitHub</a>.</p>

<p>grpc-gateway has a tool to auto-generate OpenAPI structured documents, if you have the necessary comments in the gRPC protobuf files. So I added the feature in scala2grpc to add comments based on Scaladoc and it turned out wonderfully.</p>

<p>With all that said, I still think gRPC has its advantages because it supports bi-directional streaming. The proto definition is also easier for humans to understand, which is a plus for LLMs as well. Maybe with some more work and by leveraging things like <a href="https://grpc.io/docs/guides/reflection/">gRPC reflection</a>, we can also define a generic tool calling action with gRPC in the agent.</p>

<h2 id="future-work">Future Work</h2>

<p>This is only an experiment, so there is a lot of work left to do to make it better.</p>

<p>The first one is to have a better authentication flow, so that the auth URL can give the instructions to the LLM agent without the user copying them. This is not hard to do, but it makes the flow much more secure.</p>

<p>Another thing to try is to not include the whole OpenAPI doc in the prompt at first. Instead, only include a summary or description of it, and only give more details when the LLM agent asks for them. This should save a lot of cost.</p>
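
<p>A minimal sketch of what that could look like, assuming each tool has a short summary next to its full OpenAPI doc. The <code class="language-plaintext highlighter-rouge">ToolDoc</code> and <code class="language-plaintext highlighter-rouge">LazyDocs</code> names and their fields are hypothetical and only for illustration; the current script doesn’t have them.</p>

```scala
// Hypothetical structure: a short summary plus the full OpenAPI document.
final case class ToolDoc(name: String, summary: String, openApi: String)

object LazyDocs {
  // The text that goes into the initial prompt: one short line per tool.
  def summaryPrompt(docs: Seq[ToolDoc]): String =
    docs.map(d => s"- ${d.name}: ${d.summary}").mkString("\n")

  // The full OpenAPI doc, looked up only when the agent asks for a tool by name.
  def details(docs: Seq[ToolDoc], name: String): Option[String] =
    docs.find(_.name == name).map(_.openApi)
}
```

<p>The agent would then need one more action, next to <code class="language-plaintext highlighter-rouge">callTool</code>, that lets the LLM request the details of a tool before calling it.</p>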

<p>There can also be more validation for security purposes. For example, when calling a tool or asking the user to open an auth URL in the browser, verify that the hostname matches what we have defined in the tools.</p>
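
<p>The hostname check could be sketched like this. The <code class="language-plaintext highlighter-rouge">ToolServer</code> case class and its <code class="language-plaintext highlighter-rouge">baseUrl</code> field are simplified stand-ins I made up for the example (only <code class="language-plaintext highlighter-rouge">authUrl</code> appears in the script above), so treat it as an illustration rather than the actual implementation.</p>

```scala
import java.net.URI
import scala.util.Try

// Hypothetical, simplified stand-in for a tool definition.
final case class ToolServer(baseUrl: String, authUrl: Option[String])

object HostCheck {
  // Lower-cased hostname of a URL, or None if it cannot be parsed or has no host.
  def hostOf(url: String): Option[String] =
    Try(new URI(url)).toOption.flatMap(u => Option(u.getHost)).map(_.toLowerCase)

  // All hostnames that appear in the configured tools (API and auth URLs).
  def allowedHosts(tools: Seq[ToolServer]): Set[String] =
    tools.flatMap(t => hostOf(t.baseUrl).toList ++ t.authUrl.flatMap(hostOf).toList).toSet

  // True only if the URL's hostname is one of the configured hostnames.
  def isAllowed(url: String, tools: Seq[ToolServer]): Boolean =
    hostOf(url).exists(allowedHosts(tools).contains)
}
```

<p>The agent would run <code class="language-plaintext highlighter-rouge">HostCheck.isAllowed</code> on every URL the LLM produces, both before calling a tool and before showing an auth link to the user, and refuse anything that doesn’t match.</p>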

<p>It’s also worth trying other protocols like gRPC since it supports bi-directional streaming, which may work better in some use cases.</p>]]></content><author><name></name></author><category term="machine learning" /><category term="LLM" /><category term="MCP" /><category term="OpenAPI" /><category term="gRPC" /><summary type="html"><![CDATA[Model Context Protocol (MCP) is adopted by more and more people as a way to integrate tools to LLMs easily. However, I find it unintuitive and unnecessarily complex. So in this article, I’ll explore how to use existing OpenAPI servers as tools of LLMs instead of writing the functions in a completely new protocol. This can potentially become a simpler standard that only needs implementing an additional authentication flow (if authentication is needed).]]></summary></entry><entry><title type="html">Travel Back To China: 2025 Edition</title><link href="https://www.binwang.me/2025-02-28-Travel-Back-To-China-2025.html" rel="alternate" type="text/html" title="Travel Back To China: 2025 Edition" /><published>2025-02-28T00:00:00-05:00</published><updated>2025-02-28T00:00:00-05:00</updated><id>https://www.binwang.me/Travel-Back-To-China-2025</id><content type="html" xml:base="https://www.binwang.me/2025-02-28-Travel-Back-To-China-2025.html"><![CDATA[<p>As I mentioned in the <a href="/2025-01-13-A-2-Year-Reflection.html">last year-end retrospective blog</a>, I traveled back to China again this year (to be more accurate, at the end of 2024). I migrated to Canada just before Covid, which is why I only started going back <a href="/2024-03-19-Travel-Back-to-China.html">last year</a>, after the Covid measures were all lifted. I think it’s worth a blog post every time I go back there, since China, as the country where I spent most of my life, not only made me who I am today, but also still has so many connections with me. Most of my friends and extended family members are still there. 
I also find going back every year instead of living there all the time makes me a better observer. I can notice the changes more easily since I’m comparing things from year to year instead of from day to day. I can notice the things that people living there all the time may have already gotten used to. I also have more motivation to observe since the short time of visiting is so valuable to me. So here it is, the 2025 edition of travelling back to China.</p>

<h2 id="toronto">Toronto</h2>

<p>My wife and I planned the travel dates in October. We decided to go back during the Christmas and New Year holidays instead of Chinese New Year, mostly because I could take advantage of some public holidays and wouldn’t need to take that many days off. It’s also easier to coordinate with co-workers, since most people are away and it’s hard to do projects anyway. However, the flights are really expensive during the holiday season, and most of the tickets required multiple stopovers. Then we thought, why not take the opportunity to visit some places on the way? We stopped in Tokyo for a few days last year when we travelled back, so why not do something similar? We’ve always wanted to visit Europe. So this time, we went big with our plan: stay in London on the way to China, and in Paris on the way back to Toronto. The whole trip looked like this: Toronto -&gt; New York -&gt; London (stay) -&gt; Shanghai -&gt; Zhengzhou -&gt; My Hometown. Then for the return trip: My Hometown -&gt; Beijing -&gt; Paris (stay) -&gt; New York -&gt; Toronto. Crazy, right? But travelling to Europe with hotels as the only additional expense was such an attractive idea, and the trip last year gave us lots of confidence in travelling with a baby.</p>

<p>Before I went back, I had just finished reading <em>Other Rivers</em> by Peter Hessler. Then I started to read <em>Country Driving</em> again, another book of his written more than 10 years ago. Peter Hessler is an American, and most of his books are about his experiences living in China, which I absolutely love reading. It’s an interesting coincidence that his coverage of China happens to fill the gaps in my own experience: he started his China experience in 1996, about 4 years after I was born. Then he left China in 2010, one year after I started my university life and started to experience and think about Chinese society by myself. In 2019, the year I moved to Canada, he went back to China again, so his records helped me fill the gap since then. It’s a shame he had to leave China again in 2021 (or lucky, I guess, considering the Covid measures in China after that). It’s really refreshing to read the perspective of someone from a different culture. Like I said above, it makes the person a better observer. His books also inspired me to write blogs like this to record things in China. Reading the books, it feels like China is changing so fast. Even his newest book, <em>Other Rivers</em>, only covers the early stages of China’s Covid measures, which already seems outdated after only 2 years. It seems to be impossible to grasp China as a whole. On that level, my blogs are no different: they’re just one person’s observations of a snapshot of China.</p>

<p>Other than reading the books, I didn’t really have much of a plan for the trip. Work was busy because I needed to wrap things up before leaving for vacation. It was flu season and, combined with vaccines, my daughter gave the whole family a hard time. We only started to pack two days before leaving, and only had a very rough plan for our travel in London. But that’s enough: a detailed plan is mostly useless when travelling with a baby. It’s better to adjust based on the situation and plan accordingly, which I’m already very good at.</p>

<h2 id="london">London</h2>

<p>After transferring through New York, we arrived in London in the morning. I was very excited since it was my first time in Europe. We had booked a hotel in the core of London so that we could walk to most of the destinations. We took the train from the airport to the hotel. The train stopped before reaching our destination, which was annoying but understandable to me: I know that some trains don’t run the full route and turn back early. We took the next train that came, only to find out it was on a different line. It turned out it was really a train station rather than a subway station, so it serves trains going in multiple directions. When we realized it one station later, I overheard that someone else had also taken the wrong train, and that person was also visiting London for the first time.</p>

<p>Anyway, despite this small bump, we still arrived at the hotel very early in the morning, and my daughter had already fallen asleep in the stroller. Since it wasn’t time for check-in yet, we left the luggage at the hotel and headed to the British Museum. We planned to visit the museum on the first day since the time spent there can be flexible: we knew it’s impossible to see the whole collection in a single day, so we would just take whatever time was available. Luckily my daughter slept the whole morning, and we ate some (awful) food in the museum after she woke up. It’s really a rare opportunity to be able to see so many valuable collections, especially the ones from very early civilizations like Egypt and Assyria, which I always fancied, as well as some rare collections from China. Without realizing it, we spent much more time than we had planned, until it felt like a rabbit hole and it was clear I couldn’t see everything I wanted to see.</p>

<p>There were not so many people on the way when we walked from the hotel to the museum, and I thought that’s just how London is. However, when we went back, we found the streets were packed with people. The shops, restaurants, malls, trees, streets, basically everything, all had very beautiful decorations. The holiday spirit was really magical. I’ve never seen anything like that. Especially with the European style narrow roads, the whole city was like a gigantic Christmas market. At night, we walked around the main shopping areas and the lights were so gorgeous. It felt really magical and unreal.</p>

<p>Other than the holiday vibe, I also enjoyed the culture very much. There are so many civilizations with long histories in the world, but the only one I’m very familiar with is Chinese culture and its influence in East Asia. I’ve been to some other countries like the US and New Zealand, not to mention Canada, but those countries are all colonies where the newcomers don’t have a long history there while the native people’s culture has long been overshadowed. But in London, I got the opportunity to experience a totally different culture that has a long history and has had great influence on the world. I visited places like the Tower of London, Buckingham Palace, the National Gallery and so on. But I had the most stunning experience at Westminster Abbey: I didn’t do much research for the trip, so before I went there, I just knew the architecture is beautiful and it’s one of the must-see sites in London. But when I entered, I was shocked to learn that the burial sites are actually <strong>in</strong> the abbey, just under and beside your steps. This burial style is unthinkable in Chinese culture. The sculptures are beautiful and gorgeous. It felt strange that there are so many great names buried in such a small space. It makes the grand church feel crowded. It’s like walking into a Hollywood party with all kinds of stars, except the people there are far more famous and important: the kings, queens, scientists, writers and so on. The building truly represents the country’s culture and glory: the custom, the history, the religion, the art, and the people.</p>

<p>We hit another similar bump on the way back: we took the wrong train again when we went to the airport. There were so many people in the airport and it was a total mess. Fortunately we had left early for the airport and were able to take a priority line with my daughter, so we didn’t miss our flight. With that and a really long trip, we finally arrived in China.</p>

<h2 id="hometown">Hometown</h2>

<p>Nothing reminds you how fast time passes more than a baby’s growth: last year when we went back, our daughter couldn’t even crawl. This year, she is running all over the place and speaks so many words. Watching her learn new things every day and try to talk and interact with all the family members brings so much joy to everyone. However, things were not so smooth at the beginning: when we first arrived, my daughter had a mild fever. It was nothing too bad and she recovered after just one day. But when we took her to my mother’s place, she was still very tired from the disturbed sleep schedule and jet lag. It took some serious crying before we could put her to sleep. The crying made my mother and my grandparents very worried. It also made me wonder if it’s worth spending so much time, effort and money to go back just to stay for less than 3 weeks. But after my daughter had enough rest and woke up, she played with the family so happily. On each of the visits, she tried to remember and babble the words for everyone, like grandmother and so on. During one of the visits, my grandfather told us that my uncle, who is in the US, had cancelled his plan to go back this year. He said that with anger, which is normal since he is a serious man with a bad temper. But there was also sadness on his face that he tried but failed to conceal. It was at that moment that I decided I would go back every year as long as there are no events like a pandemic or a war.</p>

<p>Talking about a war seems exaggerated. But look at Ukraine and Palestine: this is the world we are living in. During my time there, a possible war between China and Taiwan was brought up as a chat topic multiple times on different occasions. Even when I took a taxi, the driver was watching a video on Douyin (the Chinese version of TikTok) about attacking Taiwan (what’s wrong with people nowadays, watching videos while driving??). The creator said in a very excited tone that if China attacks Taiwan, the war must be finished in one week no matter what the cost is, otherwise China will be screwed. The taxi driver agreed after watching the video, then out of the blue, started to criticize Xi, talking about Xi’s poor education background and how his constitutional amendment will make him infamous forever. Anyway, I think Xi wouldn’t start a war as long as he has strong control of China’s political power. After all, why would he risk the power? It would only happen if the economy is so bad that there is civil unrest, or there are strong challengers to his rule (no one is on the horizon right now), or his old age blinds his judgement. But I also heard the rumor that the war will happen before 2027, the 100th anniversary of the Chinese PLA, which made me less certain about my thoughts. After all, it wouldn’t be the first irrational decision Xi has made. That’s how it works in China: without trustworthy media and transparent political practices, information is transferred through rumors and discussions are based on guesswork.</p>

<p>Most conversations like this happened at the dining table. I had high expectations about the food before I went back. Some foods in my hometown are very hard to find in Toronto or even in other parts of China, and they are so delicious in my memory. However, the food was underwhelming this time. Maybe my expectations were too high. Back when I was in high school, the same street had two high schools less than 500 meters apart. The schools had a tight schedule: classes started before 8:00 in the morning and ended at about 9:00 at night. There is a larger break at noon because Chinese students usually take a nap then. But there is only a very small break between afternoon and evening. So naturally, the street became a place with lots of small restaurants and street food. It was packed with students when the afternoon classes were over, and the food stalls would occupy the roadway. It was a mix of chaos and relaxation, which is so hard to forget. I had almost every dinner there during the 3 years of my high school life. I would buy some video game magazines after dinner, or rent some books, or skip dinner altogether and spend the time playing video games in an Internet cafe.</p>

<p>It’s telling that all the industries behind my after-dinner activities have (almost) vanished. Let’s first address the weird one if you’ve never heard of it: book rental. Yes, it was a very popular industry on that street back then. It’s very much like movie rental. The books were mostly pirated novels from the Internet. The quality varied but was usually on the lower side. They were perfect for killing time during boring classes, which I did endlessly. But once mobile phones became popular among students, all the book rental stores were just gone, since the most popular books in those stores are on the Internet anyway. Internet cafes are not so popular anymore but have not totally vanished: even though everyone has their own computer nowadays, they’re still useful for friends to play video games together. Video game magazines are mostly the same story: there is less business but they are not totally gone. But the one I read the most in the past has closed down. I only found out many years later and it made me very sad.</p>

<p>Actually, not only have the after-dinner activities vanished, but my high school itself has vanished as well: it moved to a suburb-like location and the students must live there. The original site became a middle school.</p>

<p>On the other hand, street food has lasted much longer. So like before, I went there again, only to find that the food stalls were all hidden in a nearby street. Curious, I asked one of the vendors why, and they told me there was a government inspection for the “clean city” award. This is typical campaign-style governance: when an event like this happens, there is a campaign across the city that brings very absurd policies and makes normal life very inconvenient. The policies are also unsustainable and are reverted soon after the event has passed. It has happened so many times in the past that I didn’t feel surprised. But I also found there were no high school students buying food, which was kind of strange since it was the time when afternoon classes were over. I asked again and was told the remaining high school had adopted “military style management”, so the students were no longer allowed to go out during the break.</p>

<p>“Military style” schools are trending in my hometown now. I really hate it. I think it’s just that many parents don’t want to take responsibility for raising their kids, and just throw them into a school with strict rules. Chinese people value education very much. But a lot of them just like buying education, and think that’s enough. It’s as if, once they’ve spent enough money on the kids and the kids still didn’t turn out well, they’ve done everything they can and it’s not the parents’ fault. However, they don’t realise that parents are a big part of the education and no money can buy that part. And their standard of “good kids” is simple: have good scores in tests and be obedient. This explains why military style schools are so popular: the students spend little time at home so that the parents feel like they have less responsibility, and it sure sounds effective for raising obedient kids. Not all schools have adopted that approach, so the after-school class business is also popular, even after the government abruptly banned the practice a few years ago. I even saw some “AI study rooms”, which are the weirdest AI hype business I’ve ever seen, and one of them spells “AI” as “Ai”. I asked my friends what they’re about, and apparently it’s just a place where parents can drop off their kids after school, so that someone can monitor them while they finish their homework. Education is surely one of the reasons for my immigration to Canada: I don’t want my children to experience the kind of education I had, which has somehow even managed to get worse.</p>

<p>I also noticed that the security for the schools seems to have been stepped up. There are barriers and a police presence in front of the school gate when parents pick up their kids after school. Maybe it’s partly because of some random attacks that happened across China not long before, with some cases targeting students.</p>

<p>After I started the hobby of retro computing, I started to buy more from the Chinese used market since it’s much cheaper. This time I traveled with a Thinkpad X220 instead of a Thinkpad X1 Extreme, mainly to test how it works when travelling, but also as an opportunity to buy some parts for it. I upgraded it with more RAM and bought a new battery for it. It turned out great: not only can it handle everyday tasks like Internet browsing without any problem, I can also play some light games on Steam. Its small size makes it very easy to use on the road. Like I said in the last blog post about my year-end retrospective, I’ll write a more detailed blog about this laptop.</p>

<h2 id="beijing">Beijing</h2>

<p>Last year when I went back, I only stayed in Beijing for one night. This time, I stayed slightly longer: one day. I took the high speed train to Beijing, which is very common in China. It’s fast, cheap and much more comfortable than flying. I really wish Canada could build some high speed trains so that we don’t need to go to the airport just for a one hour flight.</p>

<p>I was able to meet with my cousin and some friends there. The lunch I had with my cousin was unexpected: we found a hotpot place that was supposed to be a very popular brand, which both of us had eaten at multiple times in Beijing. But after ordering the food, we found out it was a knock-off: instead of XX Hotpot, it was actually <strong>Authentic</strong> XX Hotpot. Unsurprisingly, the food was terrible, which is a shame because I don’t have many opportunities to eat in Beijing anymore. But it’s also hilarious that two people who each stayed in Beijing for almost a decade can still fall for that kind of scam.</p>

<p>Like me, my cousin started to hate staying in Beijing after a few years. I think it’s a common thing among people in Beijing. Unlike in southern China, no other city in northern China is near the level of Beijing. So naturally, lots of people in northern China go there for opportunities. However, like I said in last year’s travel back to China blog, Beijing is a city that doesn’t care about normal people and can suck the energy out of them. Because I got to stay in Beijing for longer this time, I experienced it more. After I left Beijing, I sometimes wondered why I didn’t visit more places when I was there, especially since I worked from home during my last two years there. The visit to Beijing this time reminded me why: Beijing is a city that is so large, with places so spread out. It has an excellent subway, so people use it most of the time. My day there was mostly spent underground in the subway, where you cannot really see the city. Daily life like this makes people feel like they are just a part of a big machine. After many years in Beijing, the city was still like fragments to me. Only after I started to commute by bike, not long before I left Beijing, did I start to piece together the fragments and get a better sense of the city.</p>

<p>The subway may be the most used means of transportation in Beijing, so it’s no surprise that there are a lot of security checks in the stations. Besides the mandatory security checks at the entrances, police also check IDs at random from time to time, which I encountered just after I arrived in Beijing. I ignored them and just walked around while they were checking someone else’s ID, and nobody seemed to care. I also encountered two security checkpoints on the road during a short taxi trip. One was on the inner city highway and took up a full lane. The other was at the exit and blocked all the cars. It was evening rush hour, but the people running the checkpoints didn’t seem to care about the congestion they caused or worsened.</p>

<p>The gap between lifestyles in Beijing can be dramatic. Inside the second ring road, there are lots of ancient and traditional buildings left. The important government institutions are also there. This is the Beijing that most people from other places have in mind. However, in the real world, few people live inside the second ring road. There are lots of companies and universities between the third and fifth ring roads, an area that lacks some character but feels more like a modern big city. Lots of people live outside the fifth ring road. At a glance, many places there are no different from an average small town in northern China. There is usually a big mall nearby for shopping and eating. Other than that, you mostly need to take the subway to get anywhere. I read the award-winning sci-fi novel <em>Folding Beijing</em> back when I was in Beijing. It depicts a fictional society where three classes are strictly separated. I didn’t think much of it when I read it. But visiting Beijing again, taking the subway underground all day, I suddenly remembered it and found that its fiction is real on so many levels.</p>

<p>The night in Beijing was still cold in the winter. But it was nice to have dinner with old friends. We met at a place that used to be the heart of China’s IT industry. The IT companies have since moved to a very remote area, and lots of big malls now occupy the place. I remember it as a lively area, but it was kind of quiet that night. Maybe that was because it was a weekday. I was told Covid also hit the businesses there hard. After dinner, I slept over at a friend’s place. He owns a home but is renting now because his wife was temporarily transferred to another location for work. It reminded me of the time when I was renting in Beijing. During my first months in Beijing, I was searching for a place to live with a friend. We took the subway to a potential place in a very rural-looking area. When my friend heard the landlord would pick us up from the subway station in a van, the kind that kidnappers usually use in movies, he was so afraid that he tried to convince me to turn back. Nevertheless, we rented that place in the end. When another friend first came to Beijing, I was about to move away, so I gave him the place. He said it was so remote that he felt like he had already left Beijing the moment he arrived. I still have fond memories of the subway station there: standing in a rural place with dirt, grass, strong wind and sometimes snow, without any obstacles, you can see the elliptical station from very far away. It is elevated high above the ground with the subway track going through it. The station is very new and shiny. It’s spacious and bright inside. During the early morning and late night, with the track and bridge hidden in the dark, the only thing you can see is the illuminated station floating in the air, with trains flying in and out slowly. It’s really like a space station on a newly developed planet. It’s the magic node connecting this area to the outside. 
People rushed in during the morning and packed the train so full that sometimes the station staff needed to push people onto the train. Then people rushed out during the evening, when the moon took over the sky and climbed above the station. When I had dinner with that friend again and talked about it, he told me the place has been developed a lot. There are lots of new buildings and businesses, partly because the IT industry in Beijing has moved in that direction. I wondered if the station can still be seen from far away.</p>

<h2 id="paris">Paris</h2>

<p>Time passes so fast in China, and it was time to leave before we had even adjusted. We planned Paris as a detour on the way back. I always knew Paris was a nice city, and France has had so much influence on the world: its art, fashion, food, culture, politics and so on. But it was not until I saw many scenes of Paris while watching the Olympic Games last year that I started to want to actually visit it someday. I never thought the day would come so soon.</p>

<p>After our experience in London, we carefully looked up the train we needed to take from the airport to the city. This time we didn’t take the wrong train, but instead witnessed a scam/robbery. When the train was approaching a station, three men wanted to get off. I stood with the stroller where my daughter was sleeping, and they squeezed past me. In retrospect, I felt something was not quite right because they were kind of rude. But this was another country, so I didn’t know what was normal. There was a couple, maybe in their sixties, standing not far away from me. When one of the men passed the couple, he dropped a coin and started to look for it by lifting their suitcases, and the other two men joined in to help. Soon the train stopped at the station and the door opened. Suddenly, the men started to grab the bags and suitcases by force. Luckily, the couple protected their luggage well. Before the robbers could grab anything, the commotion started to attract attention, so they were forced to give up and run. By the time I realized what had happened, they were already at the door. I thought: “Shit! My suitcase is at the door!” But maybe because the suitcase was too big and didn’t seem to be worth much, they just glanced at it and escaped without a second thought.</p>

<p>I had heard Paris has a petty crime problem, but witnessing it just after arriving shocked me and really put me on guard. Luckily we didn’t have much trouble getting to our hotel. And we only saw one similar incident in the following days, when two policemen were chasing another man near the Eiffel Tower.</p>

<p>Our hotel was in a prime location: just across the Seine from Notre-Dame. We followed a plan similar to the one in London and visited the Louvre Museum on the first day. It’s hard not to compare it with the British Museum. Even though I heard the British Museum has a much larger collection than the Louvre, no one can deny that the architecture of the Louvre is much more majestic. I would say the building itself is one of its most amazing exhibits.</p>

<p>The weather was not so good there. It was raining on the day we arrived, then turned very cold on the second day. We visited Notre-Dame very early on the second morning, and just after we came out, it started to rain very heavily. The wind was so strong that it broke my umbrella, and without it I was soon soaked. I was desperate and frustrated when it suddenly started to snow! I almost wanted to give up on the day and go back to the hotel. But the rain and snow stopped pretty soon, so we continued with the plan to walk around the city. After we reached the Arc de Triomphe and found a place to eat lunch, the sun started to come out. We headed to the Eiffel Tower after lunch. The clouds soon cleared up and the city was so bright. The good weather continued through the day. When we walked across the Seine in the evening, the sunset was so beautiful: the water reflected the red sun, with a rosy gold color on all the grand buildings nearby: the Conciergerie, Sainte-Chapelle, Notre-Dame and so on. Then we heard a soft song from a street singer on the bridge. At that moment, I thought Paris really is the capital of romance.</p>

<p>Paris is truly beautiful. During my days there, my amazement only increased as I explored more and more of the city. I’ve never seen such a large area with so many beautiful buildings in such a consistent style. It’s like being thrown into a totally different world, and it’s really a dreamlike experience. Yet it also has so much history. When I visit Chinese cities, I can associate lots of historical events and figures with the places I visit, which gives me a much deeper feeling. Few foreign cities have had that effect on me (maybe I would count Kyoto as one of them). But in Paris, even though I have very limited knowledge of France, I could still associate lots of historical events and figures with the places I visited, because there are just so many of them and it’s hard not to know, or at least have heard of, some of them as long as you’ve studied basic history and literature in middle school. That alone makes the city so much richer than just a “beautiful city”.</p>

<p>Overall, the experience in Paris was very positive, even considering the incidents I witnessed. The food was also surprisingly good. Somewhat like its history, the richness of the flavors is something I rarely get from non-Chinese food. The only thing I wish could have been better is how long I stayed there: three and a half days is just too little for a city like Paris. But it was already the most we could do for a vacation with such a packed schedule. And considering it was only a detour, which means we didn’t pay more for the flights compared to flying back directly, we couldn’t ask for more.</p>

<h2 id="new-york">New York</h2>

<p>We transferred through New York again when we flew from Paris to Toronto. We spent lots of time at the Paris airport because the check-in agent couldn’t figure out our complex passports, visas and routes. The plane was also a little late, so the time left for transferring in New York was very tight. I’ve had so many connecting flights, but this time was different: we needed to leave the boarding area when the plane arrived and go through security again to re-enter. More than one bag triggered an alert during the security check, maybe because of the baby food and snacks we brought. So I was asked to do a full body check, which certainly didn’t help with the tight schedule. I was asked multiple questions to confirm my consent, only to find out it’s the kind of search that is very common in some Chinese airports and train stations, which almost every passenger has to go through.</p>

<p>After packing the bags again following the search, I was so tired from such a long trip that I miscalculated the time zones and thought we could barely make the next flight. So we ran to the gate, only to find out we had an additional hour. We didn’t pick up our checked luggage, because it is usually carried through to the final destination. But it turned out we needed to do so for this transfer. We only learned that after we arrived in Toronto and couldn’t find our luggage. I filled out a form to have it shipped to my home the next day, and went back without those suitcases.</p>

<p>I was so tired after coming back, and I still dreamed of Paris for a few days afterwards. However, the trip turned out very well, and it makes me more determined to travel back every year. I also got some good rest and was able to find the time to write this blog post. Thinking about the trip, it’s almost like living two lifestyles: in Canada, time is slow, long and peaceful. I spend most of my time on work, research and my own small family instead of socializing. In China, time is fast. I spend almost all my time connecting with my extended family and friends, eating out, being on the road and so on. Hopefully I can continue to go back and record it every year for the foreseeable future.</p>]]></content><author><name></name></author><category term="China" /><category term="travel" /><category term="life" /><category term="Paris" /><category term="London" /><summary type="html"><![CDATA[As I mentioned in the last year-end retrospective blog, I traveled back to China again this year (to be more accurate, at the end of 2024). I migrated to Canada just before Covid, which is why I only started to go back since last year after the Covid measures were all lifted. I think it’s worth a blog post every time I go back there since China, as a country I spent most of my life, not only made what I am today, but also still has so many connections with me. Most of my friends and extended family members are still there. I also find going back every year instead of living there all the time makes me a better observer. I can notice the changes easier since I’m comparing the things from year to year instead of from day to day. I can notice the things that people living there all the time may have already gotten used to. I also have more motivation to observe since the short time of visiting is so valuable to me. 
So here it is, the 2025 edition of travelling back to China.]]></summary></entry><entry><title type="html">Replace A Dead Node in My High Availability Cluster</title><link href="https://www.binwang.me/2025-02-01-Replace-A-Dead-Node-in-My-High-Avaliable-Cluster.html" rel="alternate" type="text/html" title="Replace A Dead Node in My High Availability Cluster" /><published>2025-02-01T00:00:00-05:00</published><updated>2025-02-01T00:00:00-05:00</updated><id>https://www.binwang.me/Replace-A-Dead-Node-in-My-High-Avaliable-Cluster</id><content type="html" xml:base="https://www.binwang.me/2025-02-01-Replace-A-Dead-Node-in-My-High-Avaliable-Cluster.html"><![CDATA[<p>In my previous blogs <a href="/2023-03-13-Infrastructure-Setup-for-High-Availability.html">[1]</a><a href="/2023-11-28-Introduce-K3s-CephFS-and-MetalLB-to-My-High-Avaliable-Cluster.html">[2]</a>, I’ve introduced my high availability cluster setup. It works really well: when 1 of the 3 nodes is down, the service either continues to be online, or can be recovered rather quickly (in the case I set service replica to 1 to avoid wasting the resource). However, in the beginning of this year, one node went down not because of regular updates or temporary shutdown for maintenance, but because its system disk died. While it’s annoying to replace the disk and bring it back, it’s actually a good opportunity to verify a dead node can be replaced in my setup. So I will note the steps down in this article. This will be a short one but it shows how easy it is.</p>

<h2 id="what-was-lost">What Was Lost?</h2>

<p>The dead disk is the system disk. It has the OS, but also has the data for CockroachDB and ElasticSearch. However, since the data for CockroachDB and ElasticSearch is replicated across the cluster, it can be recovered from other machines.</p>

<p>The machine also has a separate disk for CephFS, but that disk was not lost. The data in CephFS is also replicated, so it should be recoverable from other machines even if that disk had died. But that may need additional setup, like changing the disk uuid in Rook’s Kubernetes manifests.</p>

<h2 id="why-not-recover-from-backup">Why Not Recover From Backup?</h2>

<p>First of all, I don’t back up that often because I don’t feel the need, considering the data is replicated. Another reason is that I originally set up this machine for <a href="/2021-09-19-Personal-ZFS-Offsite-Online-Backup-Solution.html">offsite online backup</a> and later repurposed it for this HA cluster. I want to change the secure boot setup because the threat model is different now, so it doesn’t need such a complex boot setup, which is not supported very well by mainstream Linux without TPM 2.0.</p>

<p>Since the data can all be recovered from other machines automatically, it would be easier to just install a fresh OS and some basic infrastructure so that all the service deployments and data can be automatically recovered. This also simulates a dead node situation, so that I have more confidence for recovering from such failures in the future.</p>

<h2 id="how-to-recover">How to Recover?</h2>

<p>Okay, here we are for the actual recovery steps. It’s very simple:</p>

<p>First, install the OS. Configure basic things like network IP address, ssh, etc. Install things like <code class="language-plaintext highlighter-rouge">prometheus-node-exporter</code> if you are using it on other machines.</p>

<p>The next step is to let the node join our Kubernetes cluster. Before that, we can remove the old dead node from the Kubernetes cluster with the command <code class="language-plaintext highlighter-rouge">kubectl delete node ...</code>.</p>
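
<p>As a rough sketch of that cleanup, assuming the dead node was registered under the placeholder name <code class="language-plaintext highlighter-rouge">node3</code> (use whatever name <code class="language-plaintext highlighter-rouge">kubectl get nodes</code> reports for the dead machine):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the nodes; the dead one should show up as NotReady
kubectl get nodes -o wide
# Remove the stale node object (node3 is a placeholder name)
kubectl delete node node3
</code></pre></div></div>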

<p>Then install k3s: Copy the config file under <code class="language-plaintext highlighter-rouge">/etc/rancher/k3s/config.yaml</code> from another machine and adjust the node IP and network interface config. Make sure the config has something like <code class="language-plaintext highlighter-rouge">server: https://...:6443</code> so it will join the existing cluster instead of creating a new cluster. Check the k3s versions on other machines by using <code class="language-plaintext highlighter-rouge">kubectl get nodes -o wide</code>. Then install it with <code class="language-plaintext highlighter-rouge">curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.1+k3s1 sh -</code> assuming <code class="language-plaintext highlighter-rouge">v1.31.1+k3s1</code> is the version.</p>
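
<p>Before running the installer, it may be worth double-checking the copied config. The snippet below is only a sketch: <code class="language-plaintext highlighter-rouge">check_k3s_join_config</code> is a hypothetical helper name, and the check assumes the <code class="language-plaintext highlighter-rouge">server</code> entry sits on a single line and uses port 6443 as in the example above.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: succeeds only if the given config file contains a
# line like "server: https://HOST:6443", i.e. it names a cluster to join
check_k3s_join_config() {
  grep -Eqs '^server:.*https://.+:6443' "$1"
}

# Usage: report whether the copied config would join the existing cluster
if check_k3s_join_config /etc/rancher/k3s/config.yaml; then
  echo "server entry found: the node should join the existing cluster"
else
  echo "no server entry found: fix the config before installing k3s"
fi
</code></pre></div></div>

<p>The node IP and network interface entries still need to be adjusted by hand, since their values differ from machine to machine.</p>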

<p>After k3s is installed, the k3s service should be enabled by default and the node should join our cluster automatically. If the hostname and IP address are the same as the dead machine, the Kubernetes cluster should automatically reschedule the services onto this machine. If there are some failed containers, check the log to see if it’s because the local directory for the storage is missing. In my case, I need to create the local directory for CockroachDB and ElasticSearch, and set the owner to 1000 for ElasticSearch.</p>
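
<p>The directory fix can be sketched like this. The paths <code class="language-plaintext highlighter-rouge">/data/cockroachdb</code> and <code class="language-plaintext highlighter-rouge">/data/elasticsearch</code> and the node name <code class="language-plaintext highlighter-rouge">node3</code> are placeholders for illustration only; the real paths are whatever the local storage definitions in your manifests point to.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List the pods scheduled on this node together with their status
kubectl get pods -A -o wide --field-selector spec.nodeName=node3
# Recreate the missing local storage directories (placeholder paths)
sudo mkdir -p /data/cockroachdb /data/elasticsearch
# ElasticSearch needs its directory owned by uid 1000, as mentioned above
sudo chown 1000 /data/elasticsearch
</code></pre></div></div>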

<p>Finally, we need to make sure CephFS is working. Make sure the <code class="language-plaintext highlighter-rouge">ceph</code> and <code class="language-plaintext highlighter-rouge">rbd</code> kernel modules can be loaded with <code class="language-plaintext highlighter-rouge">modprobe</code>. If so, add them to <code class="language-plaintext highlighter-rouge">/etc/modules-load.d</code> to load on boot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat /etc/modules-load.d/ceph.conf
ceph
cat /etc/modules-load.d/rbd.conf
rbd
</code></pre></div></div>]]></content><author><name></name></author><category term="Kubernetes" /><category term="infrastructure" /><category term="high availability" /><summary type="html"><![CDATA[In my previous blogs [1][2], I’ve introduced my high availability cluster setup. It works really well: when 1 of the 3 nodes is down, the service either continues to be online, or can be recovered rather quickly (in the case I set service replica to 1 to avoid wasting the resource). However, in the beginning of this year, one node went down not because of regular updates or temporary shutdown for maintenance, but because its system disk died. While it’s annoying to replace the disk and bring it back, it’s actually a good opportunity to verify a dead node can be replaced in my setup. So I will note the steps down in this article. This will be a short one but it shows how easy it is.]]></summary></entry><entry><title type="html">A 2-Year Reflection for 2023 and 2024</title><link href="https://www.binwang.me/2025-01-13-A-2-Year-Reflection.html" rel="alternate" type="text/html" title="A 2-Year Reflection for 2023 and 2024" /><published>2025-01-13T00:00:00-05:00</published><updated>2025-01-13T00:00:00-05:00</updated><id>https://www.binwang.me/A-2-Year-Reflection</id><content type="html" xml:base="https://www.binwang.me/2025-01-13-A-2-Year-Reflection.html"><![CDATA[<p>Here is another new year! I didn’t write any retrospective for 2023, so I’d like to combine it with 2024 together in this one since 2023 is a very important year for me. I planned to write it after I came back from <a href="/2024-03-19-Travel-Back-to-China.html">China’s travel</a> but never got the time. When I finally got the time it was already too late to write a year-end review. I’m staying in China again now (when I first started this article but I’ve been back when publishing it). 
While there is no work related thing to worry about when on vacation, I can use this time to review what happened in the past 2 years so it won’t slip through again.</p>

<h2 id="a-new-life">A New Life</h2>

<p>The most significant event of the past 2 years is that I had a daughter. Having a baby changed my life so much that it felt overwhelming at first. I always tried to keep my life simple and predictable, but a baby is the opposite of that. Before having the baby, I thought raising a baby was a science. While that is true on some level, it is also a kind of art. In theory, it’s about observing the baby and meeting her needs, like feeding her, changing the diaper and so on. But in reality, there is a lot of guesswork about what the baby needs when she is crying, and it’s very stressful if you cannot figure it out. And sometimes she just cries for no reason (e.g. baby colic). Also, sometimes you know what needs to be done, but it takes so much energy to do it. For example, in the first months, it’s really hard to get any good sleep, and that takes a big toll on everyday life and mental health.</p>

<p>However, after a few months, especially once the baby is more than 1 year old, it’s really rewarding. Once she could babble words, run all over the place, and show that she cares about me, I felt everything was worth it. It’s such a special feeling to know you are raising a new life and are responsible for it. Because of that, I feel like my life is more meaningful and I value it much more than before.</p>

<h2 id="the-old-project-rss-brain">The Old Project: RSS Brain</h2>

<p>On the other hand, my old project <a href="https://rssbrain.com">RSS Brain</a> continues to get improvements. I’m so happy about this project because I can put theory into practice after I learn something. And I continue to use it every day to get information from the Internet. Now I cannot imagine my Internet browsing without it. The feature set and backend were mostly stable in 2023. At the beginning of 2024, I started to rewrite the frontend from Flutter to web tech, mainly using htmx and Alpine.js, with Scala generating the html code on the backend. This was such an important decision for this project and it turned out perfectly. Not only does it feel native on the web platform, the code is also much cleaner since it’s mostly Scala now – the same as the backend. The performance is also much better since most of the html is rendered on the server, so the client needs little power compared to the Flutter one. Along the way I also developed some frontend libraries for the project’s needs and wrote some blog posts.</p>

<p>In the middle of 2024, I started to release the code of this project on a regular basis: I feel like the project is so cool that it would be a shame if only I could see the source code. There is still room for improvement on that side since a lot of documentation is still missing, and it uses some of my other open source libraries that may not be released very properly. It would be great to clean it up so it’s easier to build and run on a fresh machine.</p>

<p>There are also some UI and quality of life features that can continue to be improved. This would still be a project I continue to work on and release on a regular basis. Sometimes working on these small features is more like gardening, which makes me feel peaceful and relaxed when too many other things are happening in the life.</p>

<p>I may also add more AI features. I know AI is kind of a hype word now, but that was the plan from the beginning. The name RSS Brain is meant to suggest analysis features for the feeds so you can get more insights from them. Features it already has, like search filters and related articles, are examples of this. With the advance of LLMs, more things are possible, and I’m pretty excited about that. But before that, I need to find some proper infrastructure for it, which will be discussed in the following sections about databases.</p>

<h2 id="distributed-system-infrastructure">Distributed System Infrastructure</h2>

<p>With RSS Brain as a use case, I tried to set up my own infrastructure for a highly available system. It took shape in 2023 and the final version can be found in <a href="/2023-11-28-Introduce-K3s-CephFS-and-MetalLB-to-My-High-Avaliable-Cluster.html">this blog post</a>. It works so well that I didn’t change much in 2024. However, there is a very important part that’s still missing: upgrading the system components. Since the services and even some infrastructure like CephFS are running in containers, there is no single command you can run to upgrade all the packages. Some images can be so outdated that they pose security risks. I meant to write some tools for upgrades but didn’t find the time for it. I suspect I won’t have time for that in the coming year either, but I’m listing it here so that I don’t forget.</p>

<h2 id="in-search-of-databases">In Search of Databases</h2>

<p>Two things motivated me to search for a new database: the license change of CockroachDB and the need for vector similarity search. This work took lots of my time in the second half of 2024. For the first point, I talked about it briefly in the blog post <a href="/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html">Jepsen Test on Patroni</a>, which I’m very proud of. I may write another blog post about it too, so I will save the details for the future.</p>

<p>For the vector database, I think anyone that follows the AI trend recently would be all too familiar with that. There are 2 use cases for my projects: recommendation for RSS Brain using embeddings, and general RAG for LLMs.</p>

<p>I talked about the recommendation use case in the blog post <a href="/2023-11-14-Update-On-RSS-Brain-to-Find-Related-Articles-with-Machine-Learning.html">Update on RSS Brain to Find Related Articles with Machine Learning</a>: it’s now using Elastic Search for vector similarity search, but the performance is not so good with my less powerful cluster. So I’m searching for some other solutions that have a lower memory footprint. And if possible, use it for full text search as well so that I don’t need to run Elastic Search at all.</p>

<p>About general RAG for LLMs, I’ve been interested in neural network based machine learning since 2015. In my <a href="/2018-01-02-The-Year-of-2017.html">2017 retrospective</a>, I talked about the AI project I was working on. That’s when I read the paper <em>Attention is All You Need</em>, which is the foundation of today’s LLMs. I always wanted to develop an RPG game with human-like AI, and it finally seems to be technically possible. Along with some other tools I want to develop with LLMs in the coming year, a database that is capable of vector search is a must-have. I’ve already done lots of research in that area but don’t feel like I’ve reached a point where I can share it. So I’ll save it for a future blog post as well.</p>

<h2 id="blog-posts">Blog Posts</h2>

<p>In 2023, I <a href="/2023-11-10-Index-Sidebar-on-My-Blog.html">added an index sidebar to my blog</a>. It works really well. The blog posts in the past 2 years not only have better structure, but the quality is also better. My perspective has changed when I write a new blog post: instead of a one-off piece of writing that I may not read again much, the index structure makes me feel like I’m filling holes in my knowledge graph, and it’s a forever ongoing project so I take more care with it. And as I mentioned in the blog post linked above, the structure also makes me aware of which areas I’m focusing on, so that I can adjust if that’s not consistent with my goals.</p>

<h2 id="retro-hardware">Retro Hardware</h2>

<p>I always liked the aesthetic of some old hardware. Mechanical parts, instead of a flat block with a touch screen, make the design much more interesting and fun to use. In the past two years, I bought some retro hardware and it became a hobby:</p>

<p>In the first months after my daughter was born, she had reflux, so I needed to hold her while she slept very often to keep her from spitting up milk. To make that time easier, I bought a retro mini handheld game console, the <a href="https://officialmiyoomini.com/product/miyoo-mini-plus-v2-official-store-sale">Miyoo Mini+</a>, so I could play games while holding her. It can emulate lots of games up to the PS2 era. It opened a new world to me because there are so many great games from the past that are still fun to play nowadays. I had already realised before that good games are not only about good graphics. If we read classic books from many years ago, why shouldn’t we treat classic games the same way? After flashing the custom ROM <a href="https://github.com/OnionUI/Onion">OnionUI</a>, I can also write some modern retro games for myself with platforms like <a href="https://tic80.com/">TIC-80</a>. My only complaint is that it doesn’t have bluetooth built in, so I cannot use my AirPods (another device I bought to make the time holding the baby easier) with it.</p>

<p>Another big part of the retro hardware is related to radio: I bought some AM/FM analog radios, wideband digital radios, as well as software defined radio devices. I got interested in it after reading the book <a href="https://www.goodreads.com/book/show/18114087-the-knowledge">The Knowledge: How to Rebuild Our World from Scratch</a>. The idea that you can send and capture information through the air just feels so fun. Technically, modern technologies like mobile networks and Wi-Fi still send information through the air, but the digital signals and the many layers added on top make it less fun. Additionally, now that it’s so easy to find the specific things you want to consume, it’s very relaxing to consume passively, like with old-time TV channels. That’s kind of the idea I described in the blog post <a href="/2024-06-03-Random-Video-Playlists-for-Self-Hosted-Videos.html">Random Playlists for Self Hosted Videos</a>. But radios are even more relaxing because you don’t even need to watch anything.</p>

<p>I also bought a second hand Nokia phone <a href="https://en.wikipedia.org/wiki/Nokia_5730_XpressMusic">Nokia 5730</a>. It’s very similar to my first phone Nokia 5320, but with a slide out keyboard. I wanted to play some old Symbian or Java mobile games but didn’t get much chance. It may stay in the drawer for a longer time because I have lots of other retro plans in the new year.</p>

<p>Lastly, I got a Thinkpad X220. This may be worth a blog post of its own (again), so I will just be brief here. This is the laptop I had always wanted since I graduated (the X200 before that). The company I was working for gave us Macbooks, which I couldn’t really complain about. In the following years, I also bought a Macbook as my personal laptop because it’s very hard to find a screen of the same quality on other laptops. In recent years, there were multiple times I wanted to buy a used X220 for various reasons, but I didn’t, mainly because of the price. I finally did it last year when I found that the second-hand market in China is much cheaper than the one in North America. So I got one pretty cheap, and I’m surprised by how capable it still is. I meant to play some old Windows games on it, but that seemed like such a waste, so I installed Linux on it. It works so well that I’m actually taking it on trips instead of my Thinkpad X1 Extreme (gen1).</p>

<p>About old PC gaming, I dual booted Windows on the Thinkpad X220. But then I found my old Thinkpad E40 at my parents’ place. So I will use the Thinkpad E40 for the purpose instead, which is pretty appropriate since it’s the first laptop I had. I played many games on it and there is no better device if I want to play more games from that era.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Overall, life is more structured, even with a new baby in my life. One factor may be that I’ve lived in Toronto for a while and I’m used to the routine. The end of Covid is another big factor that put not only my life but a lot of other people’s lives back on track. I think a sentence in <a href="https://mastodon.binwang.me/@wangbin/112794830887081984">one of my poems</a> encapsulated my past 2 years pretty well but it’s hard to translate to English:</p>

<blockquote>
  <p>抱朴寻微末，结庐草太玄。</p>
</blockquote>

<p>“抱朴” (Bao Pu) refers to the classical Taoist text <a href="https://en.wikipedia.org/wiki/Baopuzi">抱朴子</a>. Literally, it means “hug the simplicity”. “太玄” (Tai Xuan) refers to another famous text in Chinese history, <a href="https://en.wikipedia.org/wiki/Taixuanjing">太玄经</a>. Literally, 太 means “too”, “supreme” or “great”. 玄 is a very special word in Taoism, which means mysterious, profound or difficult to understand. The author Yang Xiong spent lots of time writing it but it doesn’t really have any practical use and his work was not very mainstream in his era. So the sentence above can be roughly translated into:</p>

<blockquote>
  <p>Embracing the simplicity, I seek the profound and subtle.</p>

  <p>In a humble hut, I write Tai Xuan in solitude.</p>
</blockquote>]]></content><author><name></name></author><category term="life" /><summary type="html"><![CDATA[Here is another new year! I didn’t write any retrospective for 2023, so I’d like to combine it with 2024 together in this one since 2023 is a very important year for me. I planned to write it after I came back from China’s travel but never got the time. When I finally got the time it was already too late to write a year-end review. I’m staying in China again now (when I first started this article but I’ve been back when publishing it). While there is no work related thing to worry about when on vacation, I can use this time to review what happened in the past 2 years so it won’t slip through again.]]></summary></entry><entry><title type="html">Jepsen Test on Patroni: A PostgreSQL High Availability Solution</title><link href="https://www.binwang.me/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html" rel="alternate" type="text/html" title="Jepsen Test on Patroni: A PostgreSQL High Availability Solution" /><published>2024-12-02T00:00:00-05:00</published><updated>2024-12-02T00:00:00-05:00</updated><id>https://www.binwang.me/PostgreSQL-High-Availability-Solutions---Part-1</id><content type="html" xml:base="https://www.binwang.me/2024-12-02-PostgreSQL-High-Availability-Solutions-Part-1.html"><![CDATA[<p><em>Note: code used in this article can be found on the Github repo <a href="https://github.com/wb14123/jepsen-postgres-ha">jepsen-postgres-ha</a>.</em></p>

<p>I’ve used CockroachDB for a few of my side projects. I enjoyed it overall. But since it <a href="https://www.cockroachlabs.com/enterprise-license-update/">announced a license change</a> and mandatory telemetry collection for the free version, I started to look for alternatives. The most natural choice is to just use plain old PostgreSQL, since my data size is not that big and even a less powerful machine can handle it without any problem. One of my important requirements for the database is a good high availability setup, so that I can shut down a machine for maintenance from time to time. This series of blog posts will focus on PostgreSQL’s HA solutions instead of why we need them. That’s not to say the why is unimportant, but I’ll save that discussion for another blog post outside of this series.</p>

<p>PostgreSQL doesn’t come with a native high availability solution. Instead, it has features like replication that help you build your own HA solution. But we all know distributed systems are hard to build and error-prone. So I’m planning to test different solutions before I trust them with my data: first using <a href="https://jepsen.io">Jepsen</a> to test correctness and, if a solution passes, benchmarking it to make sure it’s usable in the real world.</p>

<p>In the first part of this series, I’ll introduce the basic Jepsen test setup. Then I’ll use my early test results from <a href="https://github.com/patroni/patroni">Patroni</a>, a very popular PostgreSQL HA solution, as an example. In the test with Patroni, I was able to:</p>

<ul>
  <li>Reproduce a known issue that causes violation of read committed isolation. This is related to a fundamental flaw in PostgreSQL’s replication implementation.</li>
  <li>Observe the cluster failing to recover with 1 node lost out of 3 nodes in total.</li>
</ul>

<p>Ideally I would like to do more tests and dig deeper into it, but I may not have enough free time in the coming 1 or 2 months, so I’d like to record some results here and maybe post updates later.</p>

<h2 id="jepsen-test-setup">Jepsen Test Setup</h2>

<p>The tool used in the tests is Jepsen. For those who are not familiar with it, it’s a tool to test the correctness of distributed systems. I highly recommend that anyone interested in distributed systems read its <a href="https://jepsen.io/analyses">analyses</a>, which have found bugs in almost every system it has tested, including <a href="https://jepsen.io/analyses/postgresql-12.3">PostgreSQL 12.3 with single machine setup</a>. At a high level, it runs queries and checks whether the data is consistent at the end. At the same time, it has many built-in failures (nemeses, in Jepsen’s terms) that can be introduced while the queries run, like node crashes, network partitions, network slowdowns and so on. The analysis of PostgreSQL 12.3 already does an excellent job of explaining how Jepsen tests PostgreSQL, so I won’t repeat it here. I borrowed the append and read workload from that test, with 2 differences in other parts:</p>

<ol>
  <li>The database is set up in a different way. The original test only covers a single machine PostgreSQL, and the bugs it found are already fixed. So we are going to test an HA setup.</li>
  <li>The original test was able to find bugs without introducing any failures. But since that bug has been fixed, we are going to enable different built-in failures like node crash and network slowness to test whether the PostgreSQL cluster still behaves correctly.</li>
</ol>

<p>To be more specific, I use <a href="https://www.vagrantup.com/">Vagrant</a> to create a 3-node virtual machine cluster and install Kubernetes (with <a href="https://k3s.io/">k3s</a>) on it. The Vagrantfile is <a href="https://github.com/wb14123/jepsen-postgres-ha/blob/master/cluster/Vagrantfile">here</a>. It’s mostly from a previous project I created to test k3s, as described in the blog post <a href="/2023-11-28-Introduce-K3s-CephFS-and-MetalLB-to-My-High-Avaliable-Cluster.html">Introduce K3s, CephFS and MetalLB to My High Available Cluster</a>. This setup makes future tests of different HA solutions convenient since most of them support Kubernetes: I can just create yaml files for different systems and only need to implement Jepsen’s interface once to define database setup, tear down, kill and recover:</p>

<ul>
  <li>For setup, just use <code class="language-plaintext highlighter-rouge">kubectl create -f &lt;manifest.yaml&gt;</code>.</li>
  <li>For tear down, just use <code class="language-plaintext highlighter-rouge">kubectl delete -f</code> to delete the whole thing.</li>
  <li>To kill the db, find the root k3s process and <code class="language-plaintext highlighter-rouge">kill -9</code> it along with all its child processes. Make sure to also stop the systemd service so it will not be automatically started again.</li>
  <li>To recover the db, simply start the k3s service again so that the pods will be scheduled on the node again. This only makes sure the k3s service is started. More health checks would be needed to wait until the db has really recovered, but it’s good enough for now.</li>
</ul>

<p>Related code is <a href="https://github.com/wb14123/jepsen-postgres-ha/blob/3f5690f15cb34fdc196067786a48004dceae7ca8/src/jepsen/postgres_db.clj#L122">here</a>.</p>
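
<p>To make the four steps above more concrete, here is a minimal shell sketch of them. It is not the code from the repo (that part is written in Clojure): the <code class="language-plaintext highlighter-rouge">DRY_RUN</code> switch, the <code class="language-plaintext highlighter-rouge">run</code> helper and the function names are made up for illustration, the systemd unit name <code class="language-plaintext highlighter-rouge">k3s</code> is an assumption, and <code class="language-plaintext highlighter-rouge">systemctl kill</code> is swapped in for walking the process tree by hand.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch only. Assumed names: MANIFEST, K3S_UNIT, DRY_RUN and all functions.
MANIFEST="${MANIFEST:-cluster/patroni/k8s.yaml}"
K3S_UNIT="${K3S_UNIT:-k3s}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Setup and tear down go through the Kubernetes manifest.
db_setup()    { run kubectl create -f "$MANIFEST"; }
db_teardown() { run kubectl delete -f "$MANIFEST"; }

# Kill: SIGKILL every process in the unit's cgroup (a substitute for killing
# the root k3s process and its children by hand), then stop the unit so
# systemd does not start it again.
db_kill() {
  run sudo systemctl kill --signal=SIGKILL "$K3S_UNIT"
  run sudo systemctl stop "$K3S_UNIT"
}

# Recover: only starts the service; it does not wait for the db to be healthy.
db_recover()  { run sudo systemctl start "$K3S_UNIT"; }
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">DRY_RUN=1</code> the functions only print the commands, which is a cheap way to check them before pointing them at a real node.</p>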

<p>The code is meant to support any HA setup as long as it can be defined with a Kubernetes manifest. It supports a <code class="language-plaintext highlighter-rouge">--cluster</code> flag to specify which manifest to test. I created a single node PostgreSQL setup and a Patroni setup for now. But in reality, Patroni has some special things that need to be taken care of, like deleting PVs and endpoints. I’ll clean those things up when my focus moves to other HA solutions.</p>

<p>For the introduced failures, ideally we should test all the supported failures combined randomly. But the state space is large and needs a long time to run. So I just created a specific combination to reproduce a known issue.</p>

<h2 id="a-known-issue-of-postgresql-replication">A Known Issue of PostgreSQL Replication</h2>

<p>When I searched for PostgreSQL HA solutions and whether any of them had been tested with Jepsen, I found some <a href="https://news.ycombinator.com/item?id=23499611">comments on Hacker News</a> saying that Patroni doesn’t guarantee consistency under some scenarios. That led me to <a href="https://twitter.com/cyberdemn/status/1265710557006630913">the Twitter discussion</a>, which states there is a fundamental flaw in PostgreSQL’s replication that makes it really hard to implement HA without data loss.</p>

<p>Here is the problem: usually with synchronous replication, a transaction should only be committed and visible after the replica db has persisted the transaction, so that if the primary db fails over to the replica, there is no data loss. But in a special scenario, where the query is cancelled after the client has sent the commit command, PostgreSQL will consider it committed even though the transaction is not replicated yet. So when a failover happens at this time, this “committed data” will be lost from the clients’ point of view. Here is an example:</p>

<table>
  <thead>
    <tr>
      <th>Time</th>
      <th>Node 1</th>
      <th>Node 2</th>
      <th>C1</th>
      <th>C2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Role: Primary<br />Visible data: k -&gt; [1]</td>
      <td>Role: Replica<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>2</td>
      <td>Role: Primary<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td>T1 start</td>
      <td> </td>
    </tr>
    <tr>
      <td>3</td>
      <td>Role: Primary<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td>T1 append 2 to k</td>
      <td> </td>
    </tr>
    <tr>
      <td>4</td>
      <td>Role: Primary<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td>T1 commit</td>
      <td> </td>
    </tr>
    <tr>
      <td>5</td>
      <td>T1 replication started</td>
      <td> </td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>6</td>
      <td> </td>
      <td> </td>
      <td>T1 aborted (conn close? client kill?)</td>
      <td> </td>
    </tr>
    <tr>
      <td>7</td>
      <td>T1 replication not finished, but T1 is visible to other clients<br /><br />Role: Primary<br />Visible data: k -&gt; [1,2]</td>
      <td>Role: Replica<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td>T2 read k, result = [1,2]</td>
    </tr>
    <tr>
      <td>8</td>
      <td>Node crash</td>
      <td>Role: Primary<br />Visible data: k -&gt; [1]</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>9</td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td>T3 read k,<br />result = [1]</td>
    </tr>
    <tr>
      <td>10</td>
      <td> </td>
      <td> </td>
      <td>T4 append 3 to k</td>
      <td> </td>
    </tr>
    <tr>
      <td>11</td>
      <td> </td>
      <td>Role: Primary<br />Visible data: k -&gt; [1,3]</td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>12</td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td>T5 read k,<br />result = [1, 3]</td>
    </tr>
  </tbody>
</table>

<p>In the example above, the value of <code class="language-plaintext highlighter-rouge">k</code> is <code class="language-plaintext highlighter-rouge">[1]</code> at the beginning. C1 appends monotonically increasing values to <code class="language-plaintext highlighter-rouge">k</code> (it tracks the value locally instead of querying <code class="language-plaintext highlighter-rouge">k</code> every time). T1 is aborted before it’s replicated. But even so, the primary node still treats this transaction as committed. So when C2 queries with T2, it gets the result <code class="language-plaintext highlighter-rouge">[1, 2]</code>. Then at time 8, the primary fails over from node 1 to node 2, so when T3 queries <code class="language-plaintext highlighter-rouge">k</code>, it returns <code class="language-plaintext highlighter-rouge">[1]</code> instead of <code class="language-plaintext highlighter-rouge">[1,2]</code>. This is an obvious data loss from our point of view because we know the exact order of events. But one can argue it only breaks the linearizability guarantee: technically, T3 can be ordered before T2 or even T1 and still produce a consistent history, so this violates linearizability but not serializability. However, T5 with the result <code class="language-plaintext highlighter-rouge">[1,3]</code> creates a situation that conflicts with T2:</p>

<ul>
  <li>If T2 is before T5, T5 should have 2 in the result.</li>
  <li>If T5 is before T2, T2 should have 3 in the result.</li>
</ul>

<p>This is not only a violation of serializability, but also read committed because T2 has read the uncommitted data from the client’s point of view.</p>
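
<p>The ordering argument above boils down to a simple check: with an append only list, any two reads of the same key should be prefixes of one another. Here is a minimal shell sketch of that check, with a made-up function name. It is only an illustration of the idea, not how Jepsen implements its checker.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# compatible_reads "1 2" "1 3"
# Each argument is one read of the list, written as space separated values.
# Returns 0 when one read is a prefix of the other, 1 when they disagree
# at a position both of them contain.
compatible_reads() {
  local -a a=($1) b=($2)
  local n=${#a[@]}
  local i=0
  # Only compare the positions that exist in both reads.
  if [ "${#b[@]}" -lt "$n" ]; then
    n=${#b[@]}
  fi
  while [ "$i" -lt "$n" ]; do
    if [ "${a[$i]}" != "${b[$i]}" ]; then
      return 1
    fi
    i=$((i + 1))
  done
  return 0
}

# T3 read [1] and T5 read [1,3]: compatible, T3 can be ordered before T5.
# T2 read [1,2] and T5 read [1,3]: incompatible, no order explains both.
compatible_reads "1 2" "1 3" || echo "T2 and T5 cannot both be right"
</code></pre></div></div>

<p>The first pair only tells us that one transaction saw less data than the other, while the second pair cannot be explained by any ordering of the transactions.</p>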

<h2 id="patroni-setup-for-testing">Patroni Setup for Testing</h2>

<p>Even though this is a known and documented issue, I still tried to reproduce it in my test for a few reasons. First, I want to make sure my test is good enough to actually reproduce it. Second, I want to see it happen in the real world: Patroni’s auto failover makes manually triggering this problem hard because there is only a short window before the commit is replicated.</p>

<p>In my test, I tried to set up Patroni so that it prioritizes consistency the most. The config is <a href="https://github.com/wb14123/patroni/blob/9ecfbe6209af3b3fd686d77bb9beb04deabaf5a9/kubernetes/entrypoint.sh#L14">here</a> for the Docker entrypoint script and <a href="https://github.com/wb14123/jepsen-postgres-ha/blob/04ddd2bcc107f81eb8abdbcf14f0ad5e396dc6d0/cluster/patroni/k8s.yaml#L113">here</a> for the config in Kubernetes. The PostgreSQL version is 16 and the Patroni version is v4.0.3.</p>

<p>The key configurations are about replication mode. The description of each parameter below is copied from <a href="https://patroni.readthedocs.io/en/latest/replication_modes.html">the Patroni documentation about replication modes</a>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">synchronous_mode</code> is set to <code class="language-plaintext highlighter-rouge">on</code>: When <code class="language-plaintext highlighter-rouge">synchronous_mode</code> is turned on Patroni will not promote a standby unless it is certain that the standby contains all transactions that may have returned a successful commit status to client. Turning on synchronous_mode does not guarantee multi node durability of commits under all circumstances. When no suitable standby is available, primary server will still accept writes, but does not guarantee their replication.</li>
  <li><code class="language-plaintext highlighter-rouge">synchronous_mode_strict</code> is set to <code class="language-plaintext highlighter-rouge">on</code>: When it is absolutely necessary to guarantee that each write is stored durably on at least two nodes, enable <code class="language-plaintext highlighter-rouge">synchronous_mode_strict</code> in addition to the <code class="language-plaintext highlighter-rouge">synchronous_mode</code>. This parameter prevents Patroni from switching off the synchronous replication on the primary when no synchronous standby candidates are available.</li>
  <li><code class="language-plaintext highlighter-rouge">synchronous_node_count</code> is left to default as <code class="language-plaintext highlighter-rouge">1</code>: The parameter <code class="language-plaintext highlighter-rouge">synchronous_node_count</code> is used by Patroni to manage the number of synchronous standby databases. It is set to 1 by default. It has no effect when <code class="language-plaintext highlighter-rouge">synchronous_mode</code> is set to off. When enabled, Patroni manages the precise number of synchronous standby databases based on parameter <code class="language-plaintext highlighter-rouge">synchronous_node_count</code> and adjusts the state in DCS &amp; <code class="language-plaintext highlighter-rouge">synchronous_standby_names</code> in PostgreSQL as members join and leave. If the parameter is set to a value higher than the number of eligible nodes it will be automatically reduced by Patroni.</li>
</ul>

<p>In PostgreSQL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>synchronous_commit: "on"
synchronous_standby_names: "*"
max_connections: 500
</code></pre></div></div>

<p>As stated in the Patroni doc, even with this setup it still has the known issue described above:</p>

<blockquote>
  <p>Note: Because of the way synchronous replication is implemented in PostgreSQL it is still possible to lose transactions even when using synchronous_mode_strict. If the PostgreSQL backend is cancelled while waiting to acknowledge replication (as a result of packet cancellation due to client timeout or backend failure) transaction changes become visible for other backends. Such changes are not yet replicated and may be lost in case of standby promotion.</p>
</blockquote>

<p>This is the thing I want to reproduce.</p>

<h2 id="reproduce-read-committed-violation">Reproduce Read Committed Violation</h2>

<p>Reproducing this failure was harder than I thought, even though I knew exactly what was required to trigger it from the beginning. A few factors contributed to this:</p>

<p>First, if the client doesn’t abort the connection itself, it’s hard to reproduce this scenario: the connection then has to be aborted by the primary node, which means introducing some failure on the primary node, and that most likely makes it fail over immediately, before the failure scenario is triggered. Jepsen’s built-in failures/nemeses are mostly on the server side. Since I’m not familiar with Clojure, it took me some time to figure out how to abort the connection just after sending the commit command. The code is <a href="https://github.com/wb14123/jepsen-postgres-ha/blob/04ddd2bcc107f81eb8abdbcf14f0ad5e396dc6d0/src/jepsen/postgres/append.clj#L185">here</a>:</p>

<div class="language-clojure highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">and</span><span class="w"> </span><span class="n">break-conn</span><span class="w"> </span><span class="p">(</span><span class="nb">not</span><span class="w"> </span><span class="n">read-only?</span><span class="p">))</span><span class="w">
  </span><span class="p">(</span><span class="k">let</span><span class="w"> </span><span class="p">[</span><span class="n">result-chan</span><span class="w"> </span><span class="p">(</span><span class="nf">chan</span><span class="p">)</span><span class="w">
        </span><span class="n">close-chan</span><span class="w"> </span><span class="p">(</span><span class="nf">chan</span><span class="p">)</span><span class="w">
        </span><span class="p">]</span><span class="w">
    </span><span class="p">(</span><span class="nf">go</span><span class="w"> </span><span class="p">(</span><span class="nf">&gt;!</span><span class="w"> </span><span class="n">result-chan</span><span class="w"> </span><span class="p">(</span><span class="nf">try</span><span class="w"> </span><span class="p">(</span><span class="nf">run</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nf">catch</span><span class="w"> </span><span class="n">Throwable</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="n">e</span><span class="p">))))</span><span class="w">
    </span><span class="p">(</span><span class="nf">go</span><span class="w"> </span><span class="p">(</span><span class="nf">&lt;!</span><span class="w"> </span><span class="p">(</span><span class="nf">timeout</span><span class="w"> </span><span class="mi">120</span><span class="p">))</span><span class="w">
        </span><span class="p">(</span><span class="nf">try</span><span class="w"> </span><span class="p">(</span><span class="nf">c/close!</span><span class="w"> </span><span class="n">conn</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nf">catch</span><span class="w"> </span><span class="n">Throwable</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="n">e</span><span class="p">))</span><span class="w">
        </span><span class="p">(</span><span class="nf">&gt;!</span><span class="w"> </span><span class="n">close-chan</span><span class="w"> </span><span class="n">true</span><span class="p">))</span><span class="w">
    </span><span class="p">(</span><span class="nf">&lt;!!</span><span class="w"> </span><span class="n">close-chan</span><span class="p">)</span><span class="w">
    </span><span class="p">(</span><span class="k">let</span><span class="w"> </span><span class="p">[</span><span class="n">result</span><span class="w"> </span><span class="p">(</span><span class="nf">&lt;!!</span><span class="w"> </span><span class="n">result-chan</span><span class="p">)]</span><span class="w">
      </span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">instance?</span><span class="w"> </span><span class="n">Throwable</span><span class="w"> </span><span class="n">result</span><span class="p">)</span><span class="w">
        </span><span class="p">(</span><span class="nf">throw</span><span class="w"> </span><span class="n">result</span><span class="p">)</span><span class="w">
        </span><span class="n">result</span><span class="w">
        </span><span class="p">)))</span><span class="w">
  </span><span class="p">(</span><span class="nf">run</span><span class="p">))))</span><span class="w">
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">break-conn</code> is a boolean produced from a random number. If it’s true and the transaction is not read only, the client closes the connection after a timeout of 120ms. As I will talk about later, I introduced a network slowdown to make the round trip about 100ms on average, so a completely replicated transaction should take 100ms (client to primary) + 100ms (primary to replica) = 200ms. So 120ms will hopefully make the connection abort after the commit command is sent to the primary but before the transaction is replicated. This is definitely not perfect, but it’s the best I can do for now.</p>
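
<p>The same timing trick can be tried outside of Jepsen. The sketch below is a rough shell analogue and not the approach used in the test: it relies on GNU <code class="language-plaintext highlighter-rouge">timeout</code> to hard-kill a client process 120ms after it starts, instead of closing the connection from a second go block. The function name is made up, and the host, user, database and table in the commented example are placeholders.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Deadline in seconds; 0.12 mirrors the 120ms used in the Clojure code.
DEADLINE="${DEADLINE:-0.12}"

# Run a client command and SIGKILL it once the deadline passes, so the client
# side goes away without a clean shutdown. The exit status is non-zero when
# the command was killed.
abort_after_deadline() {
  timeout --signal=KILL "$DEADLINE" "$@"
}

# Hypothetical usage against the primary (all names are placeholders):
# abort_after_deadline psql -h node1 -U postgres -d postgres \
#   -c "INSERT INTO txn0 (id, val) VALUES (0, '2')"
</code></pre></div></div>

<p>Whether the kill lands between the commit and the replication still depends on the network delay, which is the same limitation the 120ms timeout has in the test.</p>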

<p>Second, the nemesis package I copied from the original test is not very suitable for reproducing this issue: it randomly introduces different failures (passed in by cli flags) at a predefined average interval. So it needs some luck to hit the scenario that triggers this failure: a combination of slow network + primary failure. So after I studied the failure scenario again, I changed the nemesis suite to keep the network slow all the time and to create a 90 second cycle for primary failure (30 seconds healthy state + 60 seconds node killed).</p>

<p>Third, the default workload creates multiple keys and runs multiple ops in a single transaction, which makes it harder to trigger the exact scenario (the transaction’s network round trip time varies with the number of ops, so 120ms is less likely to close the connection right after the commit). After thinking more about the requirements to trigger the failure, I tuned the parameters to operate on a single key most of the time and to have only a single operation per transaction.</p>

<p>The other factors are mostly about my setup with VM + Kubernetes. For example, I needed to figure out how to do things like crashing the node (k3s root process + child processes). I also needed to adjust the network interface name in network related failures since the VM created <code class="language-plaintext highlighter-rouge">eth1</code> instead of the <code class="language-plaintext highlighter-rouge">eth0</code> hard coded in Jepsen.</p>

<p>At last, you need a little bit of luck: the failure is hard to trigger so my test doesn’t reproduce it every time. Just before I decided to give up, it showed the error. The command to reproduce it is in the <a href="https://github.com/wb14123/jepsen-postgres-ha">readme</a>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>i <span class="k">in</span> <span class="sb">`</span><span class="nb">seq </span>1 10<span class="sb">`</span> <span class="p">;</span> <span class="k">do
  </span>lein run <span class="nb">test</span> <span class="nt">--nodes-file</span> ./nodes <span class="nt">--username</span> vagrant <span class="nt">-w</span> append <span class="nt">--concurrency</span> 10 <span class="nt">--isolation</span> serializable <span class="nt">--nemesis</span> packet,kill <span class="nt">--time-limit</span> 1800 <span class="nt">-r</span> 100 <span class="nt">--nemesis-interval</span> 60 <span class="nt">--break-conn-percent</span> 0.8 <span class="nt">--cluster</span> patroni <span class="nt">--key-count</span> 1 <span class="nt">--max-txn-length</span> 1 <span class="nt">--max-writes-per-key</span> 24000 <span class="nt">--nemesis-suite</span> slow-net-kill
  <span class="nb">sleep </span>30
<span class="k">done</span>
</code></pre></div></div>

<p>Some key params:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">--time-limit 1800</code>: run the test for 1800 seconds.</li>
  <li><code class="language-plaintext highlighter-rouge">-r 100</code>: send 100 queries per second.</li>
  <li><code class="language-plaintext highlighter-rouge">--break-conn-percent 0.8</code>: 80% of the append transactions will be closed after 120ms.</li>
  <li><code class="language-plaintext highlighter-rouge">--key-count 1 --max-txn-length 1 --max-writes-per-key 24000</code>: operate on a single key until the operations exceed 24000 times.</li>
</ul>

<p>As stated above, the test slows down the network the whole time to an average of 100ms round trip. At the same time, it runs this loop: wait 30 seconds, kill the primary node by killing the k3s process and all its child processes, wait 60 seconds, then start the k3s service again.</p>
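
<p>As a rough shell sketch of that schedule (the real one lives in the Clojure nemesis code): the 50ms egress delay per node is just one assumed way to land near a 100ms round trip, the unit name <code class="language-plaintext highlighter-rouge">k3s</code>, the <code class="language-plaintext highlighter-rouge">DRY_RUN</code> switch and the function names are assumptions too, and the kill cycle is meant to run on whichever node currently hosts the primary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env bash
# Sketch only. Assumed names: IFACE, K3S_UNIT, DRY_RUN and all functions.
IFACE="${IFACE:-eth1}"
K3S_UNIT="${K3S_UNIT:-k3s}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delay outgoing packets by about 50ms (10ms jitter). With the same rule on
# every node, a round trip between two nodes takes roughly 100ms.
slow_network() {
  run sudo tc qdisc add dev "$IFACE" root netem delay 50ms 10ms distribution normal
}

# Remove the delay again after the test.
restore_network() {
  run sudo tc qdisc del dev "$IFACE" root
}

# One round: 30 seconds healthy, kill k3s, 60 seconds down, start k3s again.
kill_cycle() {
  local rounds="$1"
  local i=0
  while [ "$i" -lt "$rounds" ]; do
    run sleep 30
    run sudo systemctl kill --signal=SIGKILL "$K3S_UNIT"
    run sudo systemctl stop "$K3S_UNIT"
    run sleep 60
    run sudo systemctl start "$K3S_UNIT"
    i=$((i + 1))
  done
}

# Example: slow_network, then kill_cycle 20 for a 1800 second run.
</code></pre></div></div>

<p>Each round takes 90 seconds, so 20 rounds cover the 1800 seconds of one test run.</p>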

<p>At last, the outer for loop runs the command 10 times so it has a higher possibility to trigger the failure.</p>

<p>As for why to run the test 10 times instead of a single round of 300 minutes, it’s related to another problem I found during the test which will be discussed in the later sections.</p>

<p>An example of the failure from a recent run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:workload {:valid? false,
            :anomaly-types (:incompatible-order),
            :anomalies {:incompatible-order ({:key 0,
                                              :values [[4
                                                        6
                                                        8
                                                        9
// ... omitted lines of numbers
                                                        7164
                                                        7171
                                                        7176
                                                        7181]
                                                       [4
                                                        6
                                                        8
                                                        9
// ... omitted lines of numbers, the same as before
                                                        7164
                                                        7171
                                                        7176
                                                        8433]]})},
            :not #{:read-committed},
            :also-not #{:causal-cerone
                        :consistent-view
                        :cursor-stability
                        :forward-consistent-view
                        :monotonic-atomic-view
                        :monotonic-snapshot-read
                        :monotonic-view
                        :parallel-snapshot-isolation
                        :prefix
                        :read-atomic
                        :repeatable-read
                        :serializable
                        :snapshot-isolation
                        :strong-read-committed
                        :strong-serializable
                        :strong-session-read-committed
                        :strong-session-serializable
                        :strong-session-snapshot-isolation
                        :strong-snapshot-isolation
                        :update-atomic
                        :update-serializable}},
 :valid? false}
</code></pre></div></div>

<p>Check the last 2 numbers of the 2 reads listed above: one transaction reads <code class="language-plaintext highlighter-rouge">[..., 7176, 7181]</code> and the other reads <code class="language-plaintext highlighter-rouge">[..., 7176, 8433]</code>. This reproduces the exact problem we discussed in the last section.</p>

<p>One thing to notice is that even though the behavior of this failure is the same as we discussed in the last section, it doesn’t necessarily mean it’s the same root cause. I say that because this failure is so hard to trigger and I’m not sure yet what exactly triggered it. This is another thing I want to dig deeper into but may not have the time in the near future.</p>

<h2 id="failed-to-recover-the-cluster-when-only-1-out-of-3-nodes-is-lost">Failed to Recover the Cluster When Only 1 Out of 3 Nodes is Lost</h2>

<p>Jepsen generates some graphs after the test, so you can see how the database behaves during the test. The following graph is one of them. It shows the latency of each transaction and the time ranges of the introduced nemeses:</p>

<p><img src="/static/images/2024-12-02-PostgreSQL-High-Availability-Solutions---Part-1/failure-latency-raw.png" alt="failure-latency" /></p>

<p>The green area at the top marks the time when the network is slowed down, which is basically the whole test in our case. The red part is when the primary node is killed by our test. The lighter red part means the test has started to recover the killed node; it lasts until the white part, which means the node has been recovered (we mark it as recovered as soon as the k3s server is back, so it’s not that accurate).</p>

<p>Each square/point represents a transaction. You can see that even without the primary node being killed, there are still some failed transactions, and that’s totally normal in our test: it has errors like conflicting transactions, max connections reached and so on. What matters more is whether there are successful transactions, which the blue squares/points represent.</p>

<p>We can see that during the first few rounds when the primary gets killed, the cluster can recover with 2 of 3 nodes available (the red range). However, as time passes, it can only recover when all the nodes are available (the white range). At the end of the test, there are 2 of 3 nodes available (sometimes all 3 nodes are available, because the recovery time is not always the same, so the end state of the test can also differ). Just out of curiosity, I wanted to see how much time Patroni needs to recover in such a state, so I left it without making the failed node available and waited. However, after 20 minutes, the cluster had still not recovered.</p>

<p>This means the cluster can fail to recover even if you only lose a minority of the nodes. After I <a href="https://github.com/patroni/patroni/issues/3194#issuecomment-2496586804">asked in the GitHub issue</a>, it was <a href="https://github.com/patroni/patroni/issues/3194#issuecomment-2497196127">confirmed</a> that Patroni cannot auto-heal from such a state. Apparently this is a known behaviour, but it came as a surprise to me. To be fair, the behaviour of the parameters is documented, but it’s hard to realize the implications. Again, quoting from the description before:</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">synchronous_node_count</code>: If the parameter is set to a value higher than the number of eligible nodes it will be automatically reduced by Patroni.</p>
</blockquote>

<p>It’s not clear what “eligible nodes” means but it seems it means available nodes instead of all nodes in the cluster.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">synchronous_mode_strict</code>: When it is absolutely necessary to guarantee that each write is stored durably on at least <strong>two nodes</strong> …</p>
</blockquote>

<p>So it’s two nodes instead of a majority of the nodes: if you have a 5-node cluster and you happen to lose the 2 nodes with the most up-to-date data, it’s not possible to recover the data anymore.</p>

<p>Checking the doc again, it seems <a href="https://patroni.readthedocs.io/en/latest/replication_modes.html#quorum-commit-mode">quorum commit mode</a> is something that can help here. But a test with <code class="language-plaintext highlighter-rouge">synchronous_mode=quorum</code> still got the same result. And from the doc:</p>

<blockquote>
  <p>On each iteration of HA loop, Patroni re-evaluates synchronous standby choices and quorum, based on node availability and requested cluster configuration. In PostgreSQL versions above 9.6 all eligible nodes are added as synchronous standbys as soon as their replication catches up to leader.</p>
</blockquote>

<p>From the test, it seems the availability of nodes also affects the quorum.</p>

<p>Back to the question of why we need to run the test multiple times instead of running a single longer test: because the cluster doesn’t recover by itself from a node loss after a few kill loops, more time is needed for healing between node kills, which makes the test less efficient. Also, when waiting long enough between node kills, I was not able to reproduce the read committed violation anymore. That’s another reason I doubt whether it is really caused by the early connection close.</p>

<h2 id="ways-to-improve">Ways to Improve?</h2>

<p>In a quorum-based system, \(V_w\) is the minimum number of nodes a write transaction must write to, and \(V_r\) is the minimum number of nodes a read transaction must read from. In order to maintain serializability, \(V_w + V_r &gt; V\) needs to be true, where \(V\) is the number of <strong>all nodes</strong>, so that any \(V_w\) nodes and any \(V_r\) nodes overlap in at least 1 node. That means when a client reads data from \(V_r\) nodes, at least 1 of them has the latest data. In our case, for normal read transactions, \(V_r\) doesn’t matter since they only read from the primary, which is guaranteed to have the latest data. But when doing a failover, we need to make sure \(V_r\) nodes are available, because the primary is lost and we need to determine which node has the latest data.</p>

<p>In the case of Patroni, since <code class="language-plaintext highlighter-rouge">synchronous_node_count</code> can be automatically reduced and <code class="language-plaintext highlighter-rouge">synchronous_mode_strict</code> only guarantees that data is written to at least 2 nodes, \(V_w\) is essentially set to 2. So in order to maintain consistency, \(V_r &gt; V - V_w = V - 2\) needs to be true, which means the cluster only tolerates 1 lost node no matter the cluster size. But even with only 1 node lost in our test above, Patroni didn’t perform an automatic failover.</p>
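
<p>To make the arithmetic concrete, here is a small worked example. It assumes \(V_r\) takes the smallest value the inequality allows, \(V_r = V - V_w + 1\), and that a failover can proceed while at most \(V - V_r\) nodes are unavailable (both are simplifications for illustration, not Patroni settings). With \(V_w = 2\):</p>

<p>\[ V = 3:\quad V_r = 3 - 2 + 1 = 2, \qquad V - V_r = 1 \]</p>

<p>\[ V = 5:\quad V_r = 5 - 2 + 1 = 4, \qquad V - V_r = 1 \]</p>

<p>For comparison, if a 5-node cluster wrote every transaction to a majority of the nodes (\(V_w = 3\)), the same formula would give:</p>

<p>\[ V = 5:\quad V_r = 5 - 3 + 1 = 3, \qquad V - V_r = 2 \]</p>

<p>So with \(V_w\) fixed at 2, growing the cluster doesn’t buy any extra fault tolerance during a failover; the tolerance only grows if \(V_w\) grows with the cluster.</p>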

<p>So to make it tolerate node loss better, there should be an option similar to <code class="language-plaintext highlighter-rouge">synchronous_node_count</code> that actually enforces the minimal synced replica count instead of reducing it based on node availability. And if the available nodes meet the requirement of \(V_r\), Patroni could do the automatic failover by comparing the largest LSN on each node.</p>
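
<p>The failover rule described above can be sketched in a few lines of Scala. This is only an illustration of the idea, not how Patroni works: <code class="language-plaintext highlighter-rouge">Replica</code>, <code class="language-plaintext highlighter-rouge">readQuorum</code> and <code class="language-plaintext highlighter-rouge">chooseNewPrimary</code> are hypothetical names, the LSN is simplified to a plain number, and the sketch assumes every replica reports its LSN accurately.</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical model for illustration only; none of these names are Patroni APIs.
final case class Replica(name: String, reachable: Boolean, lsn: Long)

object FailoverSketch {
  // Smallest V_r that satisfies V_w + V_r &gt; V.
  def readQuorum(totalNodes: Int, writeQuorum: Int): Int = {
    require(writeQuorum &gt;= 1, "writeQuorum must be at least 1")
    require(totalNodes &gt;= writeQuorum, "writeQuorum cannot exceed totalNodes")
    totalNodes - writeQuorum + 1
  }

  // Returns the replica to promote, or None when too few replicas are reachable
  // to be sure that one of them holds the latest committed write.
  def chooseNewPrimary(replicas: Seq[Replica], totalNodes: Int, writeQuorum: Int): Option[Replica] = {
    val reachable = replicas.filter(_.reachable)
    if (reachable.size &gt;= readQuorum(totalNodes, writeQuorum)) Some(reachable.maxBy(_.lsn))
    else None
  }
}
</code></pre></div></div>

<p>For example, in a 3-node cluster with <code class="language-plaintext highlighter-rouge">writeQuorum = 2</code>, once the primary is gone both remaining replicas must be reachable, and the one with the larger LSN is chosen. If only one of them is reachable, the function returns <code class="language-plaintext highlighter-rouge">None</code> and no automatic failover happens.</p>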

<p><em>Update: it’s actually a trickier problem to resolve without changing PostgreSQL internals. See the section “Quorum” in my newer article <a href="/2025-08-13-Why-Consensus-Shortcuts-Fail.html">What Happens If Raft Algorithm Is Not Fully Implemented</a>.</em></p>

<h2 id="minor-issue-wrong-role-label-for-kubernetes-pod">Minor Issue: Wrong Role Label for Kubernetes Pod</h2>

<p>Lastly, there is a minor issue, which is also the first issue I found during the test: the Patroni doc uses the command <code class="language-plaintext highlighter-rouge">kubectl get pods -L role -o wide</code> to show the role of each Patroni pod. However, the output is inaccurate, as <a href="https://github.com/patroni/patroni/issues/3194#issuecomment-2439591095">confirmed in the GitHub issue</a>. It’s not a big deal but something to be aware of when operating Patroni. I think theoretically it could be fixed by letting the primary pod set the k8s labels for all the other pods.</p>

<h2 id="whats-next">What’s Next?</h2>

<p>Ideally, I still want to dig deeper into testing Patroni since it’s a very popular PostgreSQL HA solution. The test above is only a carefully created scenario based on a known issue. Running larger scale tests with more combined failure scenarios may find more failure modes. However, because of the fundamental PostgreSQL replication flaw described above and the effort needed to run large scale tests, I may want to set up and test another solution first.</p>

<p>The solution is what I had in mind even before I started Patroni’s Jepsen test, which is to set up replication with <a href="https://linbit.com/drbd/">DRBD</a>: instead of using PostgreSQL’s replication, DRBD replicates the whole disk. With modern hardware, the performance with replication overhead should be acceptable, but it remains to be tested, along with its correctness.</p>]]></content><author><name></name></author><category term="database" /><category term="jepsen" /><category term="test" /><category term="distributed system" /><category term="consistency" /><category term="HA" /><summary type="html"><![CDATA[Note: code used in this article can be found on the Github repo jepsen-postgres-ha.]]></summary></entry><entry><title type="html">SBT Task to Build Frontend Components</title><link href="https://www.binwang.me/2024-09-13-SBT-Task-to-Build-Frontend-Components.html" rel="alternate" type="text/html" title="SBT Task to Build Frontend Components" /><published>2024-09-13T00:00:00-04:00</published><updated>2024-09-13T00:00:00-04:00</updated><id>https://www.binwang.me/SBT-Task-to-Build-Frontend-Components</id><content type="html" xml:base="https://www.binwang.me/2024-09-13-SBT-Task-to-Build-Frontend-Components.html"><![CDATA[<p>Even when writing a website using something other than Javascript to render content from the server, sometimes it’s inevitable to have some Javascript or CSS code. So managing Javascript dependencies and build packages is needed. The easiest way may be to just not use any tool: download all the dependency files into a directory and import them in the html file directly. That’s what I was doing for <a href="https://rssbrain.com">RSS Brain</a> before. But it gets messy pretty quickly and it’s hard to keep track of the dependencies. So it’s time for me to resolve the problem. Since the project is written in Scala, I’ll note down how I do it with Scala’s build tool SBT.</p>

<h2 id="frontend-package-management-and-build">Frontend Package Management and Build</h2>

<p>I put all the frontend related code into a separate sub-directory and treat it like a frontend project. This makes things much easier and less hacky. I use npm to manage the dependencies and use webpack to build it. Here is a simplified example of the code tree structure from my project <a href="https://github.com/wb14123/rss_brain_release">RSS Brain</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>▾ js/
  ▾ css/
      google-fonts.css
      main.css
      pico.jade.min.css
  ▾ dist/
      f20305dee9d396fea5c7.ttf
      f5ef242406fdcf40a232.otf
      main.css
      main.js
      main.js.LICENSE.txt
  ▾ fonts/
      google-material-icons-outlined.otf
      google-material-icons.ttf
  ▸ node_modules/
  ▾ src/
      boolean-checkbox.js
      error-handler.js
      global-htmx.js
      index.js
      match-id.js
      popover-menu.js
      register-service-worker.js
      service-worker.js
      set-theme.js
      source-images.js
    package-lock.json
    package.json
    readme.md
    webpack.config.js
▸ project/
▸ src/
  build.sbt
  LICENSE.txt
  readme.md
</code></pre></div></div>

<p>You can see other than the <code class="language-plaintext highlighter-rouge">js</code> directory, it’s a pretty standard structure for a Scala project managed by SBT.</p>

<p>When you look into the <code class="language-plaintext highlighter-rouge">js</code> directory, it’s a frontend project managed by npm and built with webpack.</p>

<p><code class="language-plaintext highlighter-rouge">js/src/index.js</code> bundles all the dependencies in node modules and local files. Here is an example:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// css</span>

<span class="k">import</span> <span class="dl">'</span><span class="s1">somment/somment.css</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">lite-youtube-embed/src/lite-yt-embed.css</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">toastify-js/src/toastify.css</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">../css/google-fonts.css</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">../css/pico.jade.min.css</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">../css/main.css</span><span class="dl">'</span><span class="p">;</span>

<span class="c1">// js</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./boolean-checkbox.js</span><span class="dl">'</span><span class="p">;</span>

<span class="k">import</span> <span class="dl">'</span><span class="s1">htmx.org</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./global-htmx.js</span><span class="dl">'</span><span class="p">;</span>

<span class="k">import</span> <span class="nx">Alpine</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">alpinejs</span><span class="dl">'</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">Alpine</span> <span class="o">=</span> <span class="nx">Alpine</span><span class="p">;</span>

<span class="k">import</span> <span class="o">*</span> <span class="nx">as</span> <span class="nx">FloatingUIDOM</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">@floating-ui/dom</span><span class="dl">'</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">FloatingUIDOM</span> <span class="o">=</span> <span class="nx">FloatingUIDOM</span><span class="p">;</span>

<span class="k">import</span> <span class="dl">'</span><span class="s1">lite-youtube-embed</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">@splidejs/splide</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">Toastify</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">toastify-js</span><span class="dl">'</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">Toastify</span> <span class="o">=</span> <span class="nx">Toastify</span><span class="p">;</span>

<span class="k">import</span> <span class="nx">DOMPurify</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">dompurify</span><span class="dl">'</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">DOMPurify</span> <span class="o">=</span> <span class="nx">DOMPurify</span><span class="p">;</span>

<span class="k">import</span> <span class="dl">'</span><span class="s1">imgs-html</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">somment</span><span class="dl">'</span><span class="p">;</span>

<span class="k">import</span> <span class="dl">'</span><span class="s1">./error-handler.js</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./popover-menu.js</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./match-id.js</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./set-theme.js</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./source-images.js</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="dl">'</span><span class="s1">./register-service-worker.js</span><span class="dl">'</span><span class="p">;</span>

<span class="nx">Alpine</span><span class="p">.</span><span class="nf">start</span><span class="p">();</span>
</code></pre></div></div>

<p>Here is an example of <code class="language-plaintext highlighter-rouge">webpack.config.js</code>:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">MiniCssExtractPlugin</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">mini-css-extract-plugin</span><span class="dl">"</span><span class="p">);</span>

<span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="p">{</span>
  <span class="na">module</span><span class="p">:</span> <span class="p">{</span>
    <span class="na">rules</span><span class="p">:</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="c1">// If you enable `experiments.css` or `experiments.futureDefaults`, please uncomment line below</span>
        <span class="c1">// type: "javascript/auto",</span>
        <span class="na">test</span><span class="p">:</span> <span class="sr">/</span><span class="se">\.(</span><span class="sr">sa|sc|c</span><span class="se">)</span><span class="sr">ss$/i</span><span class="p">,</span>
        <span class="na">use</span><span class="p">:</span> <span class="p">[</span>
          <span class="nx">MiniCssExtractPlugin</span><span class="p">.</span><span class="nx">loader</span><span class="p">,</span>
          <span class="dl">"</span><span class="s2">css-loader</span><span class="dl">"</span><span class="p">,</span>
          <span class="dl">"</span><span class="s2">postcss-loader</span><span class="dl">"</span><span class="p">,</span>
        <span class="p">],</span>
      <span class="p">},</span>
    <span class="p">],</span>
  <span class="p">},</span>
  <span class="na">plugins</span><span class="p">:</span> <span class="p">[</span><span class="k">new</span> <span class="nc">MiniCssExtractPlugin</span><span class="p">()],</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Since this is more related to frontend tech and is very basic, I will not go into too much detail. But the point is, when you run <code class="language-plaintext highlighter-rouge">npx webpack</code> under the <code class="language-plaintext highlighter-rouge">js</code> directory, it will build the bundled files into <code class="language-plaintext highlighter-rouge">js/dist</code>. We will write an SBT task to trigger this command and copy the dist files into the resources to be packaged.</p>

<h2 id="sbt-task-to-trigger-build-and-package-dist-files">SBT Task to Trigger Build and Package Dist Files</h2>

<p>SBT is very flexible since you can basically write Scala code to define the tasks. Here we define the first task to install npm dependencies and trigger webpack build (in <code class="language-plaintext highlighter-rouge">build.sbt</code>):</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">lazy</span> <span class="k">val</span> <span class="nv">webpack</span> <span class="k">=</span> <span class="n">taskKey</span><span class="o">[</span><span class="kt">Unit</span><span class="o">](</span><span class="s">"Run webpack in js directory"</span><span class="o">)</span>
<span class="n">webpack</span> <span class="o">:=</span>  <span class="o">{</span>
  <span class="k">val</span> <span class="nv">workDir</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">File</span><span class="o">(</span><span class="s">"./js"</span><span class="o">)</span>
  <span class="nc">Process</span><span class="o">(</span><span class="s">"npm"</span> <span class="o">::</span> <span class="s">"install"</span> <span class="o">::</span> <span class="nc">Nil</span><span class="o">,</span> <span class="n">workDir</span><span class="o">)</span> <span class="o">#&amp;&amp;</span> <span class="nc">Process</span><span class="o">(</span><span class="s">"npx"</span> <span class="o">::</span> <span class="s">"webpack"</span> <span class="o">::</span> <span class="nc">Nil</span><span class="o">,</span> <span class="n">workDir</span><span class="o">)</span> <span class="o">!</span>
<span class="o">}</span>
</code></pre></div></div>

<p>It defines a task called <code class="language-plaintext highlighter-rouge">webpack</code>, so when you run <code class="language-plaintext highlighter-rouge">sbt webpack</code>, it will run <code class="language-plaintext highlighter-rouge">npm install &amp;&amp; npx webpack</code> under <code class="language-plaintext highlighter-rouge">js</code>.</p>
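
<p>One caveat: <code class="language-plaintext highlighter-rouge">!</code> returns the exit code of the command, and because the task has type <code class="language-plaintext highlighter-rouge">Unit</code> that value is discarded, so a failed <code class="language-plaintext highlighter-rouge">npm install</code> or <code class="language-plaintext highlighter-rouge">npx webpack</code> doesn’t fail the SBT build. Here is a minimal sketch of a stricter variant of the task above. It assumes sbt 1.x, where <code class="language-plaintext highlighter-rouge">Process</code> comes from <code class="language-plaintext highlighter-rouge">scala.sys.process</code>, and it is untested against the RSS Brain build:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import scala.sys.process._

lazy val webpack = taskKey[Unit]("Run webpack in js directory")
webpack := {
  // Resolve js/ against the project root rather than the current working directory
  val workDir = baseDirectory.value / "js"
  val build = Process("npm" :: "install" :: Nil, workDir) #&amp;&amp; Process("npx" :: "webpack" :: Nil, workDir)
  // `.!` runs the commands and returns the exit code; fail the task when it is non-zero
  val exitCode = build.!
  if (exitCode != 0) sys.error(s"webpack build failed with exit code $exitCode")
}
</code></pre></div></div>

<p>With this variant, <code class="language-plaintext highlighter-rouge">sbt webpack</code> (and any task that depends on it) stops with an error instead of silently continuing with stale or missing files in <code class="language-plaintext highlighter-rouge">js/dist</code>.</p>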

<p>Then we define another task to copy all the dist files to the generated resource directory:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Compile</span> <span class="o">/</span> <span class="n">resourceGenerators</span> <span class="o">+=</span> <span class="nv">Def</span><span class="o">.</span><span class="py">task</span> <span class="o">{</span>
  <span class="nv">webpack</span><span class="o">.</span><span class="py">value</span>
  <span class="k">val</span> <span class="nv">file</span> <span class="k">=</span> <span class="o">(</span><span class="nc">Compile</span> <span class="o">/</span> <span class="n">resourceManaged</span><span class="o">).</span><span class="py">value</span> <span class="o">/</span> <span class="s">"webview"</span> <span class="o">/</span> <span class="s">"static"</span> <span class="o">/</span> <span class="s">"dist"</span>
  <span class="nv">IO</span><span class="o">.</span><span class="py">copyDirectory</span><span class="o">(</span><span class="k">new</span> <span class="nc">File</span><span class="o">(</span><span class="s">"./js/dist"</span><span class="o">),</span> <span class="n">file</span><span class="o">,</span> <span class="n">overwrite</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span>
  <span class="nv">IO</span><span class="o">.</span><span class="py">listFiles</span><span class="o">(</span><span class="n">file</span><span class="o">).</span><span class="py">toSeq</span>
<span class="o">}.</span><span class="py">taskValue</span>
</code></pre></div></div>

<p>Here we added some steps for when SBT generates resource files: first we let it run the <code class="language-plaintext highlighter-rouge">webpack</code> task we defined above, then copy all the files under <code class="language-plaintext highlighter-rouge">js/dist</code> to <code class="language-plaintext highlighter-rouge">webview/static/dist</code> under the generated resources. Here resources means Java resource files, like the files under <code class="language-plaintext highlighter-rouge">src/main/resources</code>, except that they are auto-generated under <code class="language-plaintext highlighter-rouge">target/scala-2.13/resource_managed</code> and packaged together with the other resource files.</p>

<p>So when you run <code class="language-plaintext highlighter-rouge">sbt package</code> here, the generated jar package will include all those files as resource files. For example, in my project, the generated jar package has these if you open it with vim (which can view zipped packages):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>81663 webview/static/dist/f20305dee9d396fea5c7.ttf
81664 webview/static/dist/f5ef242406fdcf40a232.otf
81665 webview/static/dist/main.css
81666 webview/static/dist/main.js
81667 webview/static/dist/main.js.LICENSE.txt
</code></pre></div></div>

<h2 id="serve-resource-files-in-http-server">Serve Resource Files in HTTP Server</h2>

<p>Now you can serve the files under <code class="language-plaintext highlighter-rouge">webview/static/dist</code> with your web server. Different web servers or frameworks do it differently. Here is an example of http4s:</p>

<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// include the following route into the http4s web server</span>
<span class="c1">// IMPORTANT: every resource file under `/webview` will be publicly accessible</span>
<span class="k">val</span> <span class="nv">assetsRoutes</span> <span class="k">=</span> <span class="n">resourceServiceBuilder</span><span class="o">[</span><span class="kt">IO</span><span class="o">](</span><span class="s">"/webview"</span><span class="o">).</span><span class="py">toRoutes</span>
</code></pre></div></div>

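<p>Since the routes are rooted at the <code class="language-plaintext highlighter-rouge">/webview</code> resource directory, a file packaged as <code class="language-plaintext highlighter-rouge">webview/static/dist/main.css</code> is served at <code class="language-plaintext highlighter-rouge">/static/dist/main.css</code> (assuming the routes are mounted at the root path). So you can use the bundled files in HTML like this:</p>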

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"stylesheet"</span> <span class="na">href=</span><span class="s">"/static/dist/main.css"</span><span class="nt">&gt;</span>
<span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"/static/dist/main.js"</span> <span class="na">defer=</span><span class="s">"defer"</span><span class="nt">&gt;&lt;/script&gt;</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="Scala" /><category term="SBT" /><category term="Javascript" /><category term="CSS" /><category term="frontend" /><category term="webpack" /><category term="npm" /><summary type="html"><![CDATA[Even when writing a website using something other than Javascript to render content from the server, sometimes it’s inevitable to have some Javascript or CSS code. So managing Javascript dependencies and build packages is needed. The easiest way may be to just not use any tool: download all the dependency files into a directory and import them in the html file directly. That’s what I was doing for RSS Brain before. But it gets messy pretty quickly and it’s hard to keep track of the dependencies. So it’s time for me to resolve the problem. Since the project is written in Scala, I’ll note down how I do it with Scala’s build tool SBT.]]></summary></entry><entry><title type="html">My MacOS Essentials</title><link href="https://www.binwang.me/2024-08-07-My-MacOS-Essentials.html" rel="alternate" type="text/html" title="My MacOS Essentials" /><published>2024-08-07T00:00:00-04:00</published><updated>2024-08-07T00:00:00-04:00</updated><id>https://www.binwang.me/My-MacOS-Essentials</id><content type="html" xml:base="https://www.binwang.me/2024-08-07-My-MacOS-Essentials.html"><![CDATA[<p>As a long time Linux and KDE user, I’m pretty uncomfortable with the workflow of MacOS even though I have used MacOS fairly long as well. A lot of companies don’t support Linux to be used on the development laptop. Even for some companies that do support Linux, the hardware for Linux is usually far worse than Macbooks. So MacOS is often the best or even the only choice for work. This is still the case for my new job. I think it’s a good opportunity to write a blog about my MacOS setup. This can be a note for myself when I need to setup a fresh MacOS again in the future.</p>

<h2 id="my-complaint-about-macos-desktop-environment">My Complaint about MacOS Desktop Environment</h2>

<p>Everyone has different taste and needs about desktop environment and I respect that. The following is just based on my own preference. If you happen to have the same pain points, the setup may help you. Otherwise I find it’s pretty inspiring to see how other people work as well even though I may never work like that.</p>

<p>I mostly just use these apps for work:</p>

<ul>
  <li>A terminal. I use iTerm2 for this. I usually use tmux to manage “windows” in terminal so I usually don’t open multiple iTerm2 windows.</li>
  <li>IDE. Usually IntelliJ IDEA or other JetBrains products.</li>
  <li>Browser: Firefox.</li>
  <li>Team collaboration software like Slack and Zoom.</li>
</ul>

<p>Most of this software is cross-platform, so I don’t have many complaints about the software itself. The things I want to change are in the desktop environment.</p>

<p>There is one thing in MacOS that I wouldn’t get used to in a million years: the logic of grouping windows by app. It results in these problems:</p>

<p>First, it needs a different keyboard shortcut to switch between windows of the same app, which just adds unnecessary complexity. Especially on my HHKB keyboard, the <code class="language-plaintext highlighter-rouge">~</code>/<code>`</code> key is far away from the Tab key: it’s at the top right corner. And it’s hard to see at a glance what windows are available.</p>

<p>Speaking of seeing what windows are available, the dock doesn’t do a good job either. You can only see which apps are open, and I don’t feel it does a good job even at that. Usually I just end up with lots of open windows/apps that are no longer needed, and it’s hard to keep track of them without a proper panel that shows all the windows.</p>

<h2 id="make-it-more-kdewindows-like">Make It More KDE/Windows Like</h2>

<p>So my goal here is to make it more KDE/Windows like, which means:</p>

<ul>
  <li>Use the same keyboard shortcut to cycle through all the windows, do not group the windows by app.</li>
  <li>Have a panel that shows all the windows. Again, do not group by app.</li>
  <li>This is a nice-to-have: use the keyboard to snap windows to the left/right or maximize them.</li>
</ul>

<p>I don’t need the “start menu” since I usually just open apps by bringing up the searchable launcher: Spotlight in MacOS and KRunner in KDE.</p>

<p>So here is a list of software that meets my needs:</p>

<ul>
  <li><a href="https://alt-tab-macos.netlify.app/">AltTab</a>: cycle through all the windows without grouping by app.</li>
  <li><a href="https://rectangleapp.com/">Rectangle</a>: window snapping with keyboard shortcuts.</li>
  <li><a href="https://ubarapp.com/">uBar</a> or <a href="https://sidebarapp.net/">sidebar</a>: KDE/Windows like panel bar to show all windows.</li>
</ul>

<h2 id="other-quality-of-life-improvements">Other Quality of Life Improvements</h2>

<p>There are two other pieces of software I find very useful even though they are not related to the workflow above.</p>

<p>First, <a href="https://github.com/tombonez/noTunes">noTunes</a>. It prevents Apple Music from launching. I find it very annoying that Apple Music pops up when I accidentally press some button or touch my AirPods. I don’t even know what triggers it. This software solves the problem perfectly.</p>

<p>The second one is <a href="https://karabiner-elements.pqrs.org/">Karabiner-Elements</a>. This is a very powerful custom key mapping software. But I mainly use it to support two keyboards at the same time. That is a very very niche personal need: I use two same keyboards as split keyboard. I can write more on that in the future blogs. But the point is, MacOS doesn’t support two keyboards at the same time very well and this software solves that.</p>]]></content><author><name></name></author><category term="MacOS" /><category term="tools" /><category term="software" /><category term="desktop environment" /><summary type="html"><![CDATA[As a long time Linux and KDE user, I’m pretty uncomfortable with the workflow of MacOS even though I have used MacOS fairly long as well. A lot of companies don’t support Linux to be used on the development laptop. Even for some companies that do support Linux, the hardware for Linux is usually far worse than Macbooks. So MacOS is often the best or even the only choice for work. This is still the case for my new job. I think it’s a good opportunity to write a blog about my MacOS setup. This can be a note for myself when I need to setup a fresh MacOS again in the future.]]></summary></entry><entry><title type="html">Source Code of RSS Brain is Available</title><link href="https://www.binwang.me/2024-07-27-Source-Code-of-RSS-Brain-Is-Available.html" rel="alternate" type="text/html" title="Source Code of RSS Brain is Available" /><published>2024-07-27T00:00:00-04:00</published><updated>2024-07-27T00:00:00-04:00</updated><id>https://www.binwang.me/Source-Code-of-RSS-Brain-Is-Available</id><content type="html" xml:base="https://www.binwang.me/2024-07-27-Source-Code-of-RSS-Brain-Is-Available.html"><![CDATA[<p><em>This article is also posted at <a href="https://news.rssbrain.com/news/2024/07/26/Source-Code-Released">RSS Brain blog</a>.</em></p>

<p>When I first published <a href="https://www.rssbrain.com/">RSS Brain</a>, I promised the source code would be released (well, I actually said “open source”, but more on that later). After I rewrote the whole Flutter frontend in Javascript, most of the code now lives in a single source repo, so I feel comfortable releasing it. Here it is on <a href="https://github.com/wb14123/rss_brain_release">GitHub</a>.</p>

<p>There are two things you may notice from the source code:</p>

<ul>
  <li>The commit history is mostly missing.</li>
  <li>The code license is not an open source license.</li>
</ul>

<p>I’ll talk about the most important one first: the code license.</p>

<h2 id="code-license">Code License</h2>

<p>RSS Brain’s source code is released under <a href="https://github.com/wb14123/rss_brain_release?tab=License-1-ov-file#readme">SSPL</a>, the Server Side Public License. I don’t want to use “open source” as a marketing point for RSS Brain, so I must make this clear first: technically, RSS Brain is <a href="https://en.wikipedia.org/wiki/Source-available_software">source-available software</a>, not open source software, since SSPL is not recognized as an open source license.</p>

<p>SSPL is mostly the same as <a href="https://www.gnu.org/licenses/agpl-3.0.en.html">AGPL v3</a>, but with a key difference: it requires the user to release the source code of the whole stack if the project is used commercially. If you want to run the code on your own server and use RSS Brain freely, it’s all good. You can even share your server with family and friends. But as long as you start to charge money for that service, you need to release the source code of everything you use for the service, including things like OS, CI/CD, web server and so on. So it basically makes it impractical to use the source code commercially. I chose that feature on purpose.</p>

<h2 id="the-purpose">The Purpose</h2>

<p>Before I explain why I chose this license, I must explain the reasons for making RSS Brain’s source code available.</p>

<h3 id="transparent-algorithm">Transparent Algorithm</h3>

<p>In a past blog post <a href="https://www.binwang.me/2020-08-02-What-Is-Wrong-abount-Recommendation-System.html">What Is Wrong about Recommendation System</a>, I mentioned I don’t want to be manipulated by recommendation systems. And that’s one of the main motivations for me to start writing my own RSS reader. While there are still ranking and recommendation algorithms in RSS Brain, it tries to provide better information instead of making the user more addicted to the product. In order to prove that, the algorithm needs to be available so that the users can inspect it and decide whether it’s the right one for them.</p>

<p>Be aware that even though the source code is available, some level of trust is still needed, since the code running on my server <a href="https://app.rssbrain.com">app.rssbrain.com</a> can theoretically be different from what is released. But it’s good enough for most people. If you want absolute control, you can always run it on your own server with the available source code.</p>

<h3 id="no-vendor-lock-in">No Vendor Lock-in</h3>

<p>Another important benefit is that the user can expect the software to last. Even if I don’t host the service anymore, there is always a way to continue using it since the code is available. Yes, it’s not commercial friendly, but if the software turns out to be really useful and attracts enthusiasts, I believe someone else will continue to maintain it for free. I think this property is critical for any product that is used every day and becomes an important part of one’s digital life.</p>

<h3 id="no-free-commercial-usage">No Free Commercial Usage</h3>

<p>The next benefit is not for the user, but for myself. I want the <strong>users</strong> to be able to self-host the service for free, but I don’t want other people to take my source code for free and earn money from it. I think that’s reasonable, especially considering I also provide a paid hosted solution at <a href="https://app.rssbrain.com">app.rssbrain.com</a>.</p>

<h3 id="considerations-of-contributors">Considerations of Contributors</h3>

<p>One big advantage of an open source project is that it can attract contributors who make the software better. That can sometimes justify free commercial usage, because all the competitors are contributing to the software. But because this is software I use daily and will keep using, I want to have 100% control of its roadmap, in both the product and the technology aspects. It’s just easier to write everything by myself, at least for now. So I’m fine with choosing a non open source license even if it means fewer potential contributors.</p>

<h2 id="release-process">Release Process</h2>

<p>You may notice the source code has very little commit history. There will be only one commit for each release. The release cycle will be one release every few weeks, depending on how much progress I make. Regular releases will mostly happen on weekends. If there is a bug or a security risk, releases may be more frequent.</p>

<p>The version number is in the format of <code class="language-plaintext highlighter-rouge">X.Y.Z</code>, where Y is increased for every feature release and Z is increased for every bug fix. X is only increased for breaking changes or really major updates.</p>
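
<p>As a rough illustration of this scheme, here is a minimal Python sketch. It is not code from RSS Brain: the <code class="language-plaintext highlighter-rouge">Version</code> class, the <code class="language-plaintext highlighter-rouge">bump</code> method and the release kind names are made up for this example, and resetting the lower numbers to 0 on a higher bump is an assumption borrowed from common versioning practice rather than something stated above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import NamedTuple


class Version(NamedTuple):
    """A version in the X.Y.Z format (hypothetical helper, for illustration only)."""

    x: int  # increased for breaking changes or really major updates
    y: int  # increased for every feature release
    z: int  # increased for every bug fix

    @classmethod
    def parse(cls, text):
        """Parse a string such as "1.4.2" into a Version."""
        x, y, z = (int(part) for part in text.split("."))
        return cls(x, y, z)

    def bump(self, kind):
        """Return the next version for the given kind of release."""
        # Assumption: lower components reset to 0 when a higher one is increased.
        if kind == "major":
            return Version(self.x + 1, 0, 0)
        if kind == "feature":
            return Version(self.x, self.y + 1, 0)
        if kind == "bugfix":
            return Version(self.x, self.y, self.z + 1)
        raise ValueError("unknown release kind: " + kind)

    def __str__(self):
        return f"{self.x}.{self.y}.{self.z}"
</code></pre></div></div>

<p>With this sketch, <code class="language-plaintext highlighter-rouge">Version.parse("1.4.2").bump("feature")</code> gives <code class="language-plaintext highlighter-rouge">1.5.0</code> and a bug fix release on top of that gives <code class="language-plaintext highlighter-rouge">1.5.1</code>. Because the class is a tuple underneath, versions also compare in the natural order.</p>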

<p>I’ll keep my hosted version at <a href="https://app.rssbrain.com">app.rssbrain.com</a> the same as the released source code. This means that at each release, I’ll update the app first and release the source code right after it. I’ll add a section in the app’s settings page to indicate the current version.</p>

<p>I chose this release mode for the same reason I released the code under SSPL. I only want the source code to be available to users, and I don’t really care whether there will be contributions from other people. So hiding the commit history between releases just makes my life easier, since I don’t need to care too much about keeping my commit messages clean.</p>

<h2 id="roadmap">Roadmap</h2>

<p>With this source code released, everyone can inspect the algorithm to decide if this is the right product for them. However, self-hosting, while possible right now, requires some undocumented configuration. So I’ll do the following things to make it easier:</p>

<ul>
  <li>Add documentation for self-hosting.</li>
  <li>Add documentation for admin operations like creating admin users.</li>
  <li>Disable some components by default. To name a few:
    <ul>
      <li>There is a machine learning server mentioned in <a href="https://www.binwang.me/2023-11-14-Update-On-RSS-Brain-to-Find-Related-Articles-with-Machine-Learning.html">this blog</a>. I will likely disable it by default since I’m thinking about redoing this part in the near future.</li>
      <li>Payment is not needed for self-hosted instances so I’ll disable it by default.</li>
      <li>There is an image proxy that I’ll likely disable as well, just to make it easier to deploy.</li>
    </ul>
  </li>
</ul>]]></content><author><name></name></author><category term="RSS Brain" /><category term="open source" /><category term="source available" /><category term="project" /><category term="software engineer" /><summary type="html"><![CDATA[This article is also posted at RSS Brain blog.]]></summary></entry><entry><title type="html">A Review of Linux on Surface Pro 4</title><link href="https://www.binwang.me/2024-07-12-A-Review-of-Linux-on-Surface-Pro-4.html" rel="alternate" type="text/html" title="A Review of Linux on Surface Pro 4" /><published>2024-07-12T00:00:00-04:00</published><updated>2024-07-12T00:00:00-04:00</updated><id>https://www.binwang.me/A-Review-of-Linux-on-Surface-Pro-4</id><content type="html" xml:base="https://www.binwang.me/2024-07-12-A-Review-of-Linux-on-Surface-Pro-4.html"><![CDATA[<h2 id="background">Background</h2>

<p>I bought a Surface Pro 4 in 2016. It has an Intel Core m3-6Y30 CPU and 4GB memory. The spec is not that impressive even compared to an average laptop released years earlier. On the other hand, the form factor is very attractive to me: at a very low price, you get a tablet with a beautiful HiDPI 2k screen, a pressure sensitive stylus and a usable keyboard. It is on the heavier side if used as a tablet, but compared to other laptops, it’s very light. It served me very well for my limited use cases. The blog <a href="/2016-11-28-Config-Development-Environment-on-Windows.html">Build a Unix Like Environment on Windows</a> was written in that era. Some years later, I bought a more powerful laptop when I needed to work while traveling. So I gave the Surface away to a family member.</p>

<p>However, during the past years, I couldn’t stop thinking about having a Linux tablet. At first I checked <a href="https://pine64.org/devices/pinetab/">Pinetab</a>, then I realized I had a Surface which would be perfect if I could install Linux on it. I searched online and found some success stories. So when I <a href="/2024-03-19-Travel-Back-to-China.html">travelled back to my hometown</a> at the beginning of this year, I brought the Surface back with me and started to experiment with it.</p>

<h2 id="use-cases">Use Cases</h2>

<p>Before I go further, I need to mention my intended use cases:</p>

<ul>
  <li>Browse Internet. Mainly <a href="https://www.rssbrain.com/">RSS Brain</a>, the RSS reader I built by myself.</li>
  <li>Media consumption: watch videos from my Samba share and online websites like YouTube.</li>
  <li>PDF reading: reading only is enough for me but it’s better if I can take notes in the PDF.</li>
  <li>Sketches: I never had a habit of taking handwritten notes, even as a student. Nowadays it’s more efficient and readable/searchable to take text notes with Markdown. However, I do like drawing sketches on paper when brainstorming or working through a hard problem. Moving that to digital has a lot of benefits if it works.</li>
  <li>Drawing: this is a nice-to-have feature. I don’t really need to draw things but it’s always fun. Especially with the development of AI, if I draw something and send it to a more powerful machine to generate images, it could open doors to many possibilities.</li>
</ul>

<h2 id="installation">Installation</h2>

<p>The installation of Linux is actually very easy. I tried two distros and the installation process went very smoothly for both of them. The distros I tried are <a href="https://endeavouros.com/">EndeavourOS</a> and Fedora Workstation 40.</p>

<p>The installation steps are well documented in <a href="https://github.com/linux-surface/linux-surface/wiki/Installation-and-Setup#installation">linux-surface’s wiki</a>. <a href="https://github.com/linux-surface/linux-surface">linux-surface</a> is the Linux kernel and tools for Surface devices. The wiki page has its installation steps as well.</p>

<p>In general, if only used as a laptop, the experience is almost perfect even without the linux-surface kernel. But using it as a tablet is another story.</p>

<h2 id="what-works">What Works</h2>

<p>Let’s talk about what works first. Even without the linux-surface kernel, almost everything works except the touch screen and stylus. That includes things like wireless network, Bluetooth, keyboard, power profile, UI scaling for HiDPI and so on. Multi-touch and the pressure sensitive stylus work as well (sort of, see sections below) after installing the linux-surface kernel. Battery life is good enough: about 5-6 hours of light usage like web browsing and PDF reading, and about 3 hours of video watching. (These are rough estimates from my experience; no serious benchmarking was done.)</p>

<p>On the software side, automatic screen rotation is enabled on both distros I tried. KDE with EndeavourOS is very fast and responsive. When the keyboard is detached, it enters tablet mode, which makes some UI elements larger and friendlier to touch gestures. For example, you can just touch a folder to open it in Dolphin instead of double-clicking it.</p>

<p>Gnome is less responsive than KDE, but the UI is really beautiful when used as a tablet. I was never a fan after Gnome 3, but I guess the UI changes it made make more sense on a tablet than on a laptop or a desktop. The overall layout really reminds you of an iPad or Android tablet (in a good way), but with the power of a real desktop OS at the same time. I would really like it if it used fewer resources.</p>

<p>Even though the overall experience is positive and has the potential to meet all my use cases, one serious problem made it practically unusable and made me give up on Linux on the Surface in the end.</p>

<h2 id="the-problems-in-both-distros">The Problems in Both Distros</h2>

<p>The deal breaker is touch recognition. The problem is in the linux-surface tools, so it affects all the distros. The biggest issue is ghost touch: touches are registered randomly even when I do nothing. I tried a lot of workarounds, including the ones mentioned in <a href="https://github.com/linux-surface/linux-surface/wiki/Surface-Pro-5">linux-surface’s wiki page</a>, but none of them resolved it completely. Sometimes it was fixed after a reboot but reappeared after the next one. Sometimes it stayed fixed for a period of time but reappeared after a system upgrade. Sometimes the touch screen didn’t work at all after resuming from sleep. The randomness and the seriousness of the problem are really annoying, so I gave up using the device with Linux in the end.</p>

<p>Other than ghost touch, another big problem with touch recognition is palm rejection. It’s really annoying when drawing with the pen. In iptsd (linux-surface’s daemon for touch recognition), there is a configuration to disable the touch screen when using a pen, but it doesn’t work well. So drawing is practically unusable.</p>

<p>Both KDE and Gnome have virtual keyboards for when the physical keyboard is detached, and they work most of the time despite the problems I’ll mention in the following sections. But if you have set up disk encryption with a password, there is no virtual keyboard when you input the disk password, so a physical keyboard is always needed during boot. That can be annoying but is not really a deal breaker.</p>

<p>The last big problem is battery drain during sleep. It uses about 30% of the battery in one night even when it has been put to sleep. I had similar issues with other laptops. I believe there may be some configurations I can tune to fix that. But after I gave up on Linux because of the ghost touch, I didn’t dig deeper into it.</p>

<p>Other than the problems shared by both distros, each distro/desktop environment also has its own problems.</p>

<h2 id="the-problems-in-kde-with-endeavouros">The Problems in KDE with EndeavourOS</h2>

<p>The biggest problem in KDE, other than the ones I talked about above, is the virtual keyboard. It’s buggy and not very stable. Sometimes it keeps popping up and sometimes it doesn’t show up. It’s especially annoying at the login screen: if it doesn’t pop up you still need a physical keyboard, which prevents the device from being a real tablet. Sometimes when the keyboard is popped up, the panel at the bottom cannot be touched. The bugs happen randomly, which makes them hard to report properly.</p>

<p>Another problem is the touch gesture for right click. Naturally, with a touch screen, long press should be treated like a right click. But that is not the case for KDE. So a lot of operations just cannot be done without a mouse when you need a right click.</p>

<p>Resizing a window is also very tricky with touch-only operation: you need to touch the border precisely on the first try.</p>

<p>Lastly, the scroll behavior is not very smooth. It makes me a little bit dizzy just scrolling through web pages and PDFs.</p>

<p>So I thought I’d give another distro and desktop environment a try, to see if they can resolve my problems.</p>

<h2 id="the-problems-in-gnome-with-fedora-workstation-40">The Problems in Gnome with Fedora Workstation 40</h2>

<p>I chose Fedora because it comes with Gnome, and I had a good experience with it before. After the installation, the first impression is that it’s much slower than KDE with EndeavourOS. I found it enables swap and zram by default so I disabled them. It’s better but still slower than KDE. It uses more memory, at around 40-50% while idle. And I got a lot of OOM kills, which almost never happened with KDE on EndeavourOS.</p>

<p>Maybe because of the slowness, it’s also buggy in lots of operations. For example, when switching to the workspace view from the PDF viewer with a four-finger swipe up, the PDF keeps scrolling in the background. And when scrolling in the file manager, the context menu keeps popping up.</p>

<p>Other than the slowness, there is a problem with the virtual keyboard as well: the backspace key doesn’t work properly. I found a workaround by installing a third-party Gnome extension, but sometimes the old keyboard still popped up.</p>

<h2 id="go-back-to-windows-10">Go Back to Windows 10</h2>

<p>I’d say if the touch recognition worked well enough, all the other problems would be acceptable with KDE. But with those problems, I finally decided to fall back to Windows 10 again. It works well enough, just as I remembered from years ago. However, I abandoned OneNote and some other Microsoft products and use the following software instead:</p>

<ul>
  <li>Firefox as the browser.</li>
  <li>Nextcloud to sync the files.</li>
  <li>Samba for video sharing.</li>
  <li>Built-in video player for local video playback.</li>
  <li>Krita for drawing and sketches.</li>
  <li>Drawboard PDF for PDF reading.</li>
</ul>

<p>It’s pretty disappointing that this device cannot be used with Linux properly. But using Windows is still better than just letting the device sit there doing nothing. Maybe I will re-evaluate it after Windows 10 reaches end of life next year.</p>]]></content><author><name></name></author><category term="Linux" /><category term="Surface" /><category term="Microsoft" /><category term="Operating system" /><category term="tech" /><summary type="html"><![CDATA[Background]]></summary></entry></feed>