The Proper Way to Use Spark Checkpoint

Posted on 03 Nov 2015, tagged spark

These days I’m using Spark Streaming to process real-time data. I’m using updateStateByKey, so I need to enable checkpointing, which is Spark Streaming’s fault tolerance mechanism. A checkpoint saves the DAG (the streaming computation graph) and the state RDDs, so when the Spark application restarts after a failure, it continues computing from where it left off.
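As a minimal sketch of the recovery behaviour described above, here is how checkpointing is usually wired up with PySpark's legacy DStream API. The post does not say which language binding was used, so Python is an assumption, and the names `update_count`, `create_context`, `CHECKPOINT_DIR`, and the socket source on localhost:9999 are hypothetical.

```python
CHECKPOINT_DIR = "/tmp/wordcount-checkpoint"  # hypothetical path


def update_count(new_values, running_count):
    """State update function: add this batch's values to the running total."""
    return sum(new_values) + (running_count or 0)


def create_context():
    """Build the streaming job; called only when no checkpoint exists."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="checkpoint-demo")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)  # required by updateStateByKey

    lines = ssc.socketTextStream("localhost", 9999)  # assumed input source
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .updateStateByKey(update_count)
    )
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Restores the DAG and state from the checkpoint if one exists,
    # otherwise falls back to create_context().
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling `main()` from a script submitted with spark-submit starts the job. On a restart after a failure, `getOrCreate` rebuilds the context from the checkpoint and `create_context` is not called.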

But there is a problem with checkpointing: you cannot load the checkpointed data once you change the class structure of your code, so the state in updateStateByKey is lost. This is a pretty big limitation. An alternative is to save and load the data ourselves, but then checkpointing is useless and we lose the fault tolerance it provides. What about using both? Then the data may be loaded twice when the Spark cluster automatically restarts the application after a failure. So I asked this question on the Spark user list, and somebody kindly gave me a solution: use updateStateByKey with the initialRDD parameter.

The answer was brief, so I will explain it here. The idea is to use both checkpointing and our own data storage mechanism, but to load our own data as the initialRDD of updateStateByKey. Then, in both situations, the data is neither lost nor duplicated:

When we change the code and redeploy the Spark application, we shut down the old application gracefully and clean up the checkpoint data, so the only data loaded is the data we saved ourselves.
When the Spark application fails and restarts, it loads the data from the checkpoint. Because the DAG is restored from the checkpoint too, the step that loads our own data as the initialRDD does not run again, so the only data loaded is the checkpointed data.
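The two situations above can be sketched with PySpark's legacy DStream API, whose `updateStateByKey` accepts an `initialRDD` argument. This is only an illustration under assumptions: the post does not show its storage mechanism, so the JSON-lines file, the local paths, the socket source, and the helper names (`save_state`, `load_state`, `reset_checkpoint`, `create_context`) are all hypothetical.

```python
import json
import os
import shutil

CHECKPOINT_DIR = "/tmp/wordcount-checkpoint"  # hypothetical local path
STATE_PATH = "/tmp/wordcount-state.jsonl"     # hypothetical local path


def update_count(new_values, running_count):
    """State update function: add this batch's values to the running total."""
    return sum(new_values) + (running_count or 0)


def save_state(path, pairs):
    """Write (key, count) pairs as JSON lines, replacing the old file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for key, count in pairs:
            f.write(json.dumps([key, count]) + "\n")
    os.replace(tmp_path, path)


def load_state(path):
    """Read pairs written by save_state; a missing file means no state."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [tuple(json.loads(line)) for line in f if line.strip()]


def reset_checkpoint(checkpoint_dir):
    """Redeploy step: remove the old checkpoint so the new code starts fresh."""
    shutil.rmtree(checkpoint_dir, ignore_errors=True)


def create_context():
    """Runs only when no checkpoint exists, so saved state is loaded once."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="checkpoint-with-initial-rdd")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)

    # Our own saved data seeds the state instead of being merged in later.
    initial_rdd = sc.parallelize(load_state(STATE_PATH))
    words = ssc.socketTextStream("localhost", 9999).flatMap(
        lambda line: line.split()
    )
    counts = words.map(lambda word: (word, 1)).updateStateByKey(
        update_count, initialRDD=initial_rdd
    )
    # Save the state ourselves after every batch; collect() assumes the
    # state is small enough to fit on the driver.
    counts.foreachRDD(lambda rdd: save_state(STATE_PATH, rdd.collect()))
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # After a failure the checkpoint exists, so create_context() is skipped
    # and the saved file is not loaded a second time.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

When redeploying changed code, the old application would be stopped gracefully first, then `reset_checkpoint(CHECKPOINT_DIR)` would be called before starting the new one, so that `create_context` runs and picks up the saved file. `reset_checkpoint` only handles a local directory; a checkpoint on HDFS or another store would need that store's own tooling.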