By Angela Zhu and Shalini Oruganti
Coinbase’s mission is to create an open financial system for the world. The Coinbase Payments team’s mission is to empower customers to move money in and out of the crypto economy with a delightful and flawless experience. Coinbase currently supports 10+ different payment methods in over 30 countries and we are building more. In this blog, we will share some of the main challenges and best practices for payment systems from an engineering perspective.
Payments is an area with zero tolerance for errors. Ensuring that product flows and features work as expected is of the utmost importance. Any payment bug that affects correctness causes an unacceptable customer experience, and when an error occurs it needs to be corrected immediately. Further, remediating such mistakes is time consuming and is usually complicated by various legal and compliance constraints.
In our systems, we have built multiple tiers to ensure correctness. These range from unit testing during implementation, to production tests and bug bashes for any feature update or flow change, to monitoring of error rates, authorization rates, and success rates, to anomaly detection and alerting that capture regressions introduced by new changes. A tight feedback loop with the product team also helps surface correctness-related issues.
Other than logical correctness, the correctness of system behavior could also be expanded to how exceptions are handled. We discuss some of these concepts in the following sections.
The second important aspect of correctness is how resilient the system is to external issues and bugs. For example, one of the most important concepts in the payments domain is idempotency. It is necessary because when a retry is initiated for a failed transaction, we must ensure the retry doesn’t result in any type of double charge.
Usually, an end-to-end payments system spans the client side, the backend services, and the external partners that handle the payment transactions on the back end. All transactions must be kept as atomic as possible. But some client-to-service or internal-to-external requests can take a long time, especially in timeout or failure cases, and we may only be able to confirm the final result (success or failure) minutes or hours later. So in some of those cases, we initiate retries from upstream to downstream. If the whole end-to-end flow does not handle retries properly, i.e., the system is not idempotent, it will inevitably process the same transaction twice at some point, causing a double charge or double payout.
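To make the idea concrete, here is a minimal sketch of idempotent charge handling. It is an illustration under stated assumptions, not a description of Coinbase's systems: `PaymentProcessor`, `ChargeResult`, and the `idempotency_key` parameter are hypothetical names, and an in-memory dict stands in for what would need to be durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class ChargeResult:
    charge_id: str
    amount_cents: int


@dataclass
class PaymentProcessor:
    """Hypothetical in-memory processor; a real one would persist keys durably."""

    _results: dict = field(default_factory=dict)  # idempotency_key -> ChargeResult
    charges_made: int = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> ChargeResult:
        # A repeated key means this is a retry: return the stored result
        # instead of moving money a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = ChargeResult(f"ch_{self.charges_made}", amount_cents)
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-123", 5_000)
retry = processor.charge("order-123", 5_000)  # e.g. the client timed out and retried
```

Because the retry carries the same key, it receives the original result and only one charge is recorded.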
Once idempotency is ensured, we also need the right design in place for auto-retry, user messaging, and so on.
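One common shape for auto-retry is to generate a single key per logical payment and reuse it on every attempt, backing off between attempts. The sketch below assumes that pattern; `charge_with_retry`, `TransientError`, and `flaky_send` are hypothetical names made up for illustration, and the downstream service is simulated.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical error type for timeouts and other retryable failures."""


def charge_with_retry(send, amount_cents, max_attempts=3, base_delay=0.01):
    # One key for the whole logical payment; every attempt reuses it so the
    # downstream service can recognise duplicates.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return send(key, amount_cents)
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so the caller can message the user
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Simulated downstream: times out twice, then succeeds, recording each key.
seen_keys = []


def flaky_send(key, amount_cents):
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TransientError("timeout")
    return {"status": "succeeded", "amount_cents": amount_cents}


result = charge_with_retry(flaky_send, 5_000)
```

When the attempts are exhausted, the error is re-raised rather than swallowed, which gives the calling layer a clear point at which to decide what to tell the user.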
Another important thing to consider when there are multiple layers from upstream to downstream is the data record: how we design data models, data recording, and propagation so that if any issue arises, we can do our best to recover the system state and trace what happened.
Payments always use both cached data for speed and persistent data for recoverability. Whenever there is caching, it is important to have the right strategy for when to write to which data layer: how we propagate data when there is transient disagreement, how we identify the source of truth, and how we design the whole recovery process to ensure eventual consistency.
Another key to capturing data properly is to keep a reliable record so that we can always trace exactly what happened. This is needed in different contexts, including financial auditing, event logging, and issue investigation.
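As one possible strategy, the sketch below treats the persistent layer as the source of truth, writes to it before the cache, and runs a reconciliation pass to repair disagreement. This is an assumption chosen for illustration, not the only valid design; `PaymentStatusStore` is a hypothetical name and plain dicts stand in for a database and a cache.

```python
class PaymentStatusStore:
    """Hypothetical store: dicts stand in for a database and a cache."""

    def __init__(self):
        self.db = {}     # persistent layer, treated as the source of truth
        self.cache = {}  # fast layer, allowed to be stale or missing

    def write(self, payment_id, status):
        # Persist first, then refresh the cache. If the process dies in
        # between, the cache may be stale but the durable record is correct.
        self.db[payment_id] = status
        self.cache[payment_id] = status

    def read(self, payment_id):
        if payment_id in self.cache:
            return self.cache[payment_id]
        status = self.db[payment_id]  # cache miss: fall back to the database
        self.cache[payment_id] = status
        return status

    def reconcile(self):
        # Recovery pass: make every cache entry agree with the database.
        repaired = []
        for pid in list(self.cache):
            if pid not in self.db:
                del self.cache[pid]
                repaired.append(pid)
            elif self.cache[pid] != self.db[pid]:
                self.cache[pid] = self.db[pid]
                repaired.append(pid)
        return repaired


store = PaymentStatusStore()
store.write("pay_1", "pending")
store.cache["pay_1"] = "succeeded"  # simulate a transient disagreement
repaired = store.reconcile()
```

After reconciliation, reads for `pay_1` return the persisted status again, which is the eventual-consistency property the recovery process is meant to provide.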
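A simple way to picture such a record is an append-only event log, where state changes are added as new events and never edited in place. The sketch below assumes that approach; `PaymentEvent`, `PaymentEventLog`, and the event type strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a recorded event cannot be modified afterwards
class PaymentEvent:
    payment_id: str
    event_type: str  # e.g. "created", "authorized", "settled"
    recorded_at: datetime


class PaymentEventLog:
    """Hypothetical append-only log kept in memory for illustration."""

    def __init__(self):
        self._events = []

    def append(self, payment_id, event_type):
        event = PaymentEvent(payment_id, event_type, datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, payment_id):
        # Full trail for one payment, in the order the events were recorded.
        return [e for e in self._events if e.payment_id == payment_id]

    def current_state(self, payment_id):
        # The current state is derived from the trail, not stored separately.
        events = self.history(payment_id)
        return events[-1].event_type if events else None


log = PaymentEventLog()
for step in ("created", "authorized", "settled"):
    log.append("pay_1", step)
trail = [e.event_type for e in log.history("pay_1")]
```

Deriving the current state from the trail means an auditor or investigator sees the same sequence of events the system itself relies on.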
When it comes to customer experience, the first thing users care about is whether the service is available for them to use. But the technical stack of a payment system consists of multiple layers. We therefore try to add as much redundancy as possible by duplication of critical components to increase reliability of our systems.
Another important aspect of an international payments system is geographical coverage. The speed at which we can add new payment methods in new jurisdictions is crucial. To accelerate integration, it is important to have the right abstractions and abstraction layers, which capture the specific details of each method while hiding them from the rest of the system. For example, a well-designed abstraction can handle both push and pull payments, and can represent pay-in and payout, charge and refund, sync and async payments, etc.
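As a rough sketch of what such an abstraction could look like, the code below puts every payment method behind one interface and models direction and completion status as data. All names here (`PaymentMethod`, `PaymentRequest`, `Direction`, `Status`, and the two example methods) are hypothetical, and which method completes synchronously is an arbitrary choice for the example, not a statement about real payment rails.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    PAY_IN = "pay_in"
    PAYOUT = "payout"


class Status(Enum):
    PENDING = "pending"      # async: the final result arrives later
    SUCCEEDED = "succeeded"  # sync: the result is known immediately


@dataclass(frozen=True)
class PaymentRequest:
    amount_cents: int
    currency: str
    direction: Direction


class PaymentMethod(ABC):
    """Single interface that callers depend on for every payment method."""

    @abstractmethod
    def submit(self, request: PaymentRequest) -> Status: ...


class CardMethod(PaymentMethod):
    def submit(self, request: PaymentRequest) -> Status:
        return Status.SUCCEEDED  # modelled as synchronous in this sketch


class BankTransferMethod(PaymentMethod):
    def submit(self, request: PaymentRequest) -> Status:
        return Status.PENDING  # modelled as asynchronous in this sketch


def move_money(method: PaymentMethod, request: PaymentRequest) -> Status:
    # Callers never branch on the concrete method; details stay hidden.
    return method.submit(request)


pay_in = move_money(CardMethod(), PaymentRequest(5_000, "USD", Direction.PAY_IN))
payout = move_money(BankTransferMethod(), PaymentRequest(5_000, "EUR", Direction.PAYOUT))
```

Adding a new payment method then means adding one more `PaymentMethod` subclass, while `move_money` and its callers stay unchanged.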
Keeping the payment systems maintainable and scalable is of the utmost importance. The KISS principle states “Wherever possible, complexity should be avoided in a system — as simplicity guarantees the greatest levels of user acceptance and interaction.” This principle is especially critical when it comes to payment system design. Any over-complicated logic or knotty code can cause mysterious bugs in the future.
We also lean towards maintaining high-quality runbooks and documents that capture all design considerations and tradeoffs. In our experience, the same design choices can become debatable in the future, and for this reason documentation is invaluable. Most of the design patterns in our systems are interdependent and interact with each other, and each of these components is critical to completing the system. Having full documentation helps new people understand, ramp up, and align with the overall design methodologies.
Although precision is important for building reliable payment systems, we must also look beyond it. Empowering customers to move money with a delightful experience is more than just making transactions safe and correct. End-to-end payment systems are complex and need to incorporate compliance, security, fraud prevention, and other factors. This blog only touches on some of the basic, high-level concepts. In the future, we will share more articles that discuss components of our payment systems in depth.