We are in the business of moving money. A payment transaction transcends multiple services within our stack, so there are multiple failure points, and the ability to react to a failure and recover from it is critical to ensuring high availability of the systems for merchant partners and customers. The ability to trace and monitor a transaction is supercritical from an operational perspective and for the stakeholders involved. Since it is impractical to avoid exceptions with multiple services and stakeholders in the path of a payment transaction, a reliable and reactive observability stack that can monitor and alert on these exceptions in near real time will help operations identify issues such as outages in external systems or regressions from incremental code rollouts, and take corrective action. The stack should also provide the ability to trace a transaction as it flows through multiple services.
- Functional Observability involves monitoring the business context of payments, including approval rates, decline rates, spikes in declines, missing files, exceptions during file processing, etc.
- The observability stack should surface business errors that occur when new functionality is rolled out to Production, and make it possible to compare approval rates before and after code rollouts
- Telemetry data includes the flow-specific parameters logged from each service, plus the same data aggregated across different dimensions:
o Approval rates for a specific Merchant
o Decline rates for a specific Issuer
o Counts of specific declines, such as Expired Card, Generic Decline, etc.
o The ability for developers to debug at a per-transaction level
o The ability to trigger an alert for business failures, such as a spike in declines
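As a minimal sketch of the aggregations above, the snippet below computes per-merchant approval rates and a simple decline-spike check from per-transaction records. The field names (`merchant_id`, `issuer`, `status`, `decline_code`) and the sample data are illustrative assumptions, not the actual Pine Labs schema.

```python
from collections import Counter

# Hypothetical per-transaction telemetry records, as each service might log them.
transactions = [
    {"merchant_id": "M1", "issuer": "I1", "status": "APPROVED", "decline_code": None},
    {"merchant_id": "M1", "issuer": "I1", "status": "DECLINED", "decline_code": "EXPIRED_CARD"},
    {"merchant_id": "M1", "issuer": "I2", "status": "DECLINED", "decline_code": "GENERIC_DECLINE"},
    {"merchant_id": "M2", "issuer": "I2", "status": "APPROVED", "decline_code": None},
]

def approval_rate_by_merchant(txns):
    """Aggregate per-transaction records into an approval rate per merchant."""
    totals, approved = Counter(), Counter()
    for t in txns:
        totals[t["merchant_id"]] += 1
        if t["status"] == "APPROVED":
            approved[t["merchant_id"]] += 1
    return {m: approved[m] / totals[m] for m in totals}

def decline_spike(txns, baseline_rate, factor=2.0):
    """Flag a spike when the observed decline rate exceeds a multiple of the baseline."""
    declined = sum(1 for t in txns if t["status"] == "DECLINED")
    return declined / len(txns) > baseline_rate * factor

# Count of specific declines (Expired Card, Generic Decline, ...).
decline_counts = Counter(t["decline_code"] for t in transactions if t["decline_code"])
```

In a production stack these aggregations would run in the analytics tier (e.g. over the data in Elasticsearch) rather than in application code; the sketch just makes the arithmetic behind "approval rate" and "spike in declines" concrete.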
Non-Functional Observability
- Non-functional monitoring includes the following at the host level, aggregated at the system level (application server, serverless, DB, etc.):
o Memory Usage
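Host-level metrics like memory usage are normally scraped by an agent (e.g. Metricbeat or a cloud provider's monitor). As a minimal in-process sketch, Python's standard `resource` module (POSIX only) can expose peak memory for a threshold check; the limit value here is illustrative.

```python
import resource

def memory_exceeds(limit_kb: int) -> bool:
    """Return True when the process's peak resident set size exceeds limit_kb.

    Note: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS,
    so this sketch assumes a Linux host. A real observability stack would
    collect such metrics via an agent, not in-process.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak > limit_kb
```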
Cornerstones of Pine Labs' Observability Stack
- Functional Monitoring through the ELK Stack:
o Business-context data is logged at the per-transaction level
o Metrics are extracted from log files using Logstash and published to Kafka
o Data flows through Kafka to the Elasticsearch cluster
o Data is visualized using Kibana
- Telemetry data is sourced from the application and sent to the sinks in a vendor-agnostic way using the OpenTelemetry APIs
- Functional alerts are set up for business failures with thresholds, and the Operations team is notified so it can proactively attend to issues
- Non-functional alerts for spikes in memory or CPU usage, or for an unresponsive application, help the team take corrective action
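In the pipeline above, Logstash would extract metric fields from log lines (typically via grok or dissect filters) before publishing them to Kafka. The same extraction step is sketched below in Python against a hypothetical log format; the field layout is an assumption for illustration, not the actual Pine Labs log schema.

```python
import json
import re

# Hypothetical application log line; the format is assumed for illustration.
LOG_LINE = "2024-01-15T10:21:07Z txn=TX123 merchant=M1 issuer=I1 status=DECLINED code=EXPIRED_CARD"

# Regex playing the role of a Logstash grok/dissect filter.
PATTERN = re.compile(
    r"(?P<ts>\S+) txn=(?P<txn_id>\S+) merchant=(?P<merchant>\S+) "
    r"issuer=(?P<issuer>\S+) status=(?P<status>\S+) code=(?P<code>\S+)"
)

def parse_log_line(line: str) -> dict:
    """Extract business-context fields from a log line into a flat event."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return m.groupdict()

# The event would then be serialized and published to a Kafka topic, from
# which it flows into the Elasticsearch cluster for Kibana dashboards.
event_json = json.dumps(parse_log_line(LOG_LINE))
```

Structuring events at extraction time, rather than indexing raw text, is what makes the per-merchant and per-issuer aggregations described earlier cheap to compute in the Elastic cluster.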
(The author of this article is Siva Shankar, VP Engineering – Payments, at Pine Labs. Views expressed in this article are those of the author.)