Performance and Scalability
- Nitin Motgi
Goals
- Improve performance of a single Tephra Transaction Server
- Make Tephra Transaction Server scale horizontally
- Make Tephra Transaction server Highly Available (HA) with Isolation
- Improve operational aspects of Transaction Server
- Improve performance of Workflow scheduling to schedule 1000s of jobs / second
Areas of Focus
- Transaction Invalid List Management
- Tephra Single Server Performance Improvements
- Isolation and Scalability of Tephra Transaction Server
- Improving scheduling performance of Workflow system
High Level Requirements
Transaction Invalid List Management
- System should automatically handle pruning of the transaction invalid list
- Reduce the operational complexity of the manual steps currently needed to prune the invalid transaction list
- Applied during major and striped compaction
- Metrics around the current invalid list size
- Tool to inspect and report progress on pruning
Tephra Performance Improvement
- A single Tephra Server should be able to support up to ~10K transactions/second
- Support read-only and hierarchical conflict detection
Scale and Isolation
- Run multiple instances of Tephra Transaction Server in active-active mode within a single data center (DC) or across multiple DCs
- Isolation at namespace level
Technical Breakdown
P&S-001 : Invalid list pruning with major compaction
Currently, the invalid list keeps growing over time. If it is not pruned periodically using the manual process (which is tricky, time-consuming, and hard to operationalize), the performance of the transaction server degrades. Making list pruning automatic would remove the manual process, reduce operational complexity, and help improve transaction performance. This will be implemented as a hook into major compaction, meaning that invalid list pruning would be triggered during major compaction.
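The pruning idea can be illustrated with a minimal sketch. It assumes, hypothetically, that each region reports a "prune upper bound" after a major compaction (the highest transaction id at or below which the region holds no data from invalid transactions), and that the invalid list may drop only ids at or below the minimum bound across all regions. The class and method names are illustrative and are not Tephra APIs.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of invalid list pruning; not Tephra's actual API. */
final class InvalidListPruneSketch {

    /**
     * Assumption: every region reports one bound after major compaction.
     * The slowest region decides the global bound, so we take the minimum.
     * Returns -1 when no region has reported yet (nothing may be pruned).
     */
    static long globalPruneUpperBound(Collection<Long> regionUpperBounds) {
        if (regionUpperBounds.isEmpty()) {
            return -1L;
        }
        long min = Long.MAX_VALUE;
        for (long bound : regionUpperBounds) {
            min = Math.min(min, bound);
        }
        return min;
    }

    /** Returns a copy of the invalid list without ids at or below the bound. */
    static SortedSet<Long> prune(SortedSet<Long> invalid, long upperBound) {
        if (upperBound == Long.MAX_VALUE) {
            return new TreeSet<>(); // avoids overflow in upperBound + 1
        }
        // tailSet is inclusive, so start at upperBound + 1.
        return new TreeSet<>(invalid.tailSet(upperBound + 1));
    }

    public static void main(String[] args) {
        SortedSet<Long> invalid = new TreeSet<>(List.of(5L, 12L, 20L, 31L));
        long bound = globalPruneUpperBound(List.of(25L, 15L, 40L));
        System.out.println("bound=" + bound + ", remaining=" + prune(invalid, bound));
    }
}
```

In this sketch a single region that has not been compacted recently holds the global bound back, which is why visibility into lagging regions matters for operations.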
P&S-002 : Tool(s) and User Interface to inspect and report progress on pruning
A tool and user interface that can show the progress of pruning while it is running, the impact of pruning on the invalid list, and any regions that are lagging behind and thereby preventing pruning of the invalid transaction list.
P&S-003 : Performance improvements of Single Transaction server
This story focuses on improving the performance of a single transaction server. As part of this, we will improve the locking granularity during conflict detection, support read-only transactions, improve group commit efficiency, add hierarchical conflict detection, transmit only the latest snapshot of the invalid list, and more.
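To make the conflict-detection and read-only points concrete, here is a simplified sketch of optimistic conflict detection. It assumes that a transaction conflicts when any key in its change set was committed by another transaction after it started, and that a transaction with an empty change set is read-only and can skip the check entirely. The names are hypothetical, and the single coarse lock (`synchronized`) stands in for the kind of locking granularity this story aims to refine; this is not Tephra's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of optimistic conflict detection; not Tephra's actual API. */
final class ConflictDetectionSketch {
    // Change key -> pointer assigned to the latest committed write of that key.
    private final Map<String, Long> lastCommitted = new HashMap<>();
    private long nextPointer = 1;

    /** Starts a transaction and returns its start pointer. */
    synchronized long start() {
        return nextPointer++;
    }

    /** Tries to commit; returns false if a conflicting write was committed since start. */
    synchronized boolean commit(long startPointer, Set<String> changeSet) {
        if (changeSet.isEmpty()) {
            return true; // read-only: nothing to check, nothing to record
        }
        for (String key : changeSet) {
            Long committedAt = lastCommitted.get(key);
            if (committedAt != null && committedAt > startPointer) {
                return false; // another transaction wrote this key after we started
            }
        }
        long commitPointer = nextPointer++;
        for (String key : changeSet) {
            lastCommitted.put(key, commitPointer);
        }
        return true;
    }

    public static void main(String[] args) {
        ConflictDetectionSketch detector = new ConflictDetectionSketch();
        long t1 = detector.start();
        long t2 = detector.start();
        System.out.println("t1 commits: " + detector.commit(t1, Set.of("row1")));
        System.out.println("t2 commits: " + detector.commit(t2, Set.of("row1")));
    }
}
```

In the usage above, the two transactions overlap and both write `row1`, so the first commit succeeds and the second is rejected. A read-only transaction never takes the per-key path, which is one reason supporting it explicitly can reduce contention.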
P&S-004 : Transaction server per namespace for resource isolation
Instead of using a single transaction server across all namespaces, we would like to have multiple instances of the transaction server that provide isolation between namespaces. For example, one instance of the transaction server could be responsible for a given namespace, and that responsibility could be rotated among the different instances.
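One possible way to map namespaces to instances is sketched below. The document does not specify an assignment scheme; this example assumes rendezvous (highest-random-weight) hashing purely for illustration, because it lets every client compute the same owner from the list of live servers and moves a namespace only when its current owner leaves. Server names such as `tx-1` and the class itself are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical namespace-to-server assignment using rendezvous hashing. */
final class NamespaceAssignmentSketch {

    /** Deterministic score for a (namespace, server) pair. */
    static long score(String namespace, String server) {
        CRC32 crc = new CRC32();
        crc.update((namespace + "|" + server).getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // always in [0, 2^32)
    }

    /** The live server with the highest score owns the namespace. */
    static String ownerOf(String namespace, List<String> liveServers) {
        if (liveServers.isEmpty()) {
            throw new IllegalArgumentException("no live transaction servers");
        }
        String best = null;
        long bestScore = -1L;
        for (String server : liveServers) {
            long s = score(namespace, server);
            // Ties are broken by name so the result does not depend on list order.
            if (s > bestScore || (s == bestScore && server.compareTo(best) < 0)) {
                best = server;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> servers = List.of("tx-1", "tx-2", "tx-3");
        System.out.println("owner of 'default': " + ownerOf("default", servers));
    }
}
```

If the owning instance goes away, calling `ownerOf` with the remaining servers hands the namespace to the next-highest scorer, while namespaces owned by other instances stay where they are.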
P&S-005 : Support for running multiple instances of Transaction server in Active-Active mode
Running multiple instances of the transaction server in active-active mode, either to shed load or for disaster recovery. This will also tie in with many of the stories for replication.
P&S-006 : Invalid List pruning with striped compaction
This will provide the capability to prune the invalid list during striped compaction or minor compaction, depending on whether it is feasible to do so.
Open Questions
Action Items
Created in 2020 by Google Inc.