For much of the last 10 years I have worked on the development and operation of a data management system. This system has two types of customers. The customers providing the data are publishers of scientific, technical, and medical journals and books, and organizers of conferences.
There are 14,000 publisher customers or their representatives. The customers using the data include the publishers themselves, but the primary users are researchers and those providing services to them. There are about 160,000 direct-use customers. On average we make 300,000 changes per day to the data and service 8-15 million external requests per day.
The system is composed of statically allocated VMs running a common executable, with each deployment having a function-specific configuration. (All the VMs run CentOS 7.) We use several instances of HAProxy to dispatch external and internal requests to the appropriate deployment based on information in the URL and/or in HTTP headers. We run the same system in AWS and in our own data centers.
The common executable is a Java 8 implementation within a Spring 3 servlet framework. We build a single WAR file from a single code base for use in Tomcat 8. Each Tomcat installation on a VM has an identity, and that identity is what the executable uses to configure itself on startup. Some of the deployments are microservices in that they manage their own data and are only accessible via a network interface. Other deployments are distributed monoliths, so they can be scaled out but share replicated Oracle and MySQL databases. All communication between deployments is over HTTP.
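To illustrate the idea of one executable configuring itself from its identity, here is a minimal sketch in Java. It is not our production code. The class name `DeploymentConfig`, the property key layout (`default.` and `<identity>.` prefixes), and the `deployment.identity` system property are all hypothetical names chosen for the example.

```java
import java.util.Properties;

public class DeploymentConfig {

    /**
     * Resolve the settings for one deployment: copy every "default." key first,
     * then let keys prefixed with "<identity>." override them.
     */
    public static Properties forIdentity(String identity, Properties all) {
        Properties resolved = new Properties();
        copyWithPrefix(all, "default.", resolved);
        copyWithPrefix(all, identity + ".", resolved);
        return resolved;
    }

    // Copy keys that start with the prefix into the target, stripping the prefix.
    private static void copyWithPrefix(Properties source, String prefix, Properties target) {
        for (String key : source.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                target.setProperty(key.substring(prefix.length()), source.getProperty(key));
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical shared configuration shipped with the single WAR file.
        Properties all = new Properties();
        all.setProperty("default.role", "query");
        all.setProperty("default.poolSize", "10");
        all.setProperty("deposit-01.role", "deposit");

        // Assumption: the container exposes its identity as a system property.
        String identity = System.getProperty("deployment.identity", "deposit-01");
        Properties config = forIdentity(identity, all);
        System.out.println(identity + " role=" + config.getProperty("role")
                + " poolSize=" + config.getProperty("poolSize"));
    }
}
```

The point of the layering is that a new deployment only needs to declare what differs from the defaults, so the same WAR file can be dropped onto any VM unchanged.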
The application manages the canonical data in Oracle and MySQL. (We only use MySQL RDS in AWS.) Data is encoded in XML and deposited via an HTTP API or the administration webapp. All changes are made directly to the canonical data via JDBC. These changes are then propagated to secondary, function-specific repositories on a periodic basis, i.e., the data is pushed. Secondary repositories are built upon MySQL, AWS S3, Solr, Berkeley DB, and bespoke persistent data stores. Tertiary repositories are updated via change events using ActiveMQ (JMS). The different mechanisms for change propagation come from the need for firm control over consistency and stability for the secondary data and, less so, for the tertiary data. Operational logs are centralized and analyzed for deviations. Prometheus and bespoke ETL processes gather and present operational and business metrics.
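The periodic push to a secondary repository can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: in-memory maps stand in for the canonical database and a secondary repository, and the names `PeriodicPush`, `change`, `pushPending`, and `readSecondary` are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicPush {
    // Stand-ins for the canonical store and one secondary repository.
    private final Map<String, String> canonical = new ConcurrentHashMap<>();
    private final Map<String, String> secondary = new ConcurrentHashMap<>();
    // Identifiers of records changed since the last push.
    private final ConcurrentLinkedQueue<String> changedIds = new ConcurrentLinkedQueue<>();

    /** Apply a change to the canonical data and remember that it must be pushed. */
    public void change(String id, String xml) {
        canonical.put(id, xml);
        changedIds.add(id);
    }

    /** Push every pending change to the secondary repository; returns the count pushed. */
    public int pushPending() {
        int pushed = 0;
        String id;
        while ((id = changedIds.poll()) != null) {
            secondary.put(id, canonical.get(id));
            pushed++;
        }
        return pushed;
    }

    public String readSecondary(String id) {
        return secondary.get(id);
    }

    public static void main(String[] args) throws InterruptedException {
        PeriodicPush system = new PeriodicPush();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Illustrative interval only; a real schedule would be far less frequent.
        scheduler.scheduleAtFixedRate(system::pushPending, 100, 100, TimeUnit.MILLISECONDS);

        system.change("record-1", "<record/>");
        Thread.sleep(300);
        System.out.println(system.readSecondary("record-1"));
        scheduler.shutdown();
    }
}
```

Because the canonical side decides when a push runs, the secondary repository only ever sees complete, committed changes, which is the kind of control an event stream makes harder to guarantee.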
The manager and I are the joint architects of the system. My responsibility is technical and his is line-of-business. We both architect operations, but I am not involved in hardware aspects of the data center.
There are 4 developers and 1 operations team member. (The manager is one of the 4 developers.) All developers do new development, maintenance development, and customer support. We use NetBeans for the IDE. We use Jira for issue tracking and light project management. We use Subversion for version control with feature branches, and the trunk is always ready to release. We release every Tuesday. While the release process is efficient, we do not use a fully automated CI/CD process.
We do have an office location where 3 of the team members mostly work, but the manager and I are mostly remote. We also work closely with other staff in the UK and across the US. By necessity we operate as a remote-only team. We use Slack messaging, conference calls, and screen sharing.
The 24x7 production support is done by the operations member, the manager, and me.
I look forward to using these skills and learning new ones. I hope we can talk more about how I can help your team in the coming years.