This post contains a list of notes about systems and software development.
Antifragility (Nassim Nicholas Taleb)
An antifragile system benefits from disorder
- Have options (instead of obligations)
- Create a situation with low downside and high upside
- Create systems with slack
- Convex: a smiling graph. More potential upside than downside (positive Black Swan)
- Concave: a frowning graph. More potential downside than upside (negative Black Swan)
- You are robust if doubling exposure doubles harm
- You are fragile if doubling exposure more than doubles harm
- A single 100 pound stone hurts more than 100 one pound stones
- A function is concave if the average function value is lower than the function of the average value
- A function is convex if the average function value is higher than the function of the average value
Traffic is a function which depends on the number of cars. This function is nonlinear (concave), as the following two situations create different outputs:
- One hour of 8,000 cars, followed by one hour of 12,000 cars
- Two hours of 10,000 cars
Important: Don't confuse x with f(x).
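The two traffic scenarios can be checked with a few lines of arithmetic. The sketch below assumes a made-up travel time function (`travel_time_minutes`, with an invented free-flow time and road capacity); the numbers are not from the book and only illustrate why the average of f(x) differs from f of the average x.

```python
def travel_time_minutes(cars_per_hour: float) -> float:
    """Hypothetical commute time that grows faster and faster as the road fills up."""
    free_flow = 20.0      # assumed minutes on an empty road
    capacity = 14_000.0   # assumed cars per hour at which the road is full
    load = cars_per_hour / capacity
    return free_flow * (1.0 + load ** 4)


# Scenario 1: one hour of 8,000 cars, then one hour of 12,000 cars.
uneven = (travel_time_minutes(8_000) + travel_time_minutes(12_000)) / 2

# Scenario 2: two hours of 10,000 cars (same average x).
steady = travel_time_minutes(10_000)

print(f"uneven hours: {uneven:.1f} min, steady hours: {steady:.1f} min")
```

With these assumed numbers the uneven two hours cost more travel time than the steady two hours, even though the average number of cars is identical.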
Meltdown (Chris Clearfield, András Tilcsik)
Why our systems fail and what we can do about it
Complexity and coupling increase the chance of big errors. In these systems, small problems and minor issues can accumulate into an unforeseen meltdown.
In the past, nuclear power plants were one of only a few systems which were both complex and highly coupled. In recent years, more and more systems have entered this domain.
How to deal with these systems:
- Introduce slack
- Reduce moving parts
- Replace indirect feedback with direct feedback
- Example of indirect feedback: An indicator light that shows that a valve was instructed to close
- Example of direct feedback: An indicator light which shows the actual state (open/closed) of a valve
- Aviation and medical institutions use "near miss reports" to deal with small issues before they can spiral out of control
- Use hindsight bias by performing a premortem. Imagine that the project is completed and that it was a complete failure. What were the reasons why it failed?
- Subjective Probability Interval Estimates (SPIES): Estimate the probability of several outcomes
- Charm school (Everyone should be able to express concern)
- Get attention
- Express your concern
- State the perceived problem
- Propose a solution
- Ask for agreement
- Soften power cues
- This makes you more approachable, which can encourage people to speak their mind
- Leaders speak last
- Encourage discussion about different solutions
- Not encouraging a discussion can be compared to discouraging a discussion
Diverse groups are "better" because their members are more skeptical, which makes it more likely that small errors get caught.
Create logs that hold the following information:
- Problem you are deciding on
- Involved people
- What did you choose? Why? Risks?
- What other things have you considered? Why didn't you choose them?
Michael Nygard described his decision log structure in this blog post:
- Context: Describe the forces at play
- Decision: Your response to these forces
- Status: The decision status (e.g. "proposed")
- Consequences: What happened afterwards?
- Write documentation if
- The thing you are documenting does not change often
- Your documentation will be useful to a large group of people
- Four Kinds of Documentation
- Tutorial: learning-oriented; a getting started guide
- How-To Guide: goal-oriented; how to solve a specific problem
- Explanation: understanding-oriented; provides background and context
- Reference: information-oriented; describes inner processes
The Phoenix/Unicorn Project (Gene Kim)
The Three Ways
- System Thinking
- Amplify Feedback Loops
- Culture Of Continual Experimentation And Learning
Four Types Of Work
- Business Projects
- Internal IT Projects
- Changes generated by the above
- Unplanned Work
The Five Ideals
- Locality and Simplicity
- Focus, Flow, and Joy
- Improvement of Daily Work
- Psychological Safety
- Customer Focus
Accelerate (Nicole Forsgren PhD, Jez Humble and Gene Kim)
The book outlines 24 capabilities that drive improvements in software delivery performance:
- Version control
- Deployment automation
- Continuous integration
- Trunk-based development
- Test automation
- Test data management
- Shift left on security
- Continuous delivery
- Loosely coupled architecture
- Empowered teams
Product and Process
- Customer feedback
- Value stream
- Working in small batches
- Team experimentation
Lean Management and Monitoring
- Change approval processes
- Proactive notifications
- WIP limits
- Visualizing work
- Westrum organizational culture
- Supporting learning
- Collaboration among teams
- Job satisfaction
- Transformational leadership
CAP theorem
- Consistency
- Availability
- Partition tolerance
The CAP theorem states that given a network partition (a split brain situation), a distributed system can either favor consistency or availability.
Example: Imagine that you are part of a fully remote wedding. You answer "the big question" with yes, and then your phone/internet connection breaks before your significant other can answer the same question. What do you do? Do you say that you are married/not married (in which case you would favor availability), or do you say "I don't know" (in which case you would favor consistency)?
The PACELC theorem is an extension to the CAP theorem: In case of a network *P*artition, one has to choose between *A*vailability and *C*onsistency, *E*lse, when a system does not have a partition, one has to choose between *L*atency and *C*onsistency.
Martin Kleppmann posted an interesting article in which he explains that neither CAP nor PACELC is a good way to think about distributed systems.
Jeffrey Fredrick (Episode 14)
Employee satisfaction can indicate the performance of an organization. Ask:
- Are you happy?
- Are you able to do work that you are proud of?
Scott Havens (Episode 22 & 23)
- State management techniques used in functional programming can be used to scale large architectures such as Wal-Mart's warehouse system
- Working with event streams as input and output makes a system decoupled, easier to test and more understandable
- A pure function can be replaced by a lookup table
- Scott mentions a story about how they recovered from a disaster involving the death of a Kafka cluster
- He also tells the story of how he replaced a synchronous call graph involving 23 procedures with asynchronous computation
- Category theory is not too important in your day-to-day use of functional patterns
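The lookup-table remark can be made concrete with a tiny sketch. The function `shipping_zone` and its inputs are invented for illustration; the point is only that a pure function over a finite domain can be computed once for every input and then replaced by a dictionary lookup.

```python
from itertools import product

WAREHOUSES = ("north", "south")
REGIONS = ("eu", "us", "asia")


def shipping_zone(warehouse: str, region: str) -> int:
    """Pure function: the result depends only on the arguments."""
    base = 1 if warehouse == "north" else 2
    surcharge = {"eu": 0, "us": 1, "asia": 2}[region]
    return base + surcharge


# Precompute the result for every possible input once ...
ZONE_TABLE = {args: shipping_zone(*args) for args in product(WAREHOUSES, REGIONS)}


# ... and answer later calls with a plain lookup.
def shipping_zone_lookup(warehouse: str, region: str) -> int:
    return ZONE_TABLE[(warehouse, region)]
```

This only works because `shipping_zone` has no side effects and reads no hidden state; a function that consulted a clock or a database could not be swapped out this way.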
Pilot Decision Management (Clifford Agius)
The TDODAR framework:
Time
- Is it an emergency? Do we have to act quickly?
- Do we need to and can we make more time?
- Do we have time for a cup of coffee?
- Start a stopwatch and make sure you come back to the "T" to check if things change
Diagnosis
What do we think happened?
- Discuss the symptoms
- Ask open questions
- Find some information to tell me this is not XYZ
- Agree on the issue to be tackled
- Make it quick and concise, the clock is ticking!
Options
What should we do?
- Brainstorm possible options
- Tell people "Give me another option!" if the discussion dries up
- Take input from all members of the team and outside sources
- No such thing as a silly idea. Verbalize everything
- Don't drag it out, be quick. Often the first ideas are the best anyway
Decide
What are we going to do?
- As a team decide what is the correct or chosen path
- Don't spend too much time deciding, pick an option and go with it
- State the decision
Assign
Assign tasks to the members of the team.
- Team leader assigns tasks
- Make tasks short and within the skills of that team member
- It is not a race!
- Complete your task as well as you can but don't delay completion
- If you can't complete ask for help
- Consider overload
Review
- Has the issue been resolved?
- Do we still have time?
- Quickly repeat TDODAR to see if actions have changed the answer
- Is it still a good decision?
Sharpening the Tools (Dan North)
Novice: Needs rules (not patterns) to guide the way: Don't ask. Follow this advice and you will be fine.
Advanced beginner: Do not follow the rules! Find out why the rules are the rules. We are starting to get context - we experience how stuff works.
Competent: You become goal oriented. This is a time-based thing. Most people become competent if they keep doing something.
Proficient: This is a deliberate step. Things start to become intuition. Patterns start to become useful. "How can I make this better?"
Expert: You are operating on instinct. You don't think about rules, you "just know". This is critical: You don't know how you come up with your decision.
Learn to love meetings (Dr. Neil Roodyn)
- Have a timeline and an agenda
- "Check-in": Say your name, how you feel and your expectations at the beginning of a meeting
- Decisions are made via votes
- I don't care
- No (you have to provide an alternative to discuss and vote)
- A decision can be postponed through an "investigation". This is used to ask clarifying questions
- They were using a dashboard to display metrics to analyze how the meeting time was spent (e.g. fiddling with the projector, actual discussions and so on)
- Lean Coffee
Preventing the Collapse of Civilization (Jonathan Blow)
- Technology on its own will degrade. It needs constant effort to improve and not lose technology
- Without generational transfer, civilizations die
- We keep adding complexity, which means that each individual knows less and less about a system
- We are reducing development time by using existing tools and frameworks, but we are also giving up capability. This is fine in isolation, but it might become a problem if everyone does it
- Only a handful of people really know how a CPU works
- Our tools change our thought process
Don't Walk Away from Complexity, Run (Venkat Subramaniam)
- Two kinds of code frustrate him:
- One that won't work
- One that works, but shouldn't
- Shared mutability is the devil's work
- Using a library is like dating, using a framework is like getting married
Transactions - Myths, Surprises and Opportunities (Martin Kleppmann)
ACID is more or less a marketing term, it isn't too precise.
Durability used to mean that your data is written to an archive tape. When tapes fell out of fashion, durability was redefined as "fsync to disk". With the rise of distributed systems, durability was redefined once more to mean replication.
- Consistency: This is not the same C as in the CAP theorem
- A database moves from one consistent state to another through transactions. A consistent state is defined through integrity checks or invariants (e.g. the balance of an account cannot be negative)
- It is a property of how the application uses the database, it is not a property of the database itself
- Atomicity: "All or nothing guarantee"
- It's about handling crashes/faults, not about concurrency! You either get all or no parts of a transaction
Isolation: Serializable isolation means that the effect of concurrent transactions is the same as if all transactions were performed serially (one after the other). Each transaction feels as if it had the whole database to itself.
Databases have different default and maximum isolation levels. These levels are:
- Read Uncommitted
- Read Committed:
- Dirty reads/writes are not allowed
- Does not prevent Read Skew (see below). This is scary, as "Read Committed" is the default isolation level for several databases
- Snapshot Isolation:
- Read skews are not allowed. If a transaction is reading the database, the transaction sees the database at a specific point in time. Other transactions do not interfere.
- Does not prevent Write Skew
- Repeatable Read
- Serializable:
- Can prevent Write Skew
- Some implementations use two-phase locking (not two-phase commit!), which uses shared locks. This can be problematic, as analytical queries lock the whole database.
- Other solutions (which don't use shared locks) are H-Store and Serializable Snapshot Isolation
- Dirty Read: A transaction can read what another unfinished transaction wrote
- Dirty Write: Concurrent writes to several tables can interfere with each other
- Read Skew: Imagine a transaction which transfers 100 dollars from one account to another. A backup process might read both accounts at different times (one before a transaction, and one afterwards), which means that the backup now contains inconsistent data
- Write Skew:
- Pattern: Read something, make decision, write decision to database
- Example: An ambulance system requires that each shift has at least one doctor on call. If several doctors request to go off call at the same time, we can end up in a situation in which no doctor is on call. This can happen because these concurrent transactions see the exact same snapshot of a database
- "By the time the write is committed, the premise of the decision is no longer true"
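The on-call example can be replayed in memory. The sketch below is a simulation with plain dictionaries, not real database code; the doctor names and the `run` helper are invented. It shows how two transactions that each check a valid premise against the same snapshot can jointly break the invariant.

```python
def may_go_off_call(snapshot: dict, doctor: str) -> bool:
    """Premise check: at least one *other* doctor is on call in this snapshot."""
    return any(on_call for name, on_call in snapshot.items() if name != doctor)


def run(concurrent: bool) -> dict:
    database = {"alice": True, "bob": True}  # both doctors start on call
    shared_snapshot = dict(database)         # what concurrent transactions see
    for doctor in ("alice", "bob"):
        # Concurrent: every transaction reads the same old snapshot.
        # Serial: each transaction reads the state left by the previous one.
        snapshot = shared_snapshot if concurrent else dict(database)
        if may_go_off_call(snapshot, doctor):
            database[doctor] = False         # each write touches a different row
    return database


print("concurrent:", run(concurrent=True))
print("serial:    ", run(concurrent=False))
```

In the concurrent run both requests are granted and nobody is left on call; in the serial run the second request is refused because its premise no longer holds.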
How did we end up here (Todd Montgomery & Martin Thompson)
- Focus on the fundamentals. Master them and understand them before you try to change them
- Shared mutable state is a complete nightmare and should only be used for systems programming. The smartest people get this wrong all the time
- A cache is one of the hardest problems in computer science. Do you really want to implement it yourself?
- Embrace append-only, single writer, and shared nothing designs
- Universal scalability law: You can't run away from math
- Stop using text encoding. The web is in a constant "debug mode"
- Synchronous communication is the crystal meth of distributed programming. Remote Procedure Calls do not work
- Object orientation and set theories are great models. Please don't use ORMs to make them work together. If you don't understand SQL, please do not use a database
- "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise" - Dijkstra
- Think in terms of transformation and flow of data - not code!
- Farley's second law: "As soon as you realize that most people don't know what they are doing the world makes a lot more sense"
It's about time (Christin Gorman)
The basic time library in your favorite programming language might be horrible. Why? Because they tend to mix two very different concepts:
- The linear progression of time
- An interpretation of time, based on politics, astronomy and history
What time is it? 1532428776. No, I mean what time is it? Well, that depends. Which epoch do you mean?
| System | Epoch |
|---|---|
| .NET | 1 Jan 0001 |
| Windows | 1 Jan 1601 |
| Unix | 1 Jan 1970 |
| GPS | 6 Jan 1980 |
A timestamp on Windows means something completely different than a timestamp on Unix!
Time synchronization (clock drift correction) is the reason why Windows does not guarantee that the system time increases monotonically. So you shouldn't use it to order events or to measure durations. Instead, use something different like the current tick count, or use your own sequence number.
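The same advice carries over to other platforms. As a minimal sketch in Python (the `measure` helper and the variable names are invented for illustration): `time.monotonic()` is documented as unaffected by system clock updates, while `time.time()` follows the wall clock and may jump.

```python
import itertools
import time


def measure(action) -> float:
    """Return how long `action` took, using a clock that never goes backwards."""
    started = time.monotonic()
    action()
    return time.monotonic() - started


elapsed = measure(lambda: time.sleep(0.01))

# Wall-clock time is fine for display, but it may jump when the clock is corrected.
wall_clock_now = time.time()

# A process-local sequence number orders events without relying on any clock.
next_sequence_number = itertools.count(1)
first, second = next(next_sequence_number), next(next_sequence_number)
```

The sequence number only orders events within one process; ordering across machines needs a different mechanism.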
UTC (which stands for Coordinated Universal Time) is an effort to create a system on which we can all agree.
- Store timestamps as UTC together with a time zone
- Do not store start/end timestamps. Instead, store a start timestamp together with a duration. This makes it much easier to deal with events such as daylight saving time changes
- Don't always mock out your database layer. The conversion of dates (which can depend on the time zone of your database and on the time zone of your operating system) will hunt you down
- Make date ranges inclusive from and exclusive to (start <= value < end)
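A minimal sketch of how these rules could fit together. The `Booking` class and its field names are hypothetical; it stores a UTC start, a duration and the time zone name for display, and treats the range as inclusive at the start and exclusive at the end.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Booking:
    start_utc: datetime   # timezone-aware, always UTC
    duration: timedelta   # stored instead of an end timestamp
    time_zone: str        # kept alongside for display, e.g. "Europe/Oslo"

    @property
    def end_utc(self) -> datetime:
        return self.start_utc + self.duration

    def contains(self, moment: datetime) -> bool:
        # Half-open range: start <= value < end
        return self.start_utc <= moment < self.end_utc


first = Booking(datetime(2021, 3, 1, 9, 0, tzinfo=timezone.utc), timedelta(hours=1), "Europe/Oslo")
second = Booking(first.end_utc, timedelta(hours=1), "Europe/Oslo")
boundary = first.end_utc
```

Because the ranges are half-open, the boundary instant belongs to exactly one of two back-to-back bookings (here the second one), so adjacent ranges never overlap and never leave a gap.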
PID Loops and the Art of Keeping Systems Stable (Colm MacCárthaigh)
Present -> Observe -> Feedback -> React -> (Present)
A furnace is a classic example of applied control theory: you want to keep water at a specific temperature. So what do you do? You measure the error (e.g. the water is at 20°C, it should be at 100°C, so the error is 80°C) and react with correcting actions based on the error. To do this, we distinguish three types of controllers:
- P controller: takes proportional steps to correct an error (e.g. the applied heat is proportional to the measured error)
- P systems tend to oscillate around the desired state
- PI controller: adds an integral to observe an error over time
- A PI system still oscillates, but the overall error curve is flattened
- Thermostats or cruise controls use PI systems
- PI systems cannot deal with shocks
- PID controller: adds a derivative component to predict future errors
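A minimal sketch of the three controller types on a toy furnace. The plant model (heat loss proportional to the difference from a 20°C room), the gains and the step size are assumptions chosen for illustration, not values from the talk.

```python
def simulate(kp: float, ki: float, kd: float, target: float = 100.0,
             steps: int = 400, dt: float = 0.1) -> float:
    """Run a discrete PID loop and return the final temperature."""
    temperature = 20.0
    integral = 0.0
    previous_error = target - temperature
    for _ in range(steps):
        error = target - temperature                  # observe
        integral += error * dt                        # I: accumulated past error
        derivative = (error - previous_error) / dt    # D: trend of the error
        heat = kp * error + ki * integral + kd * derivative
        # Hypothetical plant: applied heat minus loss to the 20°C surroundings.
        temperature += (heat - 0.5 * (temperature - 20.0)) * dt
        previous_error = error
    return temperature


p_only = simulate(kp=1.0, ki=0.0, kd=0.0)
pi = simulate(kp=1.0, ki=0.5, kd=0.0)
pid = simulate(kp=1.0, ki=0.5, kd=0.1)
print(f"P: {p_only:.1f}  PI: {pi:.1f}  PID: {pid:.1f}")
```

In this toy model the proportional-only controller settles well below the target, while adding the integral term removes that remaining error. How much a real system oscillates depends on the plant and on the chosen gains.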
Using open loops is scary. The system cannot detect a problem. Chaos engineering and observability are good practices for finding open loops. Open loop systems tend to be imperative (do this, do that), while closed loop systems tend to be declarative (please get the system into my defined desired state).
Power laws are out to get you. A system failure can spread in an exponential way. These failures can be kept in their cages by building smaller systems (which decrease the overall "blast radius"). Other techniques include:
- Exponential back-off
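Exponential back-off can be sketched in a few lines. The helper name `backoff_delays`, the default values and the added random jitter are assumptions for illustration; the jitter is a common companion technique and is not mentioned in the notes above.

```python
import random


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng=random.random):
    """Yield one delay per retry: doubling each time, capped, scaled by jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # rng() is in [0, 1): spreads clients out so they do not retry in lockstep.
        yield rng() * ceiling


# With the jitter pinned to 1.0 the doubling and the cap become visible.
ceilings = list(backoff_delays(5, base=1.0, cap=10.0, rng=lambda: 1.0))
print(ceilings)
```

A caller would sleep for each yielded delay between attempts and give up once the generator is exhausted.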
Sudden load spikes can bring down a system. In general: keep your queues short. LIFO queues might be a good idea, as they will prioritize new information.
Implementing an edge-triggered system implies that you have solved the "deliver just once" problem. Level-triggered (and idempotent) systems seem to be a simpler solution.
Big Numbers and the 1Hz CPU (Tom Hudson)
We do not have a good intuition for how fast different parts of a computer are.
Let's have a look at a 3 GHz CPU and different access times:
- Register: 0.3ns
- L1 cache: 1.5ns
- L2 cache: 3ns
- L3 cache: 13ns
- RAM: 0.1 microseconds
- HDD: 6ms
- SSD: 80 microseconds
All these values seem "low enough", but let's put them into perspective using a 1 Hertz CPU:
- Register: 1 second
- L1 cache: 4.5 seconds
- L2 cache: 9 seconds
- L3 cache: 39 seconds
- RAM: 5 minutes
- HDD: 9 months
- SSD: 1 day
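The scaling behind this comparison is a single multiplication: stretch every duration by the clock rate, so that one 3 GHz cycle lasts one second. The sketch below uses the access times listed above; the helper names are invented.

```python
CLOCK_HZ = 3_000_000_000  # 3 GHz: cycles per second on the real CPU

ACCESS_TIMES_SECONDS = {
    "Register": 0.3e-9,
    "L1 cache": 1.5e-9,
    "L2 cache": 3e-9,
    "L3 cache": 13e-9,
    "RAM": 0.1e-6,
    "SSD": 80e-6,
    "HDD": 6e-3,
}


def at_one_hertz(seconds: float) -> float:
    """Scale a real duration so that one 3 GHz cycle takes one full second."""
    return seconds * CLOCK_HZ


def humanize(seconds: float) -> str:
    for unit, size in (("days", 86_400), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for name, seconds in ACCESS_TIMES_SECONDS.items():
    print(f"{name}: {humanize(at_one_hertz(seconds))}")
```

A strict multiplication reproduces the cache and RAM figures. For the disks it lands on a few days (SSD) and several months (HDD), so the rounded values in the list are best read as orders of magnitude.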
Design, Composition, and Performance (Rich Hickey)
- Design is taking things apart so you can put them back together
- An instrument is a tool for an expert
- You learn an instrument by playing the actual instrument. There is no real alternative. This means that you are using an expert's tool while being a novice. But you won't be a novice for long
- An instrument is (for the most part) very simple. It is made to work in a very specific way. Composers can use several instruments to create a predictable outcome. This would be hard if instruments weren't that limited
- A musician spends most of his time practicing instead of performing. Why is our industry different?
- We should build interfaces for machines first and then put an interface for a human on top
- Constraint is a driver for creativity
- Design is making decisions. It's about saying no
Thinking Fast and Slow (Linda Rising)
System 1
- Unconscious (runs 24/7)
- Fast, intuitive
- Can multi-task
- ~11 million bits/second
- 95% of cognitive function
System 2
- Slow, rational, forgetful
- Linear (Cannot multi-task)
- ~40 bits/second
- 5% of cognitive function
We identify with System 2 and we believe that System 2 is in charge.
System 1 gains its speed by using heuristics. It is also in charge of "telling our story" in which we are identified as the hero. System 1 is prone to biases such as:
- Confirmation bias: We seek confirmation instead of information. We like to stick to our point of view, even in the face of evidence which supports a different point
- Cognitive dissonance: We have a hard time keeping two contradicting ideas in our head
- Naive realism: We believe that we are rational and that a disagreeing party will "see" if we present them "our facts"
We overestimate our own understanding and underestimate the role of randomness in our world. We look for patterns and explanations, even if there aren't any.
System 2 can only focus for about 50 minutes (max) before taking a break.
We use System 2 to learn something new. Over time, a certain skill moves to System 1 (e.g. walking, driving, or playing an instrument). After it has moved to System 1, interference from System 2 can hurt our performance by "overthinking".
System 2 takes a lot of energy. Self control causes a drop in your blood glucose. We have a limited pool of "mental energy". This is why we tend to make worse decisions when tired or hungry.
System 2 believes that it runs the show, but System 1 is in charge! And that's good. You don't want to trust a system which lets you forget your keys to care about essential tasks such as breathing.
- Water, tea, coffee available
- Standing should be OK
- Very small groups
- Limit meeting times to ~40 minutes. For longer meetings, take a different seat after a break
- 10 minute break before important decisions
Mistakes and Discoveries While Cultivating Ownership (Aaron Blohowiak)
- Avoid rules: Do not constrain people. We need good judgment
- People over Process: The world is changing, while your process is lagging behind
- Context not Control: You can't really make good decisions if you do not understand your environment. A manager knows less than the "people in the field"
- Freedom and Responsibility: Have options and hold people responsible for the quality of their decision making
Levels of Ownership
- Demonstration: No ownership
- Oversight: You do it, but we will pre-approve it
- Observation: You do it and we will review it after it is done
- Execution: Here's where we want to go and we know that you will pull it off. We might check just so that we know what's going on
- Vision: You understand your responsibilities and your shareholder's needs
- Different ideas about which level we should be at
- Not being explicit when levels change
Changing your Habits & Environment to get more Professional Productivity (Linda Rising)
- We sit too much and move too little
- Lying down can improve your problem solving skills
- Try to have meetings while walking
Functional data that adapts to change (Don Syme)
- Classic UIs are built using the MVVM pattern
- A different approach to building UI is called MVU: Model, View, Update
- Examples: Svelte, Elm, React Native
- MVU is based on functional principles
- There is a unidirectional data flow
- "UI becomes calculation and information, not state"
- We create a view based on a model and update the model through messages, which in turn changes the view
- An initial reaction might be that "functional" and "high performance" cannot go together. The key to making it work is "incremental functional programming", which is related to event sourcing
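The MVU loop can be sketched without any UI framework. The counter model, the message names and the text-based `view` below are invented for illustration; real MVU frameworks render widgets or HTML instead of strings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Model:
    count: int = 0


def update(model: Model, message: str) -> Model:
    """Pure function: old model + message -> new model (nothing is mutated)."""
    if message == "increment":
        return replace(model, count=model.count + 1)
    if message == "reset":
        return Model()
    return model


def view(model: Model) -> str:
    """The view is calculated from the model; it holds no state of its own."""
    return f"Count: {model.count}"


model = Model()
for message in ["increment", "increment", "reset", "increment"]:
    model = update(model, message)
    print(view(model))
```

Data flows in one direction only: messages go into `update`, the new model goes into `view`, and the view never changes the model directly.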
A Cheap Effective Method for Dealing with Stressful Situations (Linda Rising)
- The pandemic has created a very stressful environment
- Long periods of anxiety compromise our immune system
- What doesn't work:
- Suppressing/denying a stressful situation
- Positive thinking (not strong enough)
- Blaming others/circumstances
- What does work: expressive writing. Write about your troubles
- General instructions
- Write 15-20 min/day for 4-5 consecutive days
- Topic should be personal and important
- Write continuously. Don't worry about punctuation, spelling, grammar. If you run out of things to say, repeat what you have written. Keep pen on paper.
- Write only for yourself. Destroy or hide what you are writing. Do not turn the exercise into a letter. The result is for your eyes only.
- If you feel you cannot write about something because it will push you over the edge, STOP!
- Some feel sad after writing, especially on the first day. This feeling usually goes away in an hour or so
- Pen and paper work best, but typing or voice recording are OK
- Writing before stressful situations (e.g. test taking, presentations, surgery, …) can also be beneficial
If (domain logic) then CQRS, or Saga (Udi Dahan)
- Hard deletes are painful as they can lead to cascading deletes (e.g. deleting a product may delete user purchases)
- We use soft deletes as a "quick fix" to the cascading delete problem
- But deleting makes a lot of sense in a "private domain", e.g. when a user updates the product catalog. We can treat this domain as a sandbox, where the user can manipulate data in an easy way
- We need to validate data when we are publishing it from the "private domain" to a "public domain" (e.g. so that the customer can see the updated product catalog)
- Deletes in a "public domain" hide business intent. Why do you want to delete data? Do you really want to delete this product, or do you intend to no longer sell this product?
- Systems like Amazon are a collaborative domain. Checking invariants is doomed to be full of race conditions. Example: A user adds a product to his shopping cart. An employee marks the same product as "not for sale". Depending on the timing of these requests, an invariant such as "a user cannot buy an item if it is not for sale" cannot hold.
- We need to deal with eventual consistency in the context of the business. Don't confuse this with technical eventual consistency (e.g. updating read models)
Cultivating Architecture (Martin Fowler, Birgitta Böckeler)
- Good architecture can accelerate a team as it can keep the cost of change down
- Software delivery performance correlates with organizational performance
- Strive to create autonomous teams
- Inform technical staff about the business goals
- Create a set of guiding principles which should help a team when dealing with
- Find principles by identifying what's moving you forward and what's holding you back
- Create your own tech radar. What technology do we use? What do we want to try? What do we want to get rid of?
- Document any decisions. A simple markdown file might be enough
What I learned from three years of sciencing the crap out of DevOps (Jez Humble)
- Job satisfaction is the biggest indicator for organization performance
- IT companies with high throughput perform better in terms of stability
Files (Dan Luu)
- We believe that file systems are a solved problem and that they share a common abstraction, but that is not true
- Writing a file may seem easy, but there's a lot that could go wrong. File systems have bugs too
- File operations may not be atomic
- Even great programmers make mistakes when using the file system. Static analysis tools found bugs when inspecting code bases such as Git, Postgres
- Sqlite is a rather stable way to interact with the file system
- Different file systems have different behavior when dealing with errors
- SSDs need ECC (error correcting codes) not to be "better", but rather to work at all
- Computers don't work
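One widely used way to avoid half-written files is to write to a temporary file and rename it over the target. The sketch below assumes a POSIX-like file system where a rename within one directory is atomic; the helper name `write_atomically` is invented, and as the talk stresses, real file systems differ in what they guarantee (this sketch does not fsync the directory, for example).

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, hence on the same file system.
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())   # ask the OS to push the bytes to the device
        os.replace(temp_path, path)     # rename over the target
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(temp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling `write_atomically("settings.json", b"{}")` either leaves the previous file in place or swaps in the complete new one; it never truncates the target first.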
Optimize For Time (Andy Walker)
- High performing teams seem to have more time to get things done. They don't hurry. They hit their deadlines
- Struggling teams seem to always be behind
- Busyness is a curse
- Four things he holds true
- Invest in improvement
- Respect each other's time
- Ruthless about time (say no to things that don't make sense)
- Anticipate problems
- What if the team is the product? If you are not investing in moving faster you're moving slower
- Only interrupt people if there is an important reason to do so
- Teams that invest in each other achieve more
- Change is expensive! Fail fast
- When given a hard deadline, work from the basis that everything is going to
- Plan for failure
- Plan to fail cheaply
- Your plan is not the outcome
- Recover quickly
Conversational Transformation (Jeffrey Fredrick, Douglas Squirrel)
Conversational Analysis with The 4 Rs
- Fold a piece of paper in half. Write the major points of your conversation on the right hand side. Record what you thought (but didn't say) on the left hand side
- How many genuine questions were asked?
- What is on the left side that isn't on the right?
- What sets off negative reactions for you?
Continuous Retrospectives (Linda Rising)
- In times like COVID we cannot even remember what day it is. How can we then have a meaningful discussion (retrospective) about a long project?
- Continuous retrospectives: Hang up a timeline and add sticky notes throughout the day. Capture ideas, questions, concerns, events, problems, successes, failures
- Spend the last 15 minutes writing about and reflecting on the lessons learned that day
- Guide Boards
- Retrospectives offer different opportunities:
- Project: long term learning (strategic)
- Iteration: what should we do now? (tactical)
- Continuous: small experiments
Solving Problems the Clojure Way (Rafal Dittwald)
- Imperative code spreads state, mutation and side effects, which makes larger programs harder to understand and change
- Object oriented programming tries to solve these problems through classes and encapsulation. The preferred thinking model revolves around agents and how they communicate with each other
- While we cannot get rid of state, functional programming uses a set of techniques to avoid state wherever possible. Rafal outlines a few techniques:
- Minimize state
- Derive state from other state (e.g. the current player of a Tic Tac Toe game can be derived based on the board state)
- Use immutable data structures instead of mutation
- Pass lambdas
- Concentrate state into fewer places
- Defer actions (e.g. Elm architecture)
- Minimize state
- Given a graph of components, the typical OO approach is to keep state separated by pushing it down as far as we can. The FP approach would be to put all the state into the root node
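The Tic Tac Toe example of derived state can be sketched as follows. The talk uses Clojure; this is a hypothetical Python rendering with invented names, using a tuple as the immutable board.

```python
EMPTY_BOARD = (None,) * 9  # immutable: a tuple cannot be changed in place


def current_player(board: tuple) -> str:
    """Derived state: X moves first, so X is up whenever both have moved equally often."""
    return "X" if board.count("X") == board.count("O") else "O"


def place(board: tuple, index: int) -> tuple:
    """Return a new board with the current player's mark; the old board is untouched."""
    if board[index] is not None:
        raise ValueError("square already taken")
    return board[:index] + (current_player(board),) + board[index + 1:]


after_one = place(EMPTY_BOARD, 4)
after_two = place(after_one, 0)
```

There is no separate `current_player` variable that could get out of sync with the board; the board is the only state, and everything else is computed from it.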
Persistent Data Structures and Managed References (Rich Hickey)
- Pure functions have no notion of time and no effect on the world
- Concurrency breaks variables badly
- Might not be atomic (e.g. long)
- Need locks or volatile keywords
- Identity: An entity we associate with a series of related values over time
- Can be a composite (e.g. the members of a sports team might change, but we still consider it to be the same sports team)
- State: Value of an identity at a time
- Value: An immutable structure (e.g. numbers, strings, …)
- Overall philosophy
- Things don't change in place
- See time as a dimension
- The future is a function of the past (and doesn't change it)
- Co-located entities can observe each other without cooperation
End to end functional tests that can run in milliseconds (Nat Pryce)
- They applied the hexagonal architecture model in combination with "screenplays"
- Tests can run in different scenarios (in memory, using a Browser with or without JS, REST calls, …)
- They put all interactions (e.g. steps a user takes to update his mail address) behind an interface, so that these "use cases" don't know anything about a scenario. This technique allows the team to change an N*M mapping to an N+M mapping
- Gives great feedback about the actual state of the system. Such an approach can find problems in your CDN configuration or your caching policies
- Makes the overall system more observable
- To test/maintain a system we need to
- Know what the system is doing
- Know when it has stopped doing it
- Know when the system has failed
- Explain what has gone wrong
- Restore the system to a good state
Design Microservice Architectures the Right Way (Michael Bryzek)
- Describe APIs/Events/Databases (e.g. by using JSON) and invest in tooling
- Create custom linters to ensure that common naming conventions are used
- Use code generation to automate API creation using CI/CD
- Use code generation to create mocks
- Create databases on the fly
- Each microservice owns its own database. Other services use APIs + Events
- Event principles:
- Producers guarantee at least once delivery
- Consumers implement idempotency
- Design schema first for all APIs and Events
- Consume Events (not APIs) by default
- Invest in automation
- Deployment, code generation, dependency management
- Enable teams to write amazing and simple tests
- Drives quality, streamlines maintenance, enables continuous delivery
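The pairing of at-least-once delivery with idempotent consumers can be sketched as follows. The event shape (`event_id`, `sku`, `quantity`) and the `StockConsumer` class are assumptions for illustration; a real consumer would persist the processed IDs together with the state change.

```python
class StockConsumer:
    """Applies 'stock received' events; safe to call more than once per event."""

    def __init__(self):
        self.stock = {}
        self.processed_event_ids = set()

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.processed_event_ids:
            return False
        sku = event["sku"]
        self.stock[sku] = self.stock.get(sku, 0) + event["quantity"]
        self.processed_event_ids.add(event["event_id"])
        return True


consumer = StockConsumer()
event = {"event_id": "evt-1", "sku": "widget", "quantity": 5}

# At-least-once delivery: the producer may send the same event several times.
results = [consumer.handle(event) for _ in range(3)]
```

The duplicate deliveries are recognized by their ID and skipped, so the stock count reflects the event exactly once.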
Entity Component Systems and You: They're Not Just For Game Developers (Paris Buttfield-Addison, Mars Geldard, Tim Nugent)
- A paradigm/architecture which is commonly used in the game industry
- ECS separate data and logic
- Entities have IDs. They are similar to primary keys and are used to identify everything. Examples: camera, tree, player, enemy, particle
- Components have data. Components are used as an alternative to hierarchies. So ECS favor composition over inheritance. Examples: Position component, Velocity component, Damage component
- Systems have logic. These systems are often chained together and can be compared to functional programming. Example: Update position of every player, determine hits, calculate damage, render
- ECS are often combined with data-oriented design to improve performance by reducing cache misses. These designs can be compared to an in-memory database
- Performance (data oriented design, parallelism)
- No hierarchy
- Have similar advantages as microservices and functional programming
- More code upfront
- Hard to keep everything in your head
- No clear starting point
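The split into entities (IDs), components (data) and systems (logic) described above could look roughly like this. It is a deliberately small sketch and not taken from the talk: `World`, `query` and `movement_system` are made-up names, and the dictionary-based storage ignores the cache-friendly layouts that data-oriented designs use.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Position:
    x: float
    y: float


@dataclass
class Velocity:
    dx: float
    dy: float


class World:
    def __init__(self):
        self._ids = count(1)
        # One table per component type, keyed by entity ID.
        self.components = {}

    def create_entity(self, *components):
        entity_id = next(self._ids)
        for component in components:
            self.components.setdefault(type(component), {})[entity_id] = component
        return entity_id

    def query(self, *types):
        """Yields (entity_id, components...) for entities that have all given types."""
        tables = [self.components.get(t, {}) for t in types]
        for entity_id in sorted(set.intersection(*(set(t) for t in tables))):
            yield (entity_id, *(t[entity_id] for t in tables))


def movement_system(world, dt):
    # Logic lives in the system; it only touches entities that have both components.
    for _, position, velocity in world.query(Position, Velocity):
        position.x += velocity.dx * dt
        position.y += velocity.dy * dt


world = World()
player = world.create_entity(Position(0.0, 0.0), Velocity(1.0, 2.0))
tree = world.create_entity(Position(5.0, 5.0))  # no Velocity, so it never moves
movement_system(world, dt=0.5)
print(world.components[Position][player])
```

The tree is skipped by the movement system simply because it has no Velocity component. No class hierarchy is needed to express "things that can move".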
Programming Is The Easy Part (J. B. Rainsberger)
A lot of high level software design principles boil down to a linear combination of "remove duplication" and "improve names".
Modern SQL: A lot has changed since SQL-92 (Markus Winand)
SQL has changed a lot, though most developers only know the 1992 standard
- WITH clause: create "private" views to make a query more readable
- WITH RECURSIVE: is an implementation of loops in SQL. It can be used to walk hierarchies
- GROUPING SETS: use several GROUP BY statements at the same time
- FILTER: Adds WHERE expressions to aggregates
- OVER and PARTITION BY: Aggregates without GROUP BY. Can be used to implement features such as row-based balancing
- FETCH FIRST: also known as LIMIT
- OFFSET: gives the remaining data when using FETCH FIRST, but there are traps. Don't use it
- OVER: window functions
- System Versioning: Can be used to show tables at a given time. Adds audit features to destructive changes such as INSERT, UPDATE or DELETE
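Two of the features listed above, WITH RECURSIVE and OVER with PARTITION BY, can be tried out with Python's built-in sqlite3 module. The tables and rows are invented for this sketch, and the window function assumes that the bundled SQLite is version 3.25 or newer. The running balance per account is one possible reading of the "aggregates without GROUP BY" note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employee VALUES (1, 'Ada', NULL), (2, 'Bob', 1), (3, 'Cy', 2);

    CREATE TABLE booking (id INTEGER PRIMARY KEY, account TEXT, amount INTEGER);
    INSERT INTO booking VALUES (1, 'a', 100), (2, 'a', -30), (3, 'b', 50), (4, 'a', 20);
""")

# WITH RECURSIVE: walk the reporting hierarchy downwards from the root.
hierarchy = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employee WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, chain.depth + 1
        FROM employee AS e JOIN chain ON e.manager_id = chain.id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()

# OVER + PARTITION BY: a running balance per account without GROUP BY.
balances = conn.execute("""
    SELECT account, amount,
           SUM(amount) OVER (PARTITION BY account ORDER BY id) AS balance
    FROM booking
    ORDER BY account, id
""").fetchall()

print(hierarchy)
print(balances)
```

Unlike GROUP BY, the window function keeps every booking row and adds the aggregate as an extra column.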
Testing as an equal 1st class citizen to coding (Jon Jagger)
- The Equilibrium law: stable systems tend to oppose their own proper function
- All changes can be understood as the effort to maintain some constancy, and all constancy as maintained through change
- You will not increase the speed of your car if you think that your brakes are unreliable
- Tests act like brakes when developing software
- Are you confident enough to delete "dead" code?
Technical Leadership and Glue Work (Tanya Reilly)
- Glue work
- is work that makes the whole team better
- is expected when you are senior
- and risky when you are not
- (people might not be rewarded for it)
- Women tend to volunteer more often to do unpromotable work than men
- Men also volunteer less because they know that women will step in if no one volunteers
- What do you want to get better at?
- The vast majority of our learning happens at our job
The Only Unbreakable Law (Casey Muratori)
- Conway's law states that a piece of software tends to reflect a company's communication structures (its org chart)
- The intended title of this talk should have been Conway's nightmare
- What Conway did not anticipate: A piece of software does not only reflect the current org chart, but it most likely also contains fragments of previous org charts
- Windows contains at least four different volume controls which were all created in different versions of Windows
- We create organizations and groups to tackle problems that we cannot solve alone. They are in a way a necessity, but they are not inherently "good"
- Developers tend to do the same thing when they are writing code: They create class hierarchies so they can divide a problem which they cannot keep in their head. Just like org charts, they might be too complicated or inefficient
Improving eBay's Development Velocity (Randy Shoup and Mark Weinberg)
- Randy and Mark used the Accelerate book
- Used DORA metrics to track progress
- Teams delivered >2x the features
- Focused on removing bottlenecks
- How could we deliver once per day? - Here is a list of 20 things that are holding us back
- CEO: "The most important initiative at the company. Go faster!"
Uncoupling (Michael Nygard)
- Determines degree of freedom
- Enables some movement
- Inhibits other movement
- Connects effects
- Is necessary and inescapable
Kinds of Coupling
- Operational: Consumer cannot run without the provider
- Development: Changes in producer and consumer must be coordinated
- Semantic: Change together because of shared concepts
- Functional: Change together because of shared responsibility
- Incidental: Change together for no good reason
- Is inversely proportional to the number of interfaces
- Is inversely proportional to the number of data types
Make Impacts Not Software (Gojko Adzic)
- Typical software road maps are better described as tunnels, since these "maps" typically only contain a single road/approach
- A real road map contains several different ways to reach a specific goal
- Before the invention of GPS a long trip involved a lot of upfront planning
- A GPS eliminates this upfront planning by recalculating potential routes on the fly (e.g. in case of heavy traffic or an accident)
- Shipping small increments is the equivalent of a GPS recalculation. We can use fast feedback to decide how to change our route (by using a road map)
- Software projects typically do not have a specific destination. Most of the time the "real" destination arises while we are developing something new
- People measure what's easy, not what is important
- Story points, time estimates or bug counts are negative metrics. They tell you when something is wrong, but they cannot tell you if everything is alright. "Zero bugs" could mean "great quality" but it could also mean "no or poor testing" or "nothing new was delivered". In other words: absence of evidence is not evidence of absence
Protect Yourself Against Supply Chain Attacks (Rob Bos)
- Libraries used by your application and tooling used to build your application
- Supply Chain Confusion (typo squatting, namespace shadowing, configuration files, pipeline attacks, pipeline artifact attacks)
- Typo squatting: a malicious copy of a well known package is published using a slightly different name (e.g. …)
- Some package managers offer a namespace feature (e.g. @azure/some-package instead of just using …)
- You not only want to know which packages (and their versions) you are using. You also want to know where you got these packages from
- Protect yourself using software composition analysis (SAST or DAST - Static/Dynamic Application Security Testing)
- Package manager scanners: WhiteSource, BlackDuck, GitHub Dependabot, snyk.io
- You can use CVE databases to check that your packages do not contain known issues
- We want to find issues as fast as possible. In the best case we find an issue before we commit code or run a CI build
- OWASP Software Component Verification Standard (SCVS)
- Supply Chain Levels for Software Artifacts (SLSA)
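A very rough way to spot typo squatting candidates, as described above, is to compare dependency names against a list of packages you actually intend to use. The sketch below uses `difflib.SequenceMatcher` from the standard library. The allow-list, the threshold and the function name `suspicious_names` are assumptions made for this example, and this is not how the scanners listed above work.

```python
from difflib import SequenceMatcher

# Hypothetical allow-list of packages the team actually intends to use.
KNOWN_PACKAGES = {"requests", "numpy", "cryptography"}


def suspicious_names(dependencies, known=KNOWN_PACKAGES, threshold=0.85):
    """Returns (dependency, lookalike) pairs for near-miss package names."""
    findings = []
    for name in dependencies:
        if name in known:
            continue  # exact match with a known package, nothing to report
        for candidate in sorted(known):
            # A high similarity to a known name hints at a possible typo squat.
            if SequenceMatcher(None, name, candidate).ratio() >= threshold:
                findings.append((name, candidate))
    return findings


print(suspicious_names(["requests", "reqeusts", "crytography", "flask"]))
```

Unknown but dissimilar names such as "flask" are not reported here. Such a check can only flag candidates for a manual review. It does not tell you whether a package is actually malicious.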
Monitoring Is Not Observability (Baron Schwartz)
- Observability: an attribute of a system
- Instrumentation: measurement points
- Telemetry: the measurements themselves
- Analytics: turning telemetry into answers
- Monitoring: checking/evaluating system state
- Events, Logs, Metrics, Traces
- It's all derived from events
- Kinds of telemetry
- USE: Utilization, saturation, error
- RED: Requests, errors, duration
- SRE Book: Latency, traffic, errors and saturation
- Queuing theory
- Little's law
- Universal scalability law
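The two laws mentioned above are short enough to write down. Little's law relates the average number of items in a system to the arrival rate and the average time spent in the system (L = λ · W). The universal scalability law models relative capacity as C(N) = N / (1 + α(N - 1) + βN(N - 1)). The function names and the coefficient values below are made up for illustration.

```python
def littles_law_items_in_system(arrival_rate, avg_time_in_system):
    """Little's law: L = lambda * W."""
    return arrival_rate * avg_time_in_system


def usl_relative_capacity(n, contention, coherency):
    """Universal scalability law: C(N) = N / (1 + a*(N-1) + b*N*(N-1))."""
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


# 200 requests/s that each spend 0.05 s in the system: about 10 requests in flight.
print(littles_law_items_in_system(200, 0.05))

# Hypothetical coefficients: capacity peaks and then declines as nodes are added.
for nodes in (1, 8, 32, 128):
    print(nodes, round(usl_relative_capacity(nodes, 0.05, 0.001), 2))
```

With these coefficients, adding nodes beyond a certain point reduces capacity, because the coherency term grows quadratically.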
Working at the Center of the Cyclone (Dr. Richard Cook)
- Complexity is change
- It's not surprising that your system sometimes fails. What is surprising is that it ever works at all
- You build systems differently when you expect them to fail
- Failure is normal. Failed state is the normal state
- You need to build an organization that is able to recover from failure
- People are part of "the system"
- You never see "the system", you only see a representation (what you see on your screen)
- An incident is something that occurs in the mind of people who read representations of a system
- No mental model is "the system"
- Ordinary firms experience one to five acknowledged events per day
- As the complexity of a system increases, the accuracy of any agent's model of that system decreases
- Rollbacks do not keep you safe
- You need to consider "the system" and "the organization" (which is part of the system) to be successful
- Incidents are bits of wisdom. They show you where your mental model differs from "the system"
Resilience In Complex Adaptive Systems (Dr. Richard Cook)
- Rasmussen's system model
- Economic failure boundary
- Accident boundary
- Unacceptable workload boundary
- The operating point tends to move towards the accident boundary
- If you get people together for a meeting about how important some topic is, you know you have failed
- We introduce a margin which should keep us from reaching the accident boundary. This also applies to speed limits or telling your kids "no, the stove is hot!". The problem is that we don't really know where the accident boundary is
- Normalization of deviance: Crossing over the margin line over and over without a problem makes us wonder what the big deal is. Is this margin too conservative? We are "flirting" with the margin
- Resilience: monitoring, reacting, anticipating and learning activities
How Complex Systems Fail (Dr. Richard Cook)
- We have the "as imagined" and the "as found" world. These are pretty different worlds!
- We design for reliability
- stiff boundaries, layers, formalism
- defense in depth
- interference protection
- We want resilience
- withstand transients
- recover swiftly and smoothly from failures
- prioritize to serve high level goals
- recognize and respond to abnormal situations
- adapt to change
- The time between maintenance is zero. Continuous maintenance should be part of the design
- Reveal the actual controls to your operators so that they can help you in case of accidents. Developers tend to design systems that make it impossible for people to do things. We are trying to protect systems from people
- Heavy machines have actual markers that show where you can lift them, since the manufacturers know that you will move them. We should also consider similar scenarios when dealing with software
- Support mental simulation by giving operators insight into the system
- Black boxes (hiding all details behind layers of abstractions) are a big mistake. We have to know the inside of a black box to reason about it
- Resilience agenda:
- Operators are competent to hold the keys to the systems we build
- Make resilience engineering the first priority of design for next gen systems
- Commit resources to discovering, understanding and supporting resilience through the system life-cycle
Sleeping with the enemy (Gojko Adzic)
- Manual testing is a bottleneck
- Let developers watch testers so that they can build understanding and trust
- A software architect is somebody who writes very small parts of code for critical systems. Most of his time is spent on mentoring and helping others do their job
- The role of a tester should be similar to the role of a software architect, so let's turn testers into "test architects"
- This approach inverts the flow. Developers no longer push code to testers
- "It makes much more sense to get the programmers involved to automate the tests while testers come up with the right test cases to automate."
- "I hate story points! Story points are useless! Story points measure effort, they are so easy to cheat, story points don't measure outcome. So, what we need to look at is: what is the outcome? how do we measure the outcomes? And then, that measures the productivity because that is what really productivity is. I don't care about lines of code, tests cases produced…What is the outcome?"
Diagrams as Code 2.0 (Simon Brown)
- Simon is the author of https://c4model.com/
- The C4 model describes a set of abstractions which can be used to create architecture diagrams that behave similarly to Google Maps, where you can zoom in and out of a map to change the amount of detail you see. A legend is used to explain notation
- Diagrams as code 1.0 is a nice way to create version controlled diagrams
- Diagrams as code 2.0 describes an overall model of your architecture which can then be used to create one or more views (diagrams as code 1.0)
- Simon has created open source tooling to describe an architecture model using the Structurizr DSL
Software Architecture, Team Topologies and Complexity Science (James Lewis)
- The book "Team Topologies" outlines the four fundamental teams you need to build software fast:
- Stream-aligned teams
- Enabling teams
- Complicated subsystem teams
- Platform teams
- Mice, humans and elephants have roughly the same number of heartbeats in their lifetime. They also have the same blood pressure. The bigger a mammal, the slower it lives
- Complex adaptive systems (mammals, cities or companies) show sub-linear scaling: doubling one factor (e.g. size) does not double other factors (e.g. calorie intake, cost of building streets, revenue)
- Hierarchical fractal networks scale following a power law with an exponent of less than one
- Queues create back-pressure. Putting a queue into an information flow pauses the flow
- Larger organizations spend less of their revenue on R&D
- Cities show more than one type of scaling:
- Super-linear: innovation, wages, number of professionals, crime, disease, pollution (social network)
- Sub-linear: road length, number of petrol stations and restaurants, water pipes, electricity cables (hierarchical network)
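The difference between sub-linear and super-linear scaling becomes tangible with a bit of arithmetic. A power law has the form Y = Y0 · N^β, so doubling N multiplies Y by 2^β. The exponents below are illustrative assumptions and not figures quoted from the talk.

```python
def scaled(base_value, size_factor, exponent):
    """Power law: Y = Y0 * size_factor ** exponent."""
    return base_value * size_factor ** exponent


# Exponents are illustrative assumptions, not figures from the talk.
SUB_LINEAR = 0.85    # e.g. infrastructure such as road length
SUPER_LINEAR = 1.15  # e.g. social outputs such as wages

for label, exponent in (
    ("sub-linear", SUB_LINEAR),
    ("linear", 1.0),
    ("super-linear", SUPER_LINEAR),
):
    # Doubling the size multiplies the quantity by 2 ** exponent.
    print(label, round(scaled(1.0, 2, exponent), 2))
```

An exponent below one means that doubling the size less than doubles the quantity, while an exponent above one more than doubles it.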
Automation Is Hard & We Are Doing It Wrong (Johan Abildskov)
- What is DevOps? One Definition: Culture, Automation, Lean, Measuring, Sharing
- Automation is not a luxury. It's a permission to play
- "I don't want to buy software from people who are wasting their time"
- "But our customers won't pay for automation" - well, they most likely don't want to pay for your dailies, retrospectives or coffee breaks either
- Minimize the cost of adding one more engineer
- Maximize the value of adding one more engineer
- Why digitalization will kill your company too
- Limited software skills in senior leadership
- Ambidexterity: solve today's challenges while preparing for future needs
- Leaders believe that digitalization is an R&D problem
- Justify their lack of initiative by referring to the lack of desire for change from their most valuable customers
- Automation is not complex. An excavator is an obvious upgrade to a shovel
- Automating simple things is simple. Automating complex things might be impossible (without losing your sanity)
- Industry and technology stack don't matter. Architecture does (Nicole Forsgren, PhD)
- Automation is software. We should treat it as such!
- Use version control
- Create documentation
- Have tests
- Monitor your automation. Use circuit breakers
- Does my automation do something silly?
- Use overrides but add checks to find stale overrides
- Idempotency is your friend
- Jevons Paradox: Increased efficiency != reduced consumption
- Keep in mind that an increase in automation can increase manual work (in other areas)
- Automation is way more than cost-down
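The advice above to guard automation with a circuit breaker can be sketched like this. It is a simplified in-process version with hypothetical names (`CircuitBreaker`, `CircuitOpenError`, `flaky_cleanup`) and is not taken from the talk. After a number of consecutive failures it stops calling the action until a cool-down period has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    pass


class CircuitBreaker:
    """Stops calling a failing action until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, action, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("automation paused after repeated failures")
            # Cool-down is over: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = action(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def flaky_cleanup():
    raise RuntimeError("disk not reachable")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    try:
        breaker.call(flaky_cleanup)
    except CircuitOpenError as error:
        print("skipped:", error)
    except RuntimeError as error:
        print("failed:", error)
```

The first two calls fail and open the circuit, so the third call is skipped without running the action. Passing the clock as a parameter keeps the breaker testable without sleeping.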
Engineering Documentation (Lorna Jane Mitchell)
Kinder, es tut mir unendlich leid (Martin Leyrer)
"We build our computer (systems) the way we build our cities: Over time, without a plan, on top of ruins" - Ellen Ullman
An Introduction to Residuality Theory (Barry O'Reilly)
- How can we apply complexity theory to software development?
- Creating software can be boiled down to a two step algorithm:
- a random simulation of our environments, followed by
- an NKP analysis
- N: the number of components
- K: the number of connections between components. More connections lead to more chaos
- P: a bias between connections. A higher bias reduces chaos
- Stressors can be categorized using attractors. We often design a system for a single attractor, which is a mistake
- A good start: "What if a giant laser lizard burns down our city? What's your residue?"
- A matrix which maps components and stressors can help you to identify non-functional requirements. It shows you hidden coupling and weak spots
- The presenter also shows a concrete example using the architecture of a business around charging electric cars
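The matrix of components and stressors mentioned above can be approximated with a few lines of code. This is only one possible reading of the idea and not the presenter's method. The stressors, the components and the function names are invented for this sketch. Components that are hit by many stressors are treated as weak spots, and components that are hit by the same stressor are treated as potentially coupled.

```python
from itertools import combinations

# Hypothetical stressors for an EV charging business.
# Each stressor maps to the components it would affect (one row of the matrix).
STRESSORS = {
    "payment provider outage": {"billing", "charging session"},
    "power grid failure": {"charging session", "station firmware"},
    "new tax regulation": {"billing", "reporting"},
}


def stressor_count(matrix):
    """Counts for each component how many stressors hit it (weak spots)."""
    counts = {}
    for affected in matrix.values():
        for component in affected:
            counts[component] = counts.get(component, 0) + 1
    return counts


def hidden_coupling(matrix):
    """Returns component pairs that are hit together by at least one stressor."""
    pairs = set()
    for affected in matrix.values():
        pairs.update(combinations(sorted(affected), 2))
    return sorted(pairs)


print(stressor_count(STRESSORS))
print(hidden_coupling(STRESSORS))
```

In this made-up data set, billing and the charging session are each hit by two stressors, which would make them the first candidates for a closer look.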
The Case for Technical Excellence (Kevlin Henney)
- Some organizations seem to believe that feature work and technical work are mutually exclusive and that they can choose to focus on either. Features are made of software. This sounds obvious and trivial, but it highlights that technical and feature work are deeply linked. There is no choice
- The agile manifesto mentions that technical excellence enhances agility
- Architecture represents significant design decisions, where significant is measured by cost of change. You are dealing with software, so you can change whatever you want. The interesting question is: how fast/cheaply can you make a change?
- Technical neglect is the cause of technical debt
A New Era for Database Design with TigerBeetle (Joran Greef)
- TigerBeetle is a new type of database to track financial transactions
- Databases still have room for innovation and improvement
- Buffered I/O is broken, fsync has subtle issues which can cause loss of data
- Databases that rely on fsync are trying to change fundamental design decisions, which is hard work for a project with a long history
- TigerBeetle uses two write-ahead logs
- Storage faults force us to reconsider database design
- We need to move beyond a crash safety (power loss) model
- Greg Young
- Michael Feathers
- Bryan Cantrill
- Mark Seemann
- Jimmy Bogard
- Sam Newman
- Chad Fowler
- Kent Beck
- Casey Rosenthal
- Allen Holub
- John Hughes
- Andrew Kelley
- Dave Farley
- David Tielke (German)