Why some DVLA digital services don't work at night
(dafyddvaughan.uk)116 points by edent 5 days ago | 117 comments
simonbarker87 a day ago | root | parent | next |
Yes I’d imagine the reason it still hasn’t been fixed after nearly a decade is management/politics etc. But it taking more than just 6 months will be technical. As a result it’s a job that falls into the area of being canned because it’s taking too long even though no one said it would be quick.
mjevans a day ago | root | parent |
There might be legal / compliance reasons. It can be incredibly difficult to replace a validated system that is known (or already accepted even if it's technically incorrect) to implement lawmaker dictated behavior.
Otherwise, I think a new approach might be to ignore the specifics of the old system, implement a new system, and a separate translation layer that can run on an export of the old system (or the old system brought back online, but read only after the overnight maintenance) and completely cut over during an otherwise holiday weekend.
mgsouth a day ago | root | parent |
> I think a new approach might be to ignore the specifics of the old system, implement a new system
It doesn't work like that. When you're revamping large, important, fingers-in-everything-and-everybody's-fingers-in-it systems you can't ignore anything. A (presumably) hypothetical example is sorting names. Simple, right? You just plop an ORDER-BY in the SQL, or call a library function. Except for a few niggling details:
1. This is an old IBM COBOL system. That means EBCDIC, not UTF or even ASCII.
1.A Fine, we'll mass-convert all the old data from EBCDIC to UTF. Done.
1.A.1 Which EBCDIC character set? There are multiple variants. Often based on nationality. Which ones are in use? Can you depend on all records in a dataset using the same one (hint: no.) Can you depend on all fields in a particular record using the same one? (hint: no.) Can you depend on all records using the same one for a particular field? (hint...) Can you depend on any sane method for figuring out what a particular field in a particular record in a particular dataset is using? Nope nope nope.
1.A.2 Looking at program A, you find it reads data from source B and merges it with source C. Source B, once upon a time, was from a region with lots of French names, and used code page 279 ('94 French). Except for those using 274 (old Belgium). And one really ancient set of data with what appears to be a custom code set only used by two parishes. Program A muddles through well enough to match up names with C, at least well enough for programs D, E, and F.
1.A.3 But it's not good enough for program G (when handling the Wednesday set of batches). G has to cross-reference the broken output from A with H to figure out what's what.
1.B You have now changed the output. It works for D and F, but now E is broken, and all the adhoc, painstakingly hand-crafted workarounds in G are completely clueless.
1.C Oh, and there's consumer J that wasn't properly documented, you don't know exists, and handles renewals for 60-70 year old pensioners who will be very vocal when their licenses are bungled.
2. Speaking of birth years, here's a mishmash of 2-, 4-, and even 3-digit years....
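To make point 1 concrete, here is a minimal Python sketch of why "just sort the names" changes behaviour once data leaves EBCDIC. It assumes Python's bundled `cp037` and `cp500` codecs as stand-ins for whatever code pages a real dataset uses, and the names are invented for illustration.

```python
# Hypothetical surnames, chosen only to show ordering differences.
names = ["smith", "Smith", "5mith"]

# Unicode/ASCII code point order: digits, then uppercase, then lowercase.
unicode_order = sorted(names)

# EBCDIC byte order: lowercase, then uppercase, then digits.
ebcdic_order = sorted(names, key=lambda s: s.encode("cp037"))

# The same stored byte decodes to a different character per code page,
# so guessing the wrong page silently corrupts the field.
raw = b"\x5a"
as_cp037 = raw.decode("cp037")
as_cp500 = raw.decode("cp500")

print(unicode_order)
print(ebcdic_order)
print(repr(as_cp037), repr(as_cp500))
```

Any downstream program that relied on the old collation (a merge step expecting sorted input, say) would see its input order change after conversion, even though every individual record converted correctly.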
mjevans 21 hours ago | root | parent |
Yes, that's why the new system has to be a complete replacement. Part of its spec COULD be to provide backwards interfaces too, in case things can't all be cut over at once, but that would increase the project scope and also tie things down to the old system too.
Part of a full replacement system would be the option to use a _different_ set of rules, which better reflect current desires and are, hopefully, easier to implement.
Yes, the old data would need to be _transcribed_ during its restoration to the new system, and human bureaucratic layers can likely handle issues. Heck, they could do a deferred implementation of the new system where one long weekend the new system's brought up, and any issues that are noticed get worked out as kinks. When there aren't any _noticed_ kinks in those tests, have the results sent out to the stakeholders and solicit feedback on whether there are any inaccuracies. That might take a year or two of renewals and updates and the annual business as they see if the new notices are correct or not.
MadnessASAP a day ago | root | parent | prev | next |
> Having legacy data and systems for a few years is a challenge. Still having them after decades is government.
FTFY
worik a day ago | root | parent | prev |
> Having legacy data... after decades is incompetence
Harsh
How long does a person hold a drivers license?
NVHacker 11 hours ago | root | parent |
That's just data. "Legacy data" was used here to suggest a legacy database/storage system. The reality is that the situation is not due to an insurmountable technical problem but due to a combination of lack of funds / prioritization / motivation / knowledge.
abigail95 a day ago | prev | next |
Something is missing here, why do batch jobs take 13 hours? If this thing was started on an old mainframe why isn't the downtime just 5 minutes at 3:39 AM?
Exactly how much data is getting processed?
Edit: Why does rebuilding take a decade or more? This is not a complex system. It doesn't need to solve any novel engineering challenges to operate efficiently. Article does not give much insight into why this particular task couldn't be fixed in 3 months.
ajnin a day ago | root | parent | next |
The batch jobs don't take 13 hours. They're just scheduled to run some time at night when the old offices used to be closed and the jobs could be run with some expectations regarding data stability over the period. There are probably many jobs scheduled to run at 1AM then 2AM, etc, all depending on the previous one being finished, so there is some large delay to ensure that a job does not start before the previous one is finished.
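A rough sketch of that arithmetic, with entirely made-up job names, runtimes and slot sizes: when each job gets a padded clock-time slot, the overnight window is the sum of the slots, not the sum of the actual runtimes.

```python
from datetime import timedelta

# Hypothetical jobs: (name, typical runtime, padded slot reserved in the schedule).
jobs = [
    ("extract", timedelta(minutes=20), timedelta(hours=1)),
    ("reconcile", timedelta(minutes=45), timedelta(hours=2)),
    ("print_run", timedelta(minutes=30), timedelta(hours=1)),
]

# Fixed slots: every job starts at a set clock time, so the window is the sum of slots.
fixed_window = sum((slot for _, _, slot in jobs), timedelta())

# Chained: each job starts as soon as the previous one ends.
chained_window = sum((runtime for _, runtime, _ in jobs), timedelta())

print("fixed slots:", fixed_window)
print("chained:    ", chained_window)
```

Chaining looks like an easy win on paper, but it assumes something can reliably signal "job finished successfully", which is exactly what the padding was compensating for.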
As to your "not a complex system" remark, when a system is built for 60 years, piling up new rules to implement new legislation and needs over time, you tend to end up with a tangled mess of services all interdependent that are very difficult to replace piece-wise with a new shiny architecturally pure one. This is closer to a distributed monolith than a microservices architecture. In my experience you can't rebuild such a thing "in 3 months". People who believe that are those that don't realize the complexity and the extraordinary amount of specifics, special cases, that are baked into the system, and any attempt to just rebuild from scratch in a few months hits that wall and ends up taking years.
PaulAJ a day ago | root | parent | next |
Anyone who doesn't understand what's so difficult should read this:
https://wiki.c2.com/?WhyIsPayrollHard
It's from a different domain, but it gives you a flavour of the headaches you encounter. These systems always look simple from the outside, but once you get inside you find endless reams of interrelated and arbitrary business rules that have accumulated. There is probably no complete specification (unless you count the accumulated legal, regulatory and procedural history of the DVLA), and the old code will have little or no accurate documentation (if you are lucky there will be comments).
stego-tech a day ago | root | parent |
Basically this. The people running the show would desperately like to make it simpler, but ultimately it’s left overly complicated due to priorities from past leadership well above our paygrade.
The right solution is always to just rip off the bandaid and do it again by hand in a new language or platform, and to eliminate useless complexity while doing so. Unfortunately no leader would ever do this because the Board and/or Shareholders would crucify them for not outsourcing it to McKinsey first and using the fancy-pants automation tool their report recommended.
signal11 20 hours ago | root | parent | next |
There are a few shareholder-friendly patterns to get this done, but it is domain-specific. I’d say it’s more “rip off the bandaid slowly and carefully”.
Eg a common one is to wrap a new no-op service around the old one, and gradually replace parts of the old one (the “strangler fig pattern”).
This is technically great, but it’s also financially great because you aren't spending large sums on a big-bang rewrite. You’re spending relatively small sums on a “pay as you go” basis, something board members and shareholders do like.
But of course this depends on how your systems are set up.
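For anyone who hasn't seen the strangler fig pattern, a minimal sketch in Python follows. Everything here is hypothetical (the operation names, the handlers, the `Facade` class); the point is only the shape: a pass-through wrapper that starts as a no-op and takes over one operation at a time.

```python
from typing import Callable, Dict


def legacy_handle(operation: str, payload: dict) -> str:
    """Stand-in for a call into the old system (hypothetical)."""
    return f"legacy:{operation}"


def new_change_address(payload: dict) -> str:
    """Stand-in for one operation that has been rebuilt (hypothetical)."""
    return "new:change_address"


class Facade:
    """Routes each operation to its new handler if one exists, else to legacy."""

    def __init__(self) -> None:
        self.migrated: Dict[str, Callable[[dict], str]] = {}

    def migrate(self, operation: str, handler: Callable[[dict], str]) -> None:
        self.migrated[operation] = handler

    def handle(self, operation: str, payload: dict) -> str:
        handler = self.migrated.get(operation)
        if handler is None:
            return legacy_handle(operation, payload)
        return handler(payload)


facade = Facade()
before = facade.handle("change_address", {})   # nothing migrated yet
facade.migrate("change_address", new_change_address)
after = facade.handle("change_address", {})    # now served by new code
untouched = facade.handle("renew_licence", {}) # still served by legacy
print(before, after, untouched)
```

Rolling back a bad migration is just removing the entry from `migrated`, which is a large part of why this is easier to sell than a big-bang cutover.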
pwagland a day ago | root | parent | prev |
Well, that, and any organization that has gotten itself into this situation tends to have a very strong risk-aversion principle. Which means they _can't_ approve something like this organisationally since there is simply too much risk embedded, and someone has to accept that.
abigail95 a day ago | root | parent | prev | next |
The code will be spaghettified and hideous. The queries will be nonsense.
That doesn't change the fact that the ultimate goal of the system is to manage drivers licenses.
> In my experience you can't rebuild such a thing "in 3 months".
Me and my team rebuilt the core stack for the central bank of a developing country. In 3 months. The tech started in the 70s just like this. Think bigger.
ajnin a day ago | root | parent | next |
Good for you, it means either your system was sufficiently simple to be fully implemented in 3 months from scratch including all business rules, or you built a new system which left out some amount of rules from the original system without this posing a problem. I don't know much about how central banks work so that might be possible. But not all systems present those characteristics.
skippyboxedhero a day ago | root | parent | prev | next |
For some reason those comments always seem to imply that every business doesn't have these problems too.
Every business has these problems. In most cases, the ones who don't change get swept away. The places that do not change are usually ones that can't go out of business. But every place has systems like this, you have to rebuild them, it isn't fun but there is no choice.
A tiny system like the DVLA is complex, hilarious (this is the same place that has had to reduce service provision because some staff just stopped turning up for weeks after Covid, public-sector productivity in the UK is at the same level as 1997, to just get to the same level as the private sector...which isn't growing productivity very fast...you would need to fire ~2m workers, the total workforce in the UK is 30m btw).
sytringy05 a day ago | root | parent |
I worked for an aircraft parts manufacturer, they closed an entire factory / production site rather than try and upgrade the manufacturing system or move the part production onto the new one they had implemented.
500 people out of work. Tell me again how simple everything is to fix.
robertlagrant 17 hours ago | root | parent |
No-one's saying everything is simple to fix.
mootothemax 18 hours ago | root | parent | prev | next |
> Think bigger.
One of the easier parts of this involves addressing, which in the UK is notoriously easy, reliable, and easy to process - especially the best-in-class Ordnance Survey stuff like AddressBase Premium, right?
A quick trawl of Github will shed some light on it - especially how much of a pain it is to get ABP into a usable state - and this is data that's core and integral to the service, the "are you a real user, a typo, a fraudster, a data supply issue, or getting things wrong in good or bad faith?" kind of business logic.
And it's doubly hard, because the government requires people to update their license when they change address - which often enough involves a new-build property, where the address (let alone UPRN - sometimes even the USRN!) is completely new to you.
Thinking bigger: imagine sitting at your desk during the first couple of weeks on the job, database validation checks running merrily in the background while you're staring at a screen. There's a mild frown forming on your face. You'd been scrolling over a list of rejected records in front of you, largely looking good - _how did they miss THAT fraud_ you'd briefly chuckled to yourself - but _this_ one...
It's a valid business entity, trading from the valid address, and you've hand-checked both _and_ got a junior who lives nearby to send you a photo of it, and, well, the wit running the business has decided to trade under the name _FUCKOFFEE_, and... that's... just going to have to be someone else's problem, you shrug.
(to be clear: the hard part of the DVLA project is _not_ implementing the coding, database, and systems design work)
robertlagrant 16 hours ago | root | parent |
You've sort of identified how to do it: break it up into problems.
Addresses are hard? Use https://postcodes.io or make your own - that's a project in its own right.
Separating out trading names from registered names needs to be an API from Companies House, or an internal service that API-ifies Companies House data.
Fraud detection? That needs to sit somewhere - let's break out all the fraud detection into a separate system that can talk to the other systems, and have it running continuously over the data. It'll need people to update fraud queries and also to make sure the other systems' data stays integrated with it.
Finally you need something on top that orchestrates the services and exposes them via a gov.uk website, and copes with things like "I don't have my address yet; can I use What3Words instead?" and another one with a UI and lots of RBAC and approvals for DVLA users to do lookups and internal admin.
mootothemax 13 hours ago | root | parent |
Heh, you’ve fallen into the exact trap I was trying to expose, which is why I chose addresses as an illustration point :)
The first step with anything address-y is to try and nail down exactly what an address is in the project context. Quick example - property shells, a building at 1-2 Street Name that contains a bunch of flats, but doesn’t itself have residents or its own postal delivery point. They’re mega useful for an address autocomplete (sadly, the vast majority of geocoders are trash for the uk’s addresses), are they sth people should be able to use (without a flat number etc.) for their driving license? Probably not. Commercial venues? Maybe, what about pubs? Ok, so dual-use maybe, but man this stuff gets painful in a hurry.
Next up - historic addresses and how’re you going to link ‘em all together. It’s nasty, edge-case-strewn work - and for the most part, unavoidably so. It’s why people get their backs up when someone dismisses it out of turn, cos if they have worked with it in the past, they’d qualify anything they wrote with: * presuming a well-formed address source + pipeline.
Edit: for what it’s worth, companies house only lists corporate entities and partnerships as defined in whatever act of parliament. Self employed etc can call themselves whatever - and do! - and the only record of it can be as vague as a nondescript line from the VOA.
robertlagrant 10 hours ago | root | parent |
I like this trap. Why would you need historic addresses for this service? In my mind the main reason for the DVLA knowing your address at all is so they know where to post a fine and a new driving licence/car documentation to. Why do they need historic addresses in their core system?
daveoc64 2 hours ago | root | parent | next |
>Why would you need historic addresses for this service?
The police (and other authorities like councils) who issue penalties, need to know who was the registered keeper of a vehicle on the date of an alleged offence.
That's where the DVLA's Keeper At Date Of Event (KADOE) system comes in.
It's currently being transitioned to a modern API:
https://developer-portal.driver-vehicle-licensing.api.gov.uk...
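To illustrate why "keeper at date of event" forces you to keep history, here is a small Python sketch of a point-in-time lookup. It is not the KADOE API; the data layout, the keeper names and the `keeper_on` helper are all assumptions made up for the example.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical keeper history for one vehicle, sorted by the date each keeper took over.
history = [
    (date(2015, 3, 1), "Keeper A"),
    (date(2019, 7, 15), "Keeper B"),
    (date(2023, 1, 10), "Keeper C"),
]


def keeper_on(event_date: date) -> Optional[str]:
    """Return whoever was the keeper on event_date, or None if before any record."""
    start_dates = [start for start, _ in history]
    # Index of the last keeper whose start date is on or before the event.
    index = bisect_right(start_dates, event_date) - 1
    if index < 0:
        return None
    return history[index][1]


print(keeper_on(date(2020, 1, 1)))
print(keeper_on(date(2010, 1, 1)))
```

A system that only stores the current keeper and address cannot answer this query at all, which is why the history has to live somewhere in the core data.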
multjoy 8 hours ago | root | parent | prev | next |
Now you've got two addresses to handle - the vehicle's keeper and the licence holder.
Further, the DVLA isn't sending correspondence relating to criminal matters, that's coming from the police who use the Police National Computer, into which the driver and vehicle files are fed along with data from the motor insurers bureau.
mootothemax 8 hours ago | root | parent | prev |
I’m happy you’re taking it in the spirit intended :) it’s a trap I frankly despise but that’s cos I’m old and bitter.
The problem being addressed - if you’ll forgive the pun - is that you’re not storing someone’s current address; what you have is their _most recently known to us_ address, which obvs over time can become a problem, least of all if you’re wasting time and money sending undeliverable post. (I have a vague memory of Royal Mail fining bulk delivery users for not pre-screening, not sure if that was a particularly dull dream or not tho).
The thing it’s important to keep in mind is - there is no single nor centrally-held repository of addresses within the UK. I don’t mean about oh mr so-and-so lives at 11 acacia avenue. I mean for just the addresses themselves.
Throw in the mad mixture of Scotland having a separate national statistics agency that’s independent of the ONS, plus Northern Ireland having the same -plus- a separate OS in the form of OSNI, the whole landscape’s set up for pain and failure.
mattmanser a day ago | root | parent | prev |
Yeah, I always raise an eyebrow at attitudes like that too.
I've also reimplemented or gradually replaced several out-of-date systems. Albeit on a smaller scale.
In my experience, when you start picking the programs apart you find 90% of the code is redundant or boilerplate. Much of it isn't even called from anywhere, abandoned code, and can be deleted en masse. A lot of programmers don't clean code up "just in case" and then no-one else deletes it.
They can also often be vastly simplified because programmers back then didn't have the patterns and knowledge to write concisely.
I often find myself simplifying the original code first, which gets rid of 50% of it. Then I can see what the code actually does and rewrite it which gets rid of the other 40%.
On the other hand, many programmers don't have the patience, stubbornness or skill to do this kind of work.
And the ability to get through the major panic you have when you're half way through and wondering if you were mad to even start.
patrickmay a day ago | root | parent |
> And the ability to get through the major panic you have when you're half way through and wondering if you were mad to even start.
I feel seen, thank you.
Reubend a day ago | root | parent | prev |
> In my experience you can't rebuild such a thing "in 3 months". People who believe that are those that don't realize the complexity and the extraordinary amount of specifics, special cases, that are baked into the system, and any attempt to just rebuild from scratch in a few months hits that wall and ends up taking years.
Rebuilding a legacy system doesn't require you to support every single edge case that the older system did. It's okay to start off with some minor limitations and gradually add functionality to account for those edge cases.
Furthermore, you've got a huge advantage when remaking something: you can see all the edge cases from the start, and make an ideal design for that, rather than bolting on things as you go (which is done in the case of many of these legacy systems, where functionality was added over time with dirty code in lieu of refactoring).
jarofgreen a day ago | root | parent | next |
> Rebuilding a legacy system doesn't require you to support every single edge case that the older system did.
Depends on context.
This isn't some social media fun site where you can live with some rough edges; in this context "edge case" may be someone with an health condition who is still entitled to a drivers license; or it could be someone who normally could get one but due to a health condition really shouldn't be allowed one!
Analemma_ a day ago | root | parent | prev |
This generally isn’t true in the case of government systems. For the most part they are performing tasks that are required by law, and it is not acceptable to stop some of them, even temporarily. If you’re lucky you can run the old and new systems side-by-side while the 100% feature migration occurs, but that isn’t always feasible.
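A side-by-side run can be sketched roughly as below: feed the same inputs to both implementations and log every disagreement before the new one is allowed to answer for real. The two fee functions and their numbers are invented purely for illustration; they are not actual DVLA rules.

```python
from typing import List, Tuple


def legacy_fee(age: int) -> int:
    # Hypothetical legacy rule with a special case buried in it.
    return 0 if age >= 70 else 14


def new_fee(age: int) -> int:
    # Hypothetical rewrite that missed the special case.
    return 14


def parallel_run(ages: List[int]) -> List[Tuple[int, int, int]]:
    """Run both implementations and collect (input, old, new) for every mismatch."""
    mismatches = []
    for age in ages:
        old, new = legacy_fee(age), new_fee(age)
        if old != new:
            mismatches.append((age, old, new))
    return mismatches


mismatches = parallel_run([25, 69, 70, 83])
print(mismatches)
```

The catch is that this only finds the edge cases your inputs happen to exercise, so rarely used paths can stay hidden for a long time.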
pixl97 a day ago | root | parent |
Ya, it's funny looking at all these 'business' programmers who, if the application doesn't work, can just lose the customer/money to a competitor. In regulated stuff you have to serve everyone. Much worse, if your systems don't work there are potential consequences where people die and/or there are riots in the street.
robertlagrant 16 hours ago | root | parent |
It's the opposite. Most government systems have paper-based alternatives, which they will happily tell you to use instead if their system breaks. This exact article's title should've given you a clue here. Imagine if Netflix turned off at night. That would be totally unacceptable for a business, because you're their boss, but it's fine for government.
jdietrich a day ago | root | parent | prev | next |
Per their own data, the DVLA are responsible for the records of 52 million drivers and 46 million vehicles. Those records are immensely complex, because they reflect decades of accumulated legislation, regulation and practice. Every edge case has an edge case.
There's someone, somewhere in the bowels of the DVLA who understands the rules for drivers with visual field defects who use a bioptic device. There's someone who knows which date code applies to a vehicle that has been built with a brand new kit chassis but an old engine and drive train. There's someone who understands the special rates of tax that apply to goods vehicles that are solely used by showmen, or are based on certain offshore islands. God help any outsider who has to condense all of that institutional knowledge into a working piece of software.
Government does not have a good track record of ground-up refactors of complex IT systems. The British government in particular does not have a good track record. Considering all that, the fact that most interactions with DVLA can be done entirely online is borderline miraculous.
https://assets.publishing.service.gov.uk/media/675ad406fd753...
IshKebab a day ago | root | parent |
I would be really really surprised if the database actually encodes all of these edge cases you are thinking about in a structured way. In other words, I really doubt there's code like `if engine_age > drivetrain_age` or whatever.
pixl97 a day ago | root | parent |
The point is until you start ripping the application apart you have no idea what the internal logic looks like.
When you look you can find terrors that will haunt you in the night where some ancient limit, say number of columns in a database end up holding multiple structures that are getting if/then'd later in the application.
I would completely and totally believe there is code just like that.
MichaelZuo 4 hours ago | root | parent |
Why can’t you draw a flowchart/diagram of the logic?
Sure it may be so complex that a complete diagram would be dozens of feet across and tall on a real piece of paper.
But with appropriate software tools that is not a limitation.
firefoxd a day ago | root | parent | prev | next |
Our systems took 8 hours to back up. Then it grew to 12 hours [0]. The system was a side project by an intern fresh out of college. Over the years, it grew into a crucial piece of software the company relied on. I joined over 10 years later and was able to bring it down to a few minutes.
shermantanktop a day ago | root | parent | prev | next |
It’s funny to me that I would never ask those questions. I’ve specialized in legacy rehab projects (among other things) and there seems to be no upper bound on how bad things can be or how many annoying reasons there are for why we can’t “just fix it.” Those “just” questions—which I ask too—end up being hopelessly naive. The answers will crush your soul if you let them, so you can’t let them, and you should always assume things are worse than you think.
TFA is spot on - the way to make progress is to cut problems up and deliver value. The unfortunate consequence is that badness gets more and more concentrated into the systems that nobody can touch, sort of like the evolution of a star into an eventual black hole.
abigail95 a day ago | root | parent |
I made a lot of money moving mid size enterprises from legacy ERP systems to custom in house ones.
The DVLA dataset and the computations that are run on it can be studied and replicated in 3 months by a competent team. From there it can be improved.
There is no way that this system requires 13 hours of downtime. If it required two hours - even if the code was generated through automation it can be reverse engineered and optimized.
It is absolute rubbish that this thing is still unavailable outside of 8am-7pm.
I maintain my position that it could be replaced in 3 months.
I got my start in this business when I was in university and they told us our online learning software was going offline for 3 days for an upgrade. Those are the gatekeepers and low achievers we fight against. Think bigger.
pixl97 a day ago | root | parent | next |
Ya I don't think I'd let you within two miles of a system like this.
Replacing legacy stuff always expands in scope far beyond the initial changes.
When you have to come back and add wait() entries in your new program because it spits data back faster than the old program ever could which then causes peripheral devices/drivers to crash which then pulls a dev and testers off something else important for days figuring out what kind of fresh hell is occurring is just status quo for ancient systems.
gunian 19 hours ago | root | parent |
idk much about dev much less legacy / enterprise dev but it seems like an A/B test type situation where you have 90% of the users running the legacy code and the remaining 10 on a new implementation would be feasible any idea why this is the case?
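The 90/10 split described here is usually done with deterministic bucketing, so a given user always lands on the same side. A minimal Python sketch follows, with the `bucket` and `use_new_system` helpers and the 10% default being assumptions for illustration only.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_system(user_id: str, rollout_percent: int = 10) -> bool:
    """True if this user falls inside the rollout percentage."""
    return bucket(user_id) < rollout_percent


# Hypothetical user ids; the share routed to the new system is roughly the rollout percentage.
users = [f"user-{n}" for n in range(1000)]
on_new = sum(use_new_system(u) for u in users)
print(on_new, "of", len(users), "users routed to the new implementation")
```

The routing is the easy part; the hard part is that both sides write to the same records, so the old and new implementations have to agree on the data.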
pixl97 10 hours ago | root | parent |
This is what happens. All that testing with the required stakeholders takes way way more time than you'd expect.
Gets even more fun in .gov where the work can change significantly at particular times of the year.
Had one piece of Windows software required by the State of Texas used at year end like once a year. Seemingly nobody realized Windows updates had stopped it from working until a few weeks before the deadline. I had to set up a box without updates for it to run for my customer. Led to a lot of panic around the state.
monkey_monkey a day ago | root | parent | prev | next |
> The DVLA dataset and the computations that are run on it can be studied and replicated in 3 months by a competent team. From there it can be improved.
Such an HN comment. Made me lol. Think funnier!
arccy a day ago | root | parent | prev |
it's a gov agency, they don't quite pay enough for a motivated competent team....
gunian 19 hours ago | root | parent |
even in this economy where people can't even work with their own SSN?
that_guy_iain a day ago | root | parent | prev | next |
> Edit: Why does rebuilding take a decade or more? This is not a complex system. It doesn't need to solve any novel engineering challenges to operate efficiently. Article does not give much insight into why this particular task couldn't be fixed in 3 months.
You do know the UK government has been cutting all their budgets to the bone for about 10 years? That means everywhere is pretty much understaffed.
And how do you know it's not a complex system? I would think that a system like that would be somewhat complex. It's not just driving licenses but a whole bunch of other things that are handled by the DVLA.
skippyboxedhero a day ago | root | parent | next |
Public-sector employment in the UK is at record highs. Despite apparently cutting inputs, productivity has collapsed to the same level as 1997 in the public sector. It is wildly overstaffed/overfunded by any estimation (and to be clear, there have been no cuts...the cuts in the early 2010s were not particularly significant, around 2-3% of GDP, public spending is as high as it has ever been in the UK, it was significantly lower under Blair...the only time it has reached this level is WW2 and 1975, the financial year the UK govt was bankrupted).
DVLA isn't complex. We live in a world of regulation, rules, and standards. Almost every large business does stuff like this at a global scale. It isn't complex, it just has to be complex so the budget is filled (and Fujitsu can get their contract).
that_guy_iain 18 hours ago | root | parent | next |
This is what we call cherry-picking stats. You pick a single stat but ignore everything else. Your comment seems to imply that the UK government has been lying about austerity for 10 years. While I don't trust the tory party and think they're corrupt with their deals-for-buddies approach, I don't think they outright lie for 10 years.
The policing budget is so bare-bones that the police have literally admitted they will not attend all 999 calls. To make that clear, they have admitted if you call them in an emergency they may not show up. NHS waiting times are sky-high. The number of NHS hospitals and beds is rock bottom. We can dig and dig into various public sectors and see them being terrible because of austerity. Which Labour is kinda being forced to continue due to the effects of covid on the finances of the UK (and Brexit).
youngtaff 19 hours ago | root | parent | prev |
Public sector employment is at a high because we had to hire thousands of staff due to Brexit… probably the most stupid productivity destroying own goal a country has ever committed (so far)
abigail95 a day ago | root | parent | prev | next |
The system may or may not be complex but the data it has to store and transform is not. Because it handles drivers licenses. A function that has been done on pen and paper and filing cabinets.
Study the data, study the operations, reduce complexity.
Since you imply you know more about UK budgets than I do - how much is the DVLA budgeted for IT operations like this and how much more would you give them to expect this problem solved?
I can argue real numbers but vibes about bone dry budgets I cannot.
ellen364 a day ago | root | parent | next |
Are you suggesting that a process once done using pen and paper can't possibly be complicated?
I have no insight into the DVLA, but the idea that no paper process could ever be complicated is really funny. The UK enjoyed/loathed centuries of bureaucracy before computers were invented. At one point getting a divorce required an Act of Parliament specifically naming the unhappy couple! Being restricted to pen and paper hardly inhibited the human ability to create complex systems.
that_guy_iain a day ago | root | parent | prev |
> The system may or may not be complex but the data it has to store and transform is not. Because it handles drivers licenses. A function that has been done on pen and paper and filing cabinets.
It handles more than just driving licenses... The DVLA do more than just driving licenses.
> Since you imply you know more about UK budgets than I do - how much is the DVLA budgeted for IT operations like this and how much more would you give them to expect this problem solved?
It's not budgeted anything for this as far as I know. I believe it's handled by Government Digital Services which handles lots of the digital services for various departments. The budget for all of GDS is about 90 million most of which isn't for .gov.uk. A rewrite of that size I would expect to cost about 50-60 million in total but take several years.
skippyboxedhero a day ago | root | parent |
Any business which took 50-60m to rewrite something that simple wouldn't exist.
kalleboo a day ago | root | parent | next |
I know of a large international retailer everyone here has heard of that had at least 3 aborted attempts to rewrite their 70's VAX-based warehouse management software stack (an internal attempt, an Indian outsource attempt and an SAP attempt). They spent at least that much with zero to show for it. They had all the issues with "too many sales in one day causing batches to not finish" and "but this warehouse wants to open 24/7..." and still couldn't get it rewritten.
gunian 19 hours ago | root | parent |
couldn't get it rewritten as in the logic was impossible or more like with a specific budget, random set of KPIs etc?
idk for some reason I feel like projects like this should be allowed more flexibility by shareholders
that_guy_iain 19 hours ago | root | parent | prev | next |
That simple? You probably couldn't even list all the things the DVLA is responsible for. I also doubt you could list all the laws and policies relating to driver licenses.
And you would be shocked at how expensive big rewrites actually are.
nsteel a day ago | root | parent | prev |
What? Like Fujitsu? Still going strong bidding on gov contracts and providing garbage.
nxobject a day ago | root | parent | prev |
I think it's "local councils, yes cut to the bone; Whitehall and its satellites, no". Similarly, Whitehall ended the Thatcher and Major ministries with more regulators for privatized industries, more centralized decision-making, and a larger bureaucracy than ever.
youngtaff 19 hours ago | root | parent |
Central government departments have been cut too… the staff needed for the Brexit disaster disguise this
knallfrosch a day ago | root | parent | prev |
The problem is that the full set of specifications accumulated over three decades of usage is exactly as complicated as the code that still runs.
Just wait 10 more years and hope AI can solve it.
In the meantime, people can't renew their driver's license at 3:36. So what? Is that a requirement?
rozab 2 days ago | prev | next |
I've often run into this when using DVLA services and spluttered with indignation. But at the end of the day, these services are fantastically usable (during the daytime) and I appreciate Dafydd pushing to just get them out there!
I got my license in 2015 so never in my life have I had the apparently ubiquitous American experience of queuing at the DMV and filling in paper forms. (is this still real? or limited to stand-up comedy?)
AlotOfReading 2 days ago | root | parent | next |
Queuing at the DMV and filling out paperwork is very much a real thing that still happens. It's a pretty different experience in every state though.
ChocolateGod 2 days ago | root | parent |
Can it not be done online like in the UK?
neckro23 a day ago | root | parent | next |
Usually, but it depends on the state. Remember, America isn’t a country, it’s 50 countries in a trenchcoat.
It’s often a mishmash of services too. I was told in-person at the DMV that I couldn’t renew my registration since I’m not the registered owner of my car. So I just went to a DMV kiosk at the local grocery store and did it there without a hassle.
mystified5016 a day ago | root | parent | prev |
Yes, with a very long list of exceptions which means that many people end up needing to go in person for common situations.
nsxwolf 2 days ago | root | parent | prev | next |
The queues have been mostly replaced with "take a number" systems where you can sit down and wait... with your... papers... that you had to fill out first...
fn-mote 2 days ago | root | parent |
> The queues have been mostly replaced with "take a number" systems where you can sit down and wait...
My recent experience was: sign up online and get a 30 min window (9:00-9:30 say). Queue everyone for that 30 minute window outside the building. At exactly 9:30, enter and go through the usual queues inside. The advantage is that getting through those queues now takes 30 minutes or less because their length is limited. Presumably we/they traded volume of processing for certainty of time spent in the queue. A very familiar tradeoff for a computer scientist.
skippyboxedhero a day ago | root | parent |
The UK has a similar situation for hospitals. Your driving licence system sounds like a great embarrassment though.
nsxwolf 7 hours ago | root | parent |
And the employees are mean! Really mean!
snakeyjake a day ago | root | parent | prev |
My US state, one of the ones NOT living in the past, does almost everything online.
The only times you have to come in are:
1. for your first license, either as a newly-licensed driver or an out-of-state driver who recently moved
2. if you were bad and broke the law or otherwise had your license cancelled/revoked/suspended
Even those people have to call or go online to make an appointment.
All other tasks from getting/returning plates to requesting a duplicate title can be done online, through drop boxes, or by mail.
I have been to the DMV three times since 1995: once to turn my out-of-state license into an in-state one, once to turn that drivers license into a realID-compliant one, and once to have my fingerprints taken for a concealed carry permit.
robertlagrant 17 hours ago | prev | next |
The cause is easy: the people specifying the problem to be solved don't ask for an independent set of tests that are human-readable and can be run automatically to verify the system.
If they did that, then the system would be verifiable and so would changes to it - the tests would simply need to be adapted to talk to the new version of the system.
Too late now, of course. But that's what should be done.
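A minimal sketch of what such an independent, human-readable test set could look like. Everything here is illustrative: the rule (a minimum age of 17), the `CASES` table, and both `legacy_system` and `rewritten_system` are hypothetical stand-ins, not anything taken from the DVLA's actual systems. The point is only that the cases are written once and run against whichever implementation is plugged in.

```python
# Each case reads like a sentence a non-programmer could review.
# (description, applicant age, expected decision) -- hypothetical rule.
CASES = [
    ("16-year-old is refused", 16, False),
    ("17-year-old is accepted", 17, True),
    ("40-year-old is accepted", 40, True),
]


def check(system, cases=CASES):
    """Return the descriptions of every case the given system gets wrong."""
    return [desc for desc, age, expected in cases if system(age) != expected]


def legacy_system(age):
    # Stand-in for the old system's behaviour.
    return age >= 17


def rewritten_system(age):
    # Stand-in for a rewrite with a deliberate off-by-one bug.
    return age > 17
```

Running `check(legacy_system)` gives an empty list, while `check(rewritten_system)` reports the 17-year-old case, which is the kind of regression a rewrite would otherwise only discover in production.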
arjie a day ago | prev | next |
While these explanations are plausible, certain other things I've encountered make me believe that deeper reasons underlie even these reasons. When I lived in the UK in 2017 as a foreigner, all applications for a driving licence as a foreigner on a T2-ICT visa had to be sent over for a couple of weeks and you had to include your passport and Biometric Residence Permit and everything. By comparison, I was able to get my driving licence at the California DMV pretty easily even as a foreigner and my passport and so on were photocopied and not retained. This drastic difference in service ability between the DVLA and a notoriously disliked American government service leads me to believe that the proximal technical causes for this are downstream from organizational choices for how to deliver service.
robertlagrant a day ago | root | parent |
> downstream from organizational choices for how to deliver service
100000%. They're a monopoly service you must interact with or get fined and (eventually) locked up. They have zero incentive to do a particularly good job. Some orgs in this situation are just well run and do a good job, but there's no competitive pressure for them to do so.
Y_Y a day ago | root | parent |
And there are pressures other than competition, and some people just want to do a particularly good job just because it's their job.
robertlagrant 18 hours ago | root | parent |
Yeah, as I say. But the other pressures are often temporary and political.
pestatije 5 days ago | prev | next |
DVLA - Driver & Vehicle Licensing Agency
plus, since I'm already posting a comment: it's because there is no batch window to process transactions
delta_p_delta_x a day ago | prev | next |
Some DVLA services don't work in the day, too. Case in point, the 'get a share code' service: https://www.viewdrivingrecord.service.gov.uk/driving-record/...
This doesn't explain why the website can't just validate the input, queue up the transaction, and execute it later when the DB is available. It's not like your record changes that often. If the transaction fails they can then just send you an email to try again during working hours. All these processes used to work by paper forms, so most of them don't need interactive processing - that's why they have a batch job in the first place.
My GP surgery has the same issue with non-urgent requests. It's entirely input, it's definitely not looking in a DB because it doesn't even ask who you are until the last step. And yet it won't accept an input except during working hours. Madness.
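The validate, queue, and execute-later idea above can be sketched roughly as follows. This is a toy under stated assumptions: `Request`, `validate`, and `DeferredQueue` are made-up names, the validation rule is a placeholder, and a real service would need a durable queue rather than an in-memory one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    licence_number: str
    new_address: str


def validate(req):
    # Placeholder rule: both fields must be non-empty. Real rules are unknown.
    return bool(req.licence_number.strip()) and bool(req.new_address.strip())


class DeferredQueue:
    """Accepts requests while the database is offline and replays them later."""

    def __init__(self):
        self.pending = deque()

    def submit(self, req):
        # Reject obviously bad input immediately, queue everything else.
        if not validate(req):
            return "rejected"
        self.pending.append(req)
        return "queued"

    def drain(self, apply):
        """Run once the database is back; return the requests that failed
        so their owners can be emailed and asked to try again."""
        failed = []
        while self.pending:
            req = self.pending.popleft()
            try:
                apply(req)
            except Exception:
                failed.append(req)
        return failed
```

The catch, which the sketch does not solve, is that anything needing a lookup against the live record (for example, checking that the licence number exists) cannot be validated until the drain step.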
glonq a day ago | prev | next |
This sounds a bit familiar. I used to work at a medium-sized company whose systems were based on COBOL code and Unisys mini/mainframe hardware from the 80's. We even had a person employed as a "tape ape"; thankfully not me. Throughout the next decade or two they tried various 4GL-generated facades and bolt-ons but could never escape from that COBOL core. Eventually I think they migrated the software to some kind of big box that emulated the Unisys environment but was slightly more civilized. I have no idea whether they ever eradicated all the COBOL though.
mike_hearn 2 days ago | prev | next |
tl;dr same reason other services go offline at night: concurrency is hard and many computations aren't thread safe, so need to run serially against stable snapshots of the data. If you don't have a database that can provide that efficiently you have no choice but to stop the flow of inbound transactions entirely.
Sounds like Dafydd did the right thing in pushing them to deliver some value now and not try to rebuild everything right away. A common mistake I've seen some people make is assuming that overnight batch jobs that have to shut down the service are some side effect of using mainframes, and any new system that uses newer tech won't have that problem.
In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes. A classic example is in banking where the ordering of these jobs can change real world outcomes (e.g. are interest payments made first and then cheques processed, or vice-versa?).
In other cases it's often easier for users to understand a system that shuts down overnight. If the rule is "things submitted by 9pm will be processed by the next day" then it's easy to explain. If the rule is "you can submit at any time and it might be processed by the next day", depending on whether or not it happens to intersect the snapshot taken at the start of that particular batch job, then that can be more frustrating than helpful.
Sometimes the jobs are batch just because of mainframe limitations and not for any other reason, those can be made incremental more easily if you can get off the mainframe platform to begin with. But that requires rewriting huge amounts of code, hence the popularity of emulators and code transpilers.
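To make the banking example above concrete, here is a small worked calculation. The figures are invented and the 1% rate is exaggerated for readability; the only point is that the same two jobs give different balances depending on which runs first.

```python
from decimal import Decimal


def apply_interest(balance, rate):
    """Credit interest on the current balance."""
    return balance * (1 + rate)


def process_cheque(balance, amount):
    """Debit a cheque from the balance."""
    return balance - amount


opening = Decimal("1000")
rate = Decimal("0.01")      # illustrative rate, not a real product
cheque = Decimal("500")

# Order A: interest job runs before the cheque job.
interest_first = process_cheque(apply_interest(opening, rate), cheque)

# Order B: cheque job runs before the interest job.
cheque_first = apply_interest(process_cheque(opening, cheque), rate)
```

With these numbers, interest-first leaves 510 and cheque-first leaves 505, so changing the job order during a migration would change real customer balances.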
ndriscoll a day ago | root | parent | next |
Getting rid of batch jobs shouldn't be a goal; batch processing is generally more efficient as things get amortized, caches get better hit ratios, etc.
What software engineers should understand is there's no reason a batch can't take 3 ms to process and run every 20 ms. "Batch" and "real-time" aren't antonyms. In a language/framework with promises and thread-safe queues it's easy to turn a real time API into a batch one, possibly giving an order of magnitude increase in throughput.
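A rough sketch of that idea in Python, using `asyncio` futures in place of promises. Note the swap: this is single-threaded cooperative concurrency, not the thread-safe queue version described above, and `MicroBatcher` is a made-up name. Callers use what looks like a one-item-at-a-time API, while the work underneath is done in batches on a fixed interval.

```python
import asyncio


class MicroBatcher:
    """Collects single-item calls and hands them to a batch function together."""

    def __init__(self, batch_fn, interval=0.02):
        self.batch_fn = batch_fn    # takes a list of items, returns a list of results
        self.interval = interval    # seconds to wait before flushing (20 ms here)
        self.pending = []           # (item, future) pairs waiting for the next flush
        self.flush_task = None

    async def call(self, item):
        # Looks like a real-time call to the caller; resolves when the batch runs.
        future = asyncio.get_running_loop().create_future()
        self.pending.append((item, future))
        if self.flush_task is None:
            self.flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self.interval)
        batch, self.pending = self.pending, []
        self.flush_task = None
        try:
            results = self.batch_fn([item for item, _ in batch])
        except Exception as exc:
            # Propagate a batch failure to every caller in that batch.
            for _, future in batch:
                future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            future.set_result(result)
```

Five concurrent `call()`s made within one interval end up as a single invocation of `batch_fn` with five items, which is where the throughput gain comes from.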
mike_hearn a day ago | root | parent |
Batch size is usually fixed by the business problem in these scenarios, I doubt you can process them in 3msec if the job requires reading in every driving license in the country and doing some work on them for instance.
ndriscoll a day ago | root | parent |
This particular thing might be difficult to change because it's 50 year old COBOL or whatever, but my point was more that I've encountered pushes from architects to "eliminate batches" and it makes no sense. It just means that now I have to re-batch things in my code. The correct way to think about it is that you want smaller, more frequent batches.
Do they really need to do work on all records every night? Probably not. Most people aren't changing their license or vehicle info most days. So the problem is that somewhere they're (conceptually) doing a table scan instead of using an index. That might still be hard to fix, but at least identify the correct problem. Otherwise as you say moving to different tech won't fix it.
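As an illustration of the table-scan-versus-index point, here is a sketch using SQLite. The schema is hypothetical (a `licence` table with an `updated_at` column) and has nothing to do with the real DVLA data model; it only shows a nightly job asking for the rows changed since the last run instead of reading every record.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical licence table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE licence (id INTEGER PRIMARY KEY, holder TEXT, updated_at INTEGER)"
    )
    # The index is what lets the query below avoid scanning the whole table.
    db.execute("CREATE INDEX idx_licence_updated ON licence (updated_at)")
    return db


def changed_since(db, last_run):
    """Return only the records modified after the previous batch run."""
    rows = db.execute(
        "SELECT id, holder FROM licence WHERE updated_at > ? ORDER BY id",
        (last_run,),
    )
    return rows.fetchall()
```

This assumes every write reliably stamps `updated_at`, which is exactly the kind of guarantee a decades-old system may not provide.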
abigail95 a day ago | root | parent | prev | next |
Do you know why the downtime window hasn't been decreasing over time as it gets deployed onto faster hardware over the years?
Nobody would care or notice if this thing had 99.5% availability and went read only for a few minutes per day.
roryirvine 16 hours ago | root | parent | next |
Most likely because it's not just a single batch job, but a whole series which have been scheduled based on a rough estimate of how long the jobs around them will take.
For example, imagine it's 1997 and you're creating a job which produces a summary report based on the total number of cars registered, grouped by manufacturer and model.
Licensed car dealers can submit updates to the list of available models by uploading an EDIFACT file using FTP or AS1. Those uploads are processed nightly by a job which runs at 0247. You check the logs for the past year, and find that this usually takes less than 5 minutes to run, but has on two occasions taken closer to 20 minutes.
Since you want to have the updated list of models available before you run your summary job, you therefore schedule it to run at 0312 - leaving a gap of 25 minutes just in case. You document your reasoning as a comment in the production control file used to schedule this sequence of jobs.
Ten years later, and manufacturers can now upload using SFTP or AS2, and you start thinking about ditching EDIFACT altogether and providing a SOAP interface instead. In another ten years you switch off the FTP facility, but still accept EDIFACT uploads via AS2 as a courtesy to the one dealership that still does that.
Another eight years have passed. The job which ingests the updated model data is now a no-op and reliably runs in less than a millisecond every night. But your summary report is still scheduled for 0312.
And there might well be tens of thousands of jobs, each with hundreds of dependencies. Altering that schedule is going to be a major piece of work in itself.
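For contrast with the clock-time scheduling described above, here is a sketch of running jobs by declared dependency instead, using the standard library's `graphlib`. The job names are invented to match the example, and this is not how any real mainframe scheduler is configured; it only shows that once "summary depends on ingest" is recorded explicitly, the 25-minute safety gap is no longer needed.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(jobs, depends_on):
    """Run each job as soon as everything it depends on has finished.

    jobs: maps job name -> zero-argument callable
    depends_on: maps job name -> set of job names that must run first
    Returns the order in which the jobs were run.
    """
    # Include jobs with no declared dependencies so they still get run.
    graph = {name: depends_on.get(name, set()) for name in jobs}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        jobs[name]()
    return order
```

The hard part in practice is the one the comment already names: recovering tens of thousands of those dependency edges from comments in a production control file.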
kalleboo a day ago | root | parent | prev | next |
Why would they spend the money to deploy it on faster hardware when the new cloud-based system rewrite is just around the corner? It's just 3 months away, this time, for sure...
mike_hearn a day ago | root | parent | prev |
It doesn't get deployed onto faster hardware. Mainframes haven't really got faster.
ndriscoll a day ago | root | parent | next |
Mainframes have absolutely gotten faster. They're basically small supercomputers.
throw16180339 a day ago | root | parent | prev | next |
You're mistaken about this. IBM's z-series had 5GHz CPUs well over a decade ago and they haven't gotten any slower.
abigail95 a day ago | root | parent | prev |
It must be. Maintaining the original hardware would be more expensive than upgrading to compatible but faster systems.
mike_hearn a day ago | root | parent |
What compatible systems? Mainframes are maintained in more or less their original state by teams from IBM. They are designed to be single machines that scale vertically and never shut down, every component can be hot-swapped including CPUs but IBM charge a lot for CPU capacity if I recall correctly. Given that nighttime doesn't get shorter, the DVLA probably don't see much reason to pay a lot more for a slightly smaller window.
And mainframes from the 80s are slow. It sounds like they're running on the original.
ndriscoll a day ago | root | parent |
Newer mainframes are still faster than older mainframes, and can have hundreds of cores and 10s of TB of RAM. A big part of IBM's draw is that they make modern systems that will continue to run your software forever with no modifications. I had an older guy there tell me a story about them changing a default in some ISPF panel, and customers complained enough that they had to change it back. Their storage systems have a virtualization layer for old programs that send commands to move the heads of a drive that hasn't been manufactured for 55 years or whatever and translate that to use storage backed by a modern RAID with normal disks. The engineers in the mainframe groups know who their customer base is and what they want.
It's unlikely that they're literally using 40 year old hardware since the replacement parts for that would be a nightmare to find and almost certainly more expensive than a compatible new machine.
mschuster91 18 hours ago | root | parent | prev |
> In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes.
That right here is the meat of so many issues in large IT projects: large corporations or government are very, very skeptical about changing their "established processes", usually due to "we have always done it this way". And no matter how often you try to explain to them "do it a tiny bit differently to get the same end result but MUCH more efficiently" you'll always run into walls.
ForHackernews 2 days ago | prev | next |
Unpopular opinion, but I think many systems would benefit from a regular "downtime window". Not everything needs to be 24/7 high availability.
Maybe not every night, but if you get users accustomed to the idea that you're offline for 12 hours every Sunday morning, they will not be angry when you need to be offline for 12 hours on a Sunday morning to do maintenance.
The stock market closes, more things should close. We are paying too high of a price for 99.999% uptime when 99.9% is plenty for most applications.
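For a sense of scale, the availability figures above translate into downtime budgets as follows. This is plain arithmetic over a non-leap year; the function names are made up for the example.

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_WEEK = 7 * 24


def downtime_hours_per_year(availability_percent):
    """Hours of allowed downtime per (non-leap) year at a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def availability_with_weekly_window(window_hours):
    """Availability (percent) if the service is offline for a fixed window every week."""
    return (1 - window_hours / HOURS_PER_WEEK) * 100
```

99.9% allows about 8.8 hours of downtime a year and 99.999% about 5.3 minutes. A 12-hour window every week works out to roughly 92.9% availability, so a weekly maintenance morning is a much bigger concession than the step from five nines to three.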
jmwilson a day ago | root | parent | next |
Who works Sunday morning then?
The maintenance window will morph into a do-big-risky-changes window, which means everybody in engineering will have to be on-call. Many years ago, when I newly joined a FAANG, I asked, "shouldn't I run this migration after hours when load is low?" and the response was firm, "No, you'll run it when people are around to fix things". It may not always be the answer, but in general, I want to do maintenance when people are present and willing to respond, not nights and weekends when they're somewhere else and can't be found.
crazygringo a day ago | root | parent | prev | next |
> Not everything needs to be 24/7 high availability.
If it makes you more money to be available 24/7 then why wouldn't you?
> Maybe not every night, but if you get users accustomed to the idea that you're offline for 12 hours every Sunday morning
Then I would use a competitor that was online, period.
Imagine Sunday morning is the only time you have to complete a certain school assignment, but Wikipedia is offline? Or you need to send messages to a few folks that they need to see by the evening, but the platform won't come online until 3pm, which means you'll need to interrupt your afternoon family time instead?
Maybe things closing works fine for your needs and your schedule. But it sure won't for everyone else. Having services that are reliable is one of the things that distinguishes developed countries from developing ones.
corint a day ago | root | parent |
> If it makes you more money to be available 24/7 then why wouldn't you?
Agreed, but for a government service where you update your license, or tell them about selling a car or something, there's no real 'more' money. Being closed at 3am doesn't lose the opportunity in the way that it would if you were selling widgets. It instead forces the would-be users at 3am to wait until the morning.
OJFord a day ago | root | parent | prev | next |
It only really works where the audience is already limited in country/timezone though. Sure, a global service could just stagger the downtime around the world... but (unless you've already partitioned the infrastructure equivalently) then you're just running 24/7 with arbitrary geofencing downtime on top.
kragen 2 days ago | root | parent | prev |
Basically this happens because the DVLA and the stock market don't have any competition. Customers in a competitive market won't be angry when you need to be offline for 12 hours every Sunday morning; they'll just switch to your competitor some Sunday, because the competitor is providing them something they value that you don't provide.
ajnin a day ago | root | parent | next |
The stock markets definitely have competition. For instance Frankfurt, London, Paris or Amsterdam very much compete with each other to offer desirable conditions for investors, and companies will move their trading from one to another if it is in their interest. I think the fact they close at night is a self-preservation mechanism, traders would become insane if they had to worry about their positions 24/7.
ForHackernews 2 days ago | root | parent | prev |
Maybe they should regulate Sunday trading hours, or unionized sysadmins should negotiate the end of on-call hours.
The red queen's race that you describe for ever-greater scale, ever-greater availability is an example of the tragedy of the commons. Think how much money and many human minds have been wasted trying to squeeze out that last .0001% of "zero downtime" when they could have been creating something new.
"Keep doing the same thing, but more of it, harder" is a recipe for a barren world of monoculture.
lifeoflejf 2 days ago | root | parent | next |
Bergen county NJ has blue laws that make it so non-grocery stores must be closed on Sundays. Maybe there’s some value in structuring a time where everybody is off?
Just like at work the only time I really get off is when all of my customers are off. It’s nice when the industry sorta shuts off for a week or so around Christmas
kragen a day ago | root | parent | prev | next |
Something like that might plausibly be correct, though you've exaggerated it to a level where it's clearly false.
If we steelman it to its most defensible essence, I think what you're saying is that the cost of the human effort needed to provide these higher uptimes exceeds the consumer benefit (the value of being able to buy a camera on Saturday), say. You could imagine, for example, that each incremental improvement in uptime wins over a proportion of the customer base providing a value that vastly exceeds its cost — but only until your competitors improve their own offering to match, so all the surplus from all this uptime improvement ultimately goes to the consumers, not the producers.
There are two related holes in this idea.
The first is that producing consumer surplus is what the economy is for, in a moral sense. The reason producing goods and services is a good thing to do is so that someone will benefit from using them! So if all the effort that sysadmins make goes into making services better for users, that's a good thing, not a bad thing.
The second is that nothing is stopping a new entrant from offering a new, low-cost service that isn't as reliable. If the cost of providing all that extra reliability (bundled into the incumbents' pricing scheme) is higher than the actual benefit to users, the users will switch to the lower-cost, less-reliable service. This has happened many times, in fact: less-reliable minicomputers stole business from mainframes, less-reliable VoIP stole business from ATM and SONET and SDH, all kinds of less-reliable plastic goods have stolen business from all-metal versions, and now solar panels are stealing business from coal power plants even though solar panel "uptime" is like 30%.
So the particular market dynamics we're talking about actually sensitively optimize the amount of effort given to uptime to the economic optimum. There do exist lots of market failures, but the particular dynamic we're discussing is the opposite extreme from something like a dollar auction.
abigail95 a day ago | root | parent | prev |
Who is trying to achieve zero downtime? Facebook has degraded service regularly; it's just close enough to 99.9 that nobody cares.
If loading my messages times out I just move onto something else and go back a few minutes later.
Surely they have metrics measuring that and don't think it's worth the engineering effort to improve it.
kragen a day ago | root | parent |
One of the interesting things that came out of Google's "SRE" system is that they deliberately add outages if they don't have enough. They learned years ago that if you build a service that promises 99% uptime and deliver 99.99% uptime, other people in the company will come to depend on that 99.99% uptime unintentionally. So they chaos-monkey it to ensure that the inevitable failures aren't catastrophic.
neuroelectron a day ago | prev | next |
I'm sure the upgrade would have been trivial for a competent expert to do but instead they outsourced it to a big software firm and surprise, it went over-budget. Seriously, what could this database be doing that's so complicated?
IOT_Apprentice a day ago | prev | next |
This seems weird to me. The number of records is minuscule compared to internet scale tech.
The data model for this sounds like it would be simple. Exactly how many use cases are there to be implemented?
Build this with modern tech on HA Linux backends. Eliminate the batch job nonsense.
This could be written up as a project for bootcamps or even a YouTube series.
I suspect some internal politics about moving forward and clinging to old methods is at hand.
Perhaps someone could build an open source platform if the requirements were made public.
dhosek a day ago | root | parent |
The thing is that a lot of internet scale stuff tends to be non-critical. It’s not a big deal if 1% of users don’t see a post to a social network site. It’ll show up later, maybe, or never, but nobody will care.
On the other hand, with transactions like banking or licensing or health insurance, it’s absolutely essential that we definitely maintain ACID compliance for every single transaction, which is something that many “internet-scale” data solutions do not and often cannot promise. I have a vague recollection of some of the data issues at a large health insurance company where I worked a couple years ago that made it really clear why there would be an overnight period where the system would be offline—it was essential to make sure that systems could be brought to a consistent state. It also became clear why enrolling someone in a new plan was not simply a matter of adding a record to a database somewhere.
Not to mention that I suspect that data such as bank transaction records or health insurance claims probably rival “internet scale” for being real big data operations.
mh- a day ago | root | parent |
The reason that these "internet scale" solutions are challenging to operate is because of their latency and availability targets.
If you threw into the requirements "can go down nightly, for hours, for writes AND reads", they could absolutely provide the transactional guarantees you're looking for.
simonbarker87 a day ago | next |
Excellent bit of pragmatism and as a user of this service I’m happy with the trade off.
People wondering why it’s not a simple switch and “there must be something else going on here” have clearly never worked with layers of legacy systems where the data actually matters. Sure it’s fixable and it’s a shame it hasn’t been but don’t assume there aren’t very good reasons why it’s not a quick fix.
The gov.uk team have moved mountains over the past decade, members of it have earned the right to be believed when they say “it’s not simple”.