Can We Achieve 10× Productivity in Employment Law?
Let's start with the biggest question people are asking.
Why This Matters Now
Workplace regulations multiply yearly while resources stay flat. Ten‑fold productivity is becoming the goal for Fortune 1000 legal teams protecting complex global workforces, and AI is held up as the critical technology for reaching it. CEOs such as Shopify's Tobias Lütke have begun proclaiming that AI use is now mandatory, and that staff must prove a job can't be done by AI before asking for more headcount.
Let’s dig deeper into this. Is this entirely hype?
What the Latest AI Tools Actually Do
Not everyone is studying the latest announcements from OpenAI, Microsoft, Google and Anthropic. If you haven't yet seen the latest AI tools in action, here's a quick way to get up to speed. These links are quick to read or watch (and let me know what you think in the comments):
Rapid research & drafting: see Anthropic's Claude in action (LinkedIn post)
Document summaries: watch NotebookLM turn your uploaded documents into a summary, or even a podcast (YouTube demo)
OpenAI's Deep Research (YouTube video)
These examples generate a great deal of excitement! Early adopters report how much fun they're having and the dazzling speed of these solutions.
However, they also report uneven accuracy and clunky UX.
Anecdotally, individuals report:
“It’s full of mistakes or omissions”
“We did the pilot but it wasn’t quite there”
“We would throw these white papers into NotebookLM for a briefing. The usability is mediocre as an enterprise enabler”
So what is the truth? Is it great or is it bad?
Several organizations have run more thorough assessments that go beyond the anecdotes. Here's a recap of some recent benchmarks, with accuracy data and known weaknesses.
Vals.ai also publishes benchmarks on a regular basis, reporting on the performance of the raw models. Here are the two benchmarks it runs in the legal domain, covering contract law and case law.
As of today (April 18, 2025), Vals.ai shows that the winning LLM on each benchmark is different!
Grok is #1 for case law, yet #10 for contract law (Grok is the model developed by Elon Musk's xAI).
OpenAI fares well, but isn’t in the top three for either.
Anthropic also fares well, at #5 for case law and #4 for contract law. Remember, though, that these figures reflect only the performance of the raw models. A software system's performance is more than the performance of whatever model it uses: most of the work in building a system is taking the raw capability and tuning it for the intended use case, to improve criteria such as accuracy, consistency, cost or speed.
Let's dig in a little deeper: if this technology is so powerful, what is going wrong?
Why Do Good Demos Become Disappointing Pilots?
Data relevance.
The first thing to understand is that LLMs grab whatever they have access to. Whether it is internet content or content sitting in a shared drive like Google Drive or Dropbox, they ingest it regardless of quality, vintage, or superseded status. In some use cases this doesn't matter. In the legal field it matters, a LOT. Better decisions depend on thoroughly vetting all the right information, often finding the needle in a haystack. A reliable system for legal users adds algorithms that identify which documents are most relevant. Which are higher quality and should have more prominence? Which are the most recent? Which are obsolete or superseded? This area of technical work is referred to as grounding (see https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview). (As a side note, Fern does this for employment law; Harvey does it for other areas of law, especially M&A matters.)
Unexplainable answers.
Another issue is receiving conclusions without the audit trail required to make them defensible. Legal users cannot rely on confident-sounding content unless it comes with a high degree of verifiability. If AI is suggesting an action, the user needs to understand why. If it's a black box, it will not be relied on when it matters.
Skills.
Speed isn't the only factor, and speed alone does not grow the judgement your team needs for high-stakes calls. Individuals must continually deepen their own experience and knowledge so they are prepared for higher-level work that involves judgement. In the legal field, an ideal system helps you get wiser, not just faster. You don't want the answer simply dropped in your lap. Breaking down the thought process so users can participate in knowledge creation, and understand how an answer was reached, is very important.
Conclusion
AI holds promise for 10x productivity in employment law, but current tools have limitations. While demos can be impressive, pilots often reveal problems with accuracy, data relevance and explainability. These issues are addressable. When you're assessing tools, ask your vendor how they approach them.
About Fern AI for Employment Law
We started Fern to work collaboratively with a network of senior employment professionals to move the entire industry forward. We partner with senior employment specialists to build AI solutions and workflows to reduce risks such as penalties, workforce disputes, litigation and bad press.
Our focus is on helping employment professionals drive three core objectives:
Faster execution
Making better decisions throughout their organization
Getting wiser and more skilled (as humans)
This blog series is about helping practitioners keep on top of the possibilities of using AI in their work. We're looking for practical ways AI can sharpen decisions, accelerate execution, and make your team wiser, without compromising rigor. You can always contact me at sunita@fernlabs.ai to suggest topics you want addressed or to share your thoughts.
