Why Generative AI Scares the Hell Out of DBAs
I was chatting with a client’s DBA about this thought-provoking blog post about data governance in the age of generative AI. The DBA’s concern was, “What if we hook up generative AI tools to the production database, and someone asks for, say, recommended salary ranges for our next CEO based on the current CEO’s salary? Will the AI tool basically bypass our row-level security and give the querying person information that they shouldn’t be allowed to see?”
And she had a really good point.
If you haven’t worked with AI tools yet, I don’t blame you. You’ve got a real job, dear reader, unlike me who’s just a consultant who gets to travel all over the world while reading cool blog posts on planes. Anyhoo, since I get to play around a lot more, I’ll give you a quick recap:
- Large language models (LLMs) like ChatGPT take plain English requests and turn them into results
- Those results aren’t just text: they can include tables of results, scripts with T-SQL commands, or JSON files (there’s a rough code sketch of that loop right after this list)
- The LLMs are trained on publicly available data
- You can enhance their training by providing additional data, like the contents of your file shares and databases
- There’s almost zero security on LLM requests and result sets
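If you haven’t seen the moving parts before, here’s a minimal sketch of that plain-English-in, T-SQL-out loop. Everything in it is made up for illustration: `ask_llm` stands in for whichever LLM service or library you’d actually call, and the table, columns, and connection string are hypothetical.

```python
import pyodbc  # connects to SQL Server with whatever credentials you hand it


def ask_llm(prompt: str) -> str:
    """Stand-in for whichever LLM client or service you actually use."""
    raise NotImplementedError


# A plain-English question, plus a little schema context, goes in...
question = "What salary range should we budget for our next CEO?"
prompt = (
    "You write T-SQL for SQL Server.\n"
    "Schema: dbo.Employees(EmployeeID int, Title nvarchar(200), Salary money)\n"
    f"Return only a T-SQL query that answers: {question}"
)

generated_sql = ask_llm(prompt)  # ...and a runnable T-SQL script comes out

# Whatever comes back executes with the permissions of THIS connection,
# not the permissions of whoever typed the question.
conn = pyodbc.connect("DSN=Production;Trusted_Connection=yes")
rows = conn.execute(generated_sql).fetchall()
```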
So the scenario that scares the hell out of me is:
- A power user signs up for an LLM service or runs one on their computer
- The power user wants better results, so they provide additional training data: they point the LLM at the company’s data set and load it with financial results, customer info, employee salaries, etc
- The power user loves the report results, so they give other people access to the LLM
And the end result is that anyone who queries the LLM suddenly has access to everything that the power user had access to at the time of the LLM training. Anyone with access to the LLM might ask for a list of customers, employee salaries, or whatever.

That would be what we call “bad.”
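Here’s a rough sketch of why it’s so bad, with every name below made up for illustration: the sensitive rows get copied out of the database under the power user’s permissions when the grounding index is built, so later questions never touch the database (or its row-level security) at all.

```python
import pyodbc


def embed_and_store(text: str) -> None:
    """Stand-in: push a chunk of text into whatever grounding/vector index the LLM searches."""
    ...


def search_index(question: str) -> str:
    """Stand-in: pull back the most relevant chunks for a question."""
    ...


def ask_llm(prompt: str) -> str:
    """Stand-in for the LLM call itself."""
    ...


# Step 1: the power user builds the index using THEIR broad read access.
conn = pyodbc.connect("DSN=Production;UID=power_user;PWD=********")
for title, salary in conn.execute("SELECT Title, Salary FROM dbo.Employees"):
    embed_and_store(f"{title} is paid {salary}")


# Step 2: later, anyone who can reach the chat front end asks a question.
# The database and its row-level security are never consulted again -
# the answer comes from the copy the power user made in step 1.
def answer(question: str) -> str:
    context = search_index(question)
    return ask_llm(f"Use only this context:\n{context}\n\nQuestion: {question}")
```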
Sure, in a sense, this is the same problem data professionals have been struggling with for decades: people can export data, and once it’s out of the database, we don’t have control over who gets to see it. This isn’t new. It’s been the same story ever since Larry Tesler invented copy/paste.
But what’s new is that large language models:
- Don’t make it clear where their data comes from
- Aren’t easily reverse-engineered
- Have damn near zero security on inputs and outputs
So now, the Amazon blog post explains why smart people are going to burn cycles reinventing row-level security. I’m not saying the blog post is bad – it’s not! It’s just a damn shame that the next 10-20 years of data governance are going to look exactly like the last 10-20 years of data governance.
Spoiler alert: we’ve sucked at this kind of security for decades, and we will in the next decade, too.
Comments
It doesn’t exactly work that way. The LLM has already been trained, and you can’t (cost-effectively) “re-train” it on your data. They are “language models”, not “truth models”…meaning they are good at doing things in English (or another language…say SQL), but they don’t know facts. The fact that they “appear” to know facts is just us anthropomorphizing how humans learn language…they listen, generally to those who claim to know facts, and learn the patterns of “language”.
If you point an LLM at a database it will ONLY have access to the “facts” (the grounding data) that the user that connects to the LLM would have access to. This is no different than any front-end UX into a database, be it a nodejs app, excel, pbi, or SSMS. If you wire up any of those interfaces using the sa password, bad things could happen. No different with LLMs.
So, the LLM itself kinda needs no additional security over and above what you already have on your db, your collection of HR pdfs, your sharepoint lists, or your SFDC CRM system.
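To make that concrete, here’s a minimal sketch (made-up names, not anyone’s actual implementation) of the difference between running the retrieval query as the person asking versus running everything on one shared, privileged connection:

```python
import pyodbc


def get_grounding_data(requesting_user: str, question_sql: str) -> list:
    # Option A: connect as (or impersonate) the person asking the question.
    # SQL Server permissions and row-level security apply exactly as they
    # would in SSMS, Excel, Power BI, or any other front end.
    # (In practice you'd use integrated auth or impersonation rather than
    # passing passwords around; the string below is just a placeholder.)
    conn = pyodbc.connect(f"DSN=Production;UID={requesting_user};PWD=placeholder")
    return conn.execute(question_sql).fetchall()


def get_grounding_data_badly(question_sql: str) -> list:
    # Option B: one shared, privileged connection for everybody.
    # Now the LLM front end can surface rows the requester could never
    # have queried directly - the same "sa password" problem as any app.
    conn = pyodbc.connect("DSN=Production;UID=sa;PWD=placeholder")
    return conn.execute(question_sql).fetchall()
```

The permissions that matter are the ones on the connection that fetches the grounding data, not anything about the LLM itself.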
>>The power user wants better results, so they provide additional training data: they point the LLM at the company’s data set and load it with financial results, customer info, employee salaries, etc
Semantics again, but it isn’t training data, the user is providing additional “grounding” data. Where this is a HUGE problem is …
>>A power user signs up for an LLM service or runs one on their computer
…then sends the company’s private IP to the LLM service. You really gotta read the TOS…is the LLM service using your data to possibly train your competitor’s model? There are documented cases of this. Stackoverflow contributors are up in arms over this. Taylor Swift is too, etc. Possibly the simple solution is to blacklist the public versions of the LLMs like chatgpt and deploy an internal version where YOU have control of how you use the data without possibility of exfiltration. chatgpt ENT, Azure OpenAI, or any model deployed in YOUR cloud provider under YOUR control where YOU read the TOS…all good choices.
>>The power user loves the report results, so they give other people access to the LLM
This wouldn’t matter b/c you would give other people access to the LLM but not the grounding data.
>>Will the AI tool basically bypass our row-level security and give the querying person information that they shouldn’t be allowed to see?”
It doesn’t work that way. The LLM runs in the security context of the given security principal. If you tell the LLM to query the data with your sa password…that’s on you. These things neither help nor hinder security. It’s just another UI, albeit one that seems magical and can take your query results and display them in Shakespearean iambic pentameter.
The actual “pattern” is called RAG…retrieval augmented generation. Roughly: I get the grounding data from somewhere…a search index (in the case of pdfs), my CRM (via API calls possibly), or my database (in the case of salaries)…then I pass that data to the LLM and tell it specifically NOT to use the data it was trained on, but instead to use the data I just retrieved as the basis of its facts.
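In code, that pattern looks roughly like this; it’s only a sketch, with placeholder retrieval and LLM calls rather than any particular vendor’s API:

```python
def retrieve(question: str, user: str) -> str:
    """Stand-in: fetch grounding data from a search index, a CRM API, or the
    database - ideally using the requesting user's own permissions."""
    ...


def ask_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for the LLM call."""
    ...


def rag_answer(question: str, user: str) -> str:
    grounding = retrieve(question, user)
    system_prompt = (
        "Answer using ONLY the provided context. "
        "Do not rely on anything from your training data. "
        "If the context does not contain the answer, say so."
    )
    return ask_llm(system_prompt, f"Context:\n{grounding}\n\nQuestion: {question}")
```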
In a nutshell, think of these things this way:
* it’s just another UX, no different than PBI, excel, or a front-end webapp. Secure it the same way.
* these things only know language, not facts, but make sure you aren’t giving away your “facts” to your LLM vendor unwittingly.
Thanks for the detailed comment! I tried to keep things simple for the context of the post, and I totally agree with most of what you said.
I think the big problem is the obvious nutshell at the end:
> it’s just another UX, no different than PBI, excel, or a front-end webapp. Secure it the same way.
Bingo. The same things we’ve struggled with securing for decades, just like how I sum up my post. LLMs are an entirely new surface area, and everybody’s focused on easy accessibility right now – not realizing where the data is going, and the complete lack of security around that new front end.
Yeah. That’s the message. I hear this every day…”But our pdfs and data aren’t properly secured and now you want us to put OpenAI over it?” Um, no. You got bigger fish to fry. A corollary is this: “We want to put OpenAI over our HR pdfs but the problem is we have 20 versions of each pdf in sharepoint today, and when someone searches we are never sure if they are seeing the latest or one that is outdated. With OpenAI, what do we do?” Same problem…um…you got bigger fish to fry.
Best to get your house in order, real quick.
I’ve experienced this “ChatGPT T-SQL” problem, but not one as dire as you predict. My firm is a junior partner on a major project, and we handle the primary OLAP database and the Data Warehouse.
A “Power User” – note the quotes – at the senior partner firm decided to try this whole “use AI to generate T-SQL” thing. The scripts ranged from “simple and OK” to “OMG, what were you thinking!”. He used one of the scripts that was so convoluted that it pinged our quite considerable DB resources and our production process. Alarms went off and I jumped on to check what was happening. Used your scripts (Thank you so much, Brent) to find the query and kill it, KILL IT WITH VIRTUAL FIRE, and disable the account – which honestly was me being spiteful.
He was mildly reprimanded, apologized, and is one misstep away from losing the access he allegedly needs for analysis. The T-SQL that he generated was intentionally quite complicated, as he wanted to test the limits, and oh boy, he sure found them – both the system limits and the management limits. Also, the T-SQL made no sense for the result, no matter how syntactically correct it was. I think it even did ANSI joins and old-style join-via-the-where-clause in the same query. You lose style points forever, AI. Go back to Oracle 7. No, MS Access.
I really should have saved the query. It was SO bad, like Star Wars Christmas Special bad.
That’s a great example. The thing with AI models is that they’ll just get better from here though.
What you see now is just a glimpse of what we’re going to have in 2, 5, or 10 years.
I bet that by then you’ll be able to easily generate complex SQL scripts where the AI optimizes parts via trial and error together with you and, at the same time, prevents you from asking for dumb things like “testing the limits”.
I’m frankly surprised to hear that. Generally OpenAI does real good with queries. My day job is writing “RAG pattern code over databases”. There are “tricks” to getting good SQL queries and it’s too much for a blog comment. It’s likely you or your Power User missed a few steps. As Michael said in another comment “this is a glimpse of the future”. In many respects, OpenAI-as-SQL-generator is much like an ORM…they function better when you give them good metadata and your data model is halfway SANE. It might behoove you to help your POWER USER understand a little more about the problem they are trying to solve.
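One common version of those “tricks” (just an illustration, not necessarily the commenter’s exact approach) is to hand the model real schema metadata pulled from the database instead of hoping it guesses your table names. The table names below are made up:

```python
import pyodbc


def ask_llm(prompt: str) -> str:
    """Stand-in for whatever LLM client you use."""
    ...


def describe_schema(conn, table: str) -> str:
    """Pull real column names and types so the model doesn't have to guess."""
    rows = conn.execute(
        "SELECT COLUMN_NAME, DATA_TYPE "
        "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ?",
        table,
    ).fetchall()
    return table + "(" + ", ".join(f"{col} {typ}" for col, typ in rows) + ")"


def generate_sql(conn, question: str) -> str:
    schema = "\n".join(describe_schema(conn, t) for t in ("Customers", "Orders"))
    prompt = (
        "You write T-SQL for SQL Server. Use only these tables and columns:\n"
        f"{schema}\n"
        f"Question: {question}\n"
        "Return a single SELECT statement, nothing else."
    )
    return ask_llm(prompt)
```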
I’ll tell you this…if your job title is “SQL Developer” you should likely consider reframing your role to something more akin to “Business and Data Analyst”. You WANT to show an acumen for business problem solving and rely less on tool-based knowledge. There will always be a place for good, performant hand-crafted SQL, but that job description is getting “shifted left”.
A quick anecdote: I work on what is called “GraphRAG”, which is asking NL style questions against knowledge graph databases. Said differently, I want a user to be able to write an OpenAI prompt similar to “Go look at my customer service call logs and tell me the top five things customers are bitching about” and it should be able to formulate a response that allows the user to ask additional “chain of thought” style questions. Something like: “oh, they are complaining about our horrible support folks? When they complain about those horrible support folks, what do they suggest as the top way THEY would solve that problem?”
Think about all the SQL and code needed to do that. And, perhaps surprisingly, IT WORKS (mostly). This is what’s coming. I say this not to scare SQL Developers into thinking “SQL code generation will put me out of a job”. It won’t. It will actually mean we need MORE SQL DEVELOPERS. That’s Jevons’ Paradox. The easier a task becomes, the more folks realize the unmet demand for that task. But you’re gonna need to be a good SQL developer who is interested in solving business problems. The invention of the backhoe didn’t mean ditch diggers starved; it meant the world realized it needs EVEN MORE HOLES. Point is: don’t be scared, don’t fight it, but figure out how to leverage it to relieve some of your cognitive burden. Your POWER USER is crashing your database because of unmet demand.
I think it already happened: https://securityboulevard.com/2023/10/yes-githubs-copilot-can-leak-real-secrets/