This article first appeared on ul.ie.
Whether you’re an AI advocate or AI adversary, there’s no doubt that, as a technology, it’s here to stay.
ChatGPT and other generative AI tools have quickly become a mainstay of the modern workplace, and education is no exception. It is important therefore that we understand the basics of how this technology works so that we can at the very least make informed decisions about its use.
How it works
Believe it or not, the algorithms that power predictive text are a precursor to ChatGPT.
Yes, the annoying autocorrect on your old Nokia contains the seeds of this technology. ChatGPT is a species of Large Language Model (LLM) that, at its core, works on a word-by-word (or more accurately, token-by-token) basis, and its main concern is always to produce a ‘reasonable continuation’ of the text generated so far.
Crucially, this includes the prompt supplied by the user, so it isn’t quite accurate to think of ChatGPT as ‘responding’ to prompts so much as extending them.
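To make the idea of "extending" a prompt concrete, here is a minimal sketch of a predictive-text machine in Python. It is a toy and not how ChatGPT is actually built: it works on whole words rather than tokens, it only ever looks at the single previous word, and the tiny training text, along with the function names train_bigrams and continue_text, is invented purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word in the training text, which words follow it."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def continue_text(prompt, followers, max_new_words=5):
    """Extend the prompt one word at a time with the most frequent follower."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_new_words):
        candidates = followers.get(words[-1])
        if not candidates:
            # The last word never appeared in training, so there is nothing to add.
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical training text, standing in for billions of web pages.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
model = train_bigrams(corpus)

print(continue_text("the dog", model, max_new_words=3))
```

The output begins with the prompt itself, because the program never "replies" to it; it simply keeps going from where the prompt left off. Given a word it has never seen, the toy has nothing to say at all.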
But how does it respond so coherently? Well, it’s really a question of scale. ChatGPT has been pre-trained on terabytes upon terabytes of textual data – the equivalent of billions of web pages.
To put this in perspective, Shakespeare’s complete works clock in at about 3.4 megabytes, so they could fit into ChatGPT’s original dataset more than 13.2 million times over. *
The bank of content it can draw upon is incomprehensibly enormous, and extremely sophisticated probability calculations are used to sift through that data to map out what the next word or token should be.
Without all that (human) data, it would be mute as a field mouse.
If it produces well-written content, it is because of the well-written content it was trained on, and this was written by (you guessed it) humans.
This is important to remember, as ChatGPT won’t tell you (nor is it able to tell you) how it has sourced its content.
Is it ‘cheating’?
How did ChatGPT acquire this training data? Was all of it acquired legally? The short answer is no. Its training material contains many copyrighted works and has resulted in legal challenges.
Is using it a kind of ‘cheating’? Relying on it to produce unedited and unattributed finished work would be, so students shouldn’t be using it to write their essays. Using it as a kickstart to idea generation, a sounding board, or a summarising tool, on the other hand, is perfectly legitimate.
If in doubt, ask yourself this question: are you using it to improve your craft, or outsource it?
Is it safe?
There have been numerous instances of ChatGPT concocting (or hallucinating) falsehoods it presents as facts, so always check your sources.
There are also real privacy concerns: it may store any information you feed it, which poses dangers from the point of view of both confidentiality and cybersecurity. Always exercise caution with your prompts and assume they can be made public.
So, in conclusion
- ChatGPT is, in simple terms, a very sophisticated predictive text machine.
- It doesn’t answer (or fully understand) the prompt supplied by the user; it continues it.
- It is nothing without the human content it was trained on, and not all this content was acquired legally.
- If you do choose to use it, let it complement your work, not compete with it.
- Always be data aware – don’t supply it with confidential information.
* ChatGPT was trained on 45 terabytes (or 45 million megabytes) worth of data, and Shakespeare’s complete works contain about 3.5 million characters (or bytes).
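The arithmetic behind this comparison can be checked in a few lines of Python. This is only a sketch: it assumes decimal units (1 terabyte = 1,000,000 megabytes, matching the "45 million megabytes" above), and the function name copies_that_fit is made up for the example.

```python
def copies_that_fit(dataset_terabytes, work_megabytes):
    """Return how many copies of a work fit into a dataset of the given size."""
    # Assumption: decimal units, so 1 terabyte = 1,000,000 megabytes.
    dataset_megabytes = dataset_terabytes * 1_000_000
    return dataset_megabytes / work_megabytes


# Using the 3.4 megabytes quoted in the article body.
print(f"{copies_that_fit(45, 3.4) / 1_000_000:.1f} million")

# Using 3.5 million characters, treated here as 3.5 megabytes.
print(f"{copies_that_fit(45, 3.5) / 1_000_000:.1f} million")
```

With 3.4 megabytes the result is about 13.2 million copies; with 3.5 megabytes it comes out slightly lower, at about 12.9 million. Either way, the scale of the comparison holds.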