🧠 Did you know there are tests that try to see how your brain works? They are called IQ tests!

A long time ago, a man named Alfred wanted to help kids who needed extra help in school. So he made a test with puzzles and questions! 🧩

But here is the thing: being smart is NOT just one thing. Some people are amazing at drawing. Some people are amazing at singing. Some people are amazing at making friends feel happy! 🎨🎵💖

No test in the whole wide world can measure ALL the cool things your brain can do. Your brain is special in its own way! 🌟

So if someone ever talks about IQ tests, just remember: they only measure a tiny little piece of how awesome you are! 🧠✨

What Is an IQ Test?

IQ stands for "Intelligence Quotient," which is a big fancy way of saying "a number that tries to measure how you think." A long time ago, a French scientist named Alfred Binet made the very first one in 1905 because the government asked him to find out which students needed extra help in school.

How Does It Work?

An IQ test asks you to do things like solve puzzles, remember lists of numbers, figure out patterns, and explain what words mean. It is like a brain workout! The average score is 100, which means most people score right around that number.

Can a Test Really Measure How Smart You Are?

Here is the honest answer: not completely! IQ tests are good at measuring some things, like how fast you can spot patterns or how well you remember information. But they cannot measure how creative you are, how kind you are, how well you work with a team, or how brave you are when things get hard. Being smart comes in many, many different flavors, and IQ tests only taste a few of them.

Why Does This Matter?

Alfred Binet himself said his test should NEVER be used to label a child as "smart" or "not smart." He wanted it to be a tool to help, not a way to sort people into boxes. That is still the most important thing to remember about IQ tests today.

Where IQ Tests Came From

In 1904, the French government had a problem. They had just passed a law saying every child must go to school, but some children struggled much more than others, and teachers needed a way to figure out which students needed extra help. So the government asked a psychologist named Alfred Binet and his colleague Theodore Simon to build a test.

In 1905, they created the Binet-Simon Scale. It had 30 tasks arranged from easy to hard: following simple commands, naming objects in pictures, repeating sentences, explaining the difference between words like "bored" and "tired," and solving logic puzzles. The idea was simple: if a 7-year-old could only complete tasks that most 5-year-olds could do, that child probably needed extra support. Binet called this concept "mental age."

How the Score Works

In 1912, a German psychologist named William Stern came up with the idea of dividing a child's mental age by their actual age and multiplying by 100 to get a single number. That number became the "Intelligence Quotient," or IQ. So if a 10-year-old performed at the level of a 12-year-old, their IQ would be 12 divided by 10, times 100, which equals 120.
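If it helps to see that formula as code, here is a minimal sketch in Python (the function name and example ages are ours, purely for illustration):

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's 1912 ratio IQ: mental age divided by actual age, times 100."""
    return mental_age / chronological_age * 100

# A 10-year-old performing at a 12-year-old's level:
print(ratio_iq(mental_age=12, chronological_age=10))  # 120.0
```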

Today, IQ tests do not actually use that formula anymore. Instead, they compare your score to thousands of other people your age. The average is set at 100, and about two out of every three people score between 85 and 115. Scores above 130 are considered "gifted," and scores below 70 may indicate an intellectual disability that deserves support.

Here is something wild: IQ scores have been going UP over time! Every generation scores about 3 points higher than the one before. This is called the "Flynn Effect," named after researcher James Flynn, who discovered it in 1984. Nobody is completely sure why it happens, but better nutrition, more schooling, and growing up surrounded by complex technology probably all play a part.

What IQ Tests Are Good At

IQ tests are pretty good at predicting how well someone will do in school, which makes sense because that is exactly what Alfred Binet designed them for. They test skills like pattern recognition, vocabulary, working memory (holding information in your head while you use it), and processing speed (how fast your brain handles information). These are all things that help in a classroom.

What IQ Tests Miss

IQ tests cannot measure creativity, emotional intelligence (understanding other people's feelings), musical talent, athletic ability, leadership skills, or practical problem-solving like figuring out how to fix a broken bike or cook dinner for your family. A psychologist named Howard Gardner argued in 1983 that there are at least eight different kinds of intelligence, and IQ tests only measure two or three of them.

Alfred Binet himself warned that his test was limited. He wrote: "The scale does not permit the measure of intelligence, because intellectual qualities cannot be measured as linear surfaces are measured." He never wanted anyone to use a single number to decide what a person was capable of, and he would probably be upset to learn how often that has happened in the century since he died.

The Origin Story: France, 1905

Alfred Binet was a French psychologist who believed intelligence was complex, malleable, and impossible to capture in a single number. The irony of his career is that he invented the tool most commonly used to do exactly that. In 1904, France's Minister of Public Instruction commissioned Binet and his colleague Theodore Simon to develop a method for identifying children who needed special educational support. The resulting Binet-Simon Scale (1905) contained 30 tasks of increasing difficulty, and Binet introduced the concept of "mental age" to describe where a child's cognitive development fell relative to their peers.

Binet was explicit about the test's limitations. He insisted it should not be used to rank children, should not be treated as a fixed measure of innate ability, and should never be used as a label. "Some recent philosophers," he wrote in 1905, "seem to have given their moral support to these deplorable verdicts by asserting that an individual's intelligence is a fixed quantity, a quantity that cannot be increased. We must protest and react against this brutal pessimism." Binet died in 1911 at age 54, before he could see how thoroughly his warnings would be ignored.

How IQ Tests Crossed the Atlantic

In 1916, Lewis Terman, a psychologist at Stanford University, adapted the Binet-Simon Scale for American use, creating the Stanford-Binet Intelligence Scale. Terman made a critical change: while Binet had designed the test to identify children who needed help, Terman wanted to rank all children from least to most intelligent. He adopted William Stern's "Intelligence Quotient" formula (mental age divided by chronological age, multiplied by 100) and applied it broadly.

Then came World War I. The U.S. Army needed to sort 1.7 million recruits quickly, so psychologist Robert Yerkes developed the Army Alpha and Army Beta tests (Alpha for English speakers, Beta for illiterate recruits and non-English speakers). These were the first group-administered intelligence tests, and they brought IQ testing to a mass scale. The results were misused almost immediately: they were cited to argue that certain ethnic groups and immigrants were intellectually inferior, which contributed directly to the Immigration Act of 1924 restricting entry from Southern and Eastern Europe.

The IQ scoring system today. Modern IQ tests no longer use the mental-age-divided-by-actual-age formula. Instead, they use a "deviation IQ" system: your raw score is compared to a large sample of people your age, and the result is placed on a bell curve with a mean of 100 and a standard deviation of 15. This means about 68% of people score between 85 and 115, about 95% score between 70 and 130, and scores above 130 or below 70 are statistically unusual.
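Where do those percentages come from? They fall straight out of the bell curve. Here is a minimal Python sketch, using only the standard library and the mean and standard deviation given above:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # deviation IQ: mean 100, SD 15

# Share of the population within one and two standard deviations of the mean
within_one_sd = iq.cdf(115) - iq.cdf(85)
within_two_sd = iq.cdf(130) - iq.cdf(70)

print(f"Scores 85-115: {within_one_sd:.1%}")  # about 68%
print(f"Scores 70-130: {within_two_sd:.1%}")  # about 95%
```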

Modern IQ Tests: What They Actually Test

The most widely used IQ tests today are the Wechsler Adult Intelligence Scale (WAIS, now in its 4th edition) for adults and the Wechsler Intelligence Scale for Children (WISC) for ages 6 to 16. David Wechsler, a Romanian-American psychologist, created the first version in 1939 because he thought the Stanford-Binet relied too heavily on verbal skills. His test measures four main areas:

  • Verbal Comprehension: vocabulary, general knowledge, reasoning with words
  • Perceptual Reasoning: pattern recognition, spatial thinking, visual puzzles
  • Working Memory: holding and manipulating information in your head (like repeating a string of numbers backward)
  • Processing Speed: how quickly you can scan information and make simple decisions

Another important test is Raven's Progressive Matrices, created by John C. Raven in 1936. It uses only abstract visual patterns, with no words at all, a design meant to reduce the influence of language and culture on scores. You see a grid of shapes with one piece missing and choose which option completes the pattern. It is often considered one of the "purest" measures of fluid intelligence: the ability to solve new problems without relying on previously learned information.

The Accuracy Debate

IQ tests have a test-retest reliability of about 0.90 to 0.95, which means if you take the same test twice, your score will usually be very close both times. That is statistically strong. They are also decent predictors of academic performance, with correlation coefficients around 0.50 to 0.60 between IQ scores and school grades.

But prediction is not the same as measurement. IQ tests predict school success partly because schools test the same cognitive skills that IQ tests test, which creates a circular argument. They are much weaker predictors of job performance (correlations around 0.20 to 0.30 for most occupations), life satisfaction, or creative achievement. And they carry well-documented cultural biases: questions that assume specific cultural knowledge, vocabulary from certain dialects, and testing environments that may be unfamiliar or stressful for some groups can all skew results in ways that reflect opportunity rather than ability.

The Flynn Effect in numbers:

  • Average IQ gain: approximately 3 points per decade
  • Over 100 years: roughly 30 points of increase
  • Consequence: the average person today would score about 130 on a 1920s IQ test, a score considered "gifted"
  • Likely explanation: better nutrition, more years of education, more exposure to abstract thinking through technology and media. Not an actual increase in raw brainpower.
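The arithmetic behind that box is a constant gain compounded over decades. A small sketch, assuming the approximate 3-points-per-decade rate given above:

```python
GAIN_PER_DECADE = 3  # approximate Flynn Effect rate

def score_on_old_norms(current_score: float, decades_ago: int) -> float:
    """What a test-taker today would score against norms set N decades ago."""
    return current_score + GAIN_PER_DECADE * decades_ago

print(score_on_old_norms(100, 10))  # 130: today's average is "gifted" on 1920s norms
```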

What IQ Tests Cannot See

Howard Gardner's theory of multiple intelligences (1983) proposed at least eight distinct types: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal (understanding others), intrapersonal (understanding yourself), and naturalistic (recognizing patterns in nature). Standard IQ tests primarily measure linguistic and logical-mathematical intelligence. The other six go unmeasured.

Psychologist Robert Sternberg offered a different critique with his Triarchic Theory of Intelligence (1985), arguing that intelligence has three components: analytical (what IQ tests measure), creative (generating novel ideas), and practical (solving real-world problems). A person can score average on an IQ test while being exceptional at navigating social situations, running a business, or composing music, and none of those abilities would show up in their score.

Binet's Invention and Binet's Warning

Alfred Binet (1857 to 1911) was a largely self-taught psychologist who spent years studying cognitive development in children, including his own two daughters, Madeleine and Alice, whose different thinking styles convinced him that intelligence was not a single trait but a collection of interacting abilities. When France's compulsory education law of 1882 created classrooms full of students with vastly different preparation levels, the Ministry of Public Instruction turned to Binet in 1904 to develop a diagnostic tool.

The Binet-Simon Scale of 1905, revised in 1908 and 1911, introduced "mental age" as a diagnostic concept. A 10-year-old who could complete tasks typical of an average 13-year-old had a mental age of 13. Binet was careful to specify that mental age was descriptive, not explanatory: it told you where a child stood relative to peers, not why they stood there, and it said nothing about whether the gap could be closed with proper instruction. He explicitly rejected the idea that intelligence was fixed, writing that "with practice, training, and above all method, we manage to increase our attention, our memory, our judgment, and literally to become more intelligent than we were before."

From Diagnostic Tool to Sorting Mechanism

Lewis Terman's 1916 Stanford-Binet test transformed Binet's diagnostic instrument into a ranking system. Terman was a committed eugenicist who believed intelligence was primarily hereditary and that IQ testing could identify genetically superior individuals. His "Genetic Studies of Genius" project (begun in 1921) tracked 1,528 children with IQs above 135 throughout their lives. The study revealed that high-IQ children tended to be successful by conventional metrics (education, income, professional status) but also that IQ was a far weaker predictor of life satisfaction, creative achievement, or contribution to society than Terman had expected.

The militarization of IQ testing during World War I amplified both the scale and the political consequences. Robert Yerkes's Army Alpha and Army Beta tests were administered to approximately 1.7 million recruits between 1917 and 1919. The data was used to make sweeping claims about racial and ethnic hierarchies: Carl Brigham, a Princeton psychologist who analyzed the Army test data, published A Study of American Intelligence (1923) arguing that immigration was degrading American intellectual stock. His work directly influenced the restrictive Immigration Act of 1924. Brigham later retracted his conclusions, admitting in 1930 that the tests measured familiarity with American culture, not innate intelligence. By then, the political damage was done.

The Army Beta test, designed for illiterate and non-English-speaking recruits, asked test-takers to complete drawings of familiar objects. One item showed a bowling alley missing a ball. This is a test of cultural familiarity, not intelligence: a recent immigrant from rural Sicily had likely never seen a bowling alley. The distinction between testing knowledge and testing cognitive ability has haunted IQ testing from the beginning.

The Architecture of Modern IQ Tests

David Wechsler's intelligence scales (WAIS for adults, first published 1955; WISC for children, first published 1949) dominate clinical IQ assessment worldwide. The current WAIS-IV (2008) comprises 15 subtests grouped into four index scores:

  • Verbal Comprehension Index (VCI): Vocabulary (define words of increasing difficulty), Similarities (how are "dog" and "lion" alike?), Information (general knowledge), Comprehension (explain social conventions)
  • Perceptual Reasoning Index (PRI): Block Design (replicate patterns using colored blocks), Matrix Reasoning (complete visual pattern sequences), Visual Puzzles (reconstruct shapes from fragments)
  • Working Memory Index (WMI): Digit Span (repeat number sequences forward, backward, and in order), Arithmetic (mental math under time pressure)
  • Processing Speed Index (PSI): Symbol Search (scan rows for matching symbols), Coding (pair symbols with numbers under time pressure)

The full-scale IQ (FSIQ) is a composite of all four indices, normed to a mean of 100 with a standard deviation of 15. Each index also receives its own score. A person can score 130 on Verbal Comprehension and 95 on Processing Speed; if the other two indices fall in between, the composite FSIQ lands near 112 and obscures that split entirely, which is why many clinicians argue that the full-scale number is less informative than the profile of individual index scores.
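A toy calculation makes the point. The sketch below simply averages the four index scores, which is not how real FSIQ norm tables work, and the two middle index values are our assumptions:

```python
# Illustrative only: actual FSIQ conversion uses the publisher's norm tables.
indices = {
    "Verbal Comprehension": 130,
    "Perceptual Reasoning": 110,  # assumed, not from the text
    "Working Memory": 112,        # assumed, not from the text
    "Processing Speed": 95,
}
composite = sum(indices.values()) / len(indices)
print(f"Composite ~{composite:.0f}")  # ~112; the 35-point VCI-PSI gap vanishes
```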

Raven's Progressive Matrices (1936, revised 1998) takes the opposite approach: 60 abstract visual puzzles with no verbal content, no cultural knowledge requirements, and no time pressure (in the untimed "Standard" version). It measures fluid intelligence (Gf), the ability to reason about novel problems, as opposed to crystallized intelligence (Gc), which reflects accumulated knowledge and learned skills. The distinction between Gf and Gc, proposed by Raymond Cattell in 1963, is now central to the Cattell-Horn-Carroll (CHC) theory that underpins most modern intelligence test design.

What the Numbers Actually Predict

The empirical evidence on IQ's predictive power is genuinely mixed. Meta-analyses show moderate correlations with academic achievement (r = 0.50 to 0.60), which is expected since both IQ tests and academic assessments reward similar cognitive skills. Correlations with job performance are weaker and vary by occupation: about 0.20 to 0.30 for most jobs, rising to 0.40 to 0.50 for highly complex roles like engineering or scientific research. And above a threshold of about IQ 120, IQ has near-zero correlation with measures of creativity, whether assessed by divergent thinking tasks, artistic output, or scientific innovation.
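A useful habit when reading correlations like these: square them to get the approximate share of variance explained. A quick sketch using midpoints of the ranges above:

```python
# Variance explained is the square of the correlation coefficient (r^2).
for label, r in [("academic achievement", 0.55),
                 ("typical job performance", 0.25),
                 ("complex-job performance", 0.45)]:
    print(f"{label}: r = {r:.2f} -> r^2 = {r * r:.0%} of variance")
```

Even the strongest of these leaves roughly 70% of the variance in outcomes unexplained.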

The "threshold hypothesis" suggests that above an IQ of approximately 120, additional IQ points add diminishing returns, and other factors (motivation, opportunity, personality traits like openness and conscientiousness, domain-specific training) become far more important. Lewis Terman's own longitudinal study inadvertently supported this: two children who were screened out of his study for scoring below the 135 cutoff, William Shockley and Luis Alvarez, went on to win Nobel Prizes in Physics, while none of Terman's "geniuses" did.

The Cultural Bias Problem

IQ tests carry several well-documented forms of bias. Content bias occurs when questions assume cultural knowledge: "Who wrote Hamlet?" tests literary education, not reasoning ability. Stereotype threat, demonstrated by Claude Steele and Joshua Aronson in 1995, shows that reminding test-takers of negative stereotypes about their group's intelligence measurably lowers their scores, by several IQ points in later meta-analytic estimates, even when the test content itself is neutral. Testing environment effects (unfamiliar settings, test administrator demographics, time pressure) disproportionately affect groups who have less experience with formal assessment contexts.

The Flynn Effect further undermines the idea that IQ scores reflect a fixed biological reality. James Flynn documented that average IQ scores in industrialized nations rose approximately 3 points per decade throughout the 20th century. By this measure, the average American of 1920 would score about 73 on a modern IQ test, which would place them in the range associated with intellectual disability. Nobody believes Americans in 1920 were cognitively disabled, which means IQ scores are substantially influenced by environmental factors including nutrition, education, and exposure to abstract thinking that changes across generations.

The heritability of IQ is estimated at 0.50 to 0.80 in twin studies, but heritability is frequently misunderstood. A heritability of 0.80 does NOT mean 80% of your intelligence is genetic. It means that in the population studied, 80% of the variation in scores can be attributed to genetic differences. Crucially, heritability estimates are population-specific and environment-dependent: in populations where everyone has equal access to education and nutrition, heritability appears higher (because environmental variance is reduced), while in populations with large disparities in opportunity, environmental factors dominate. A highly heritable trait can still be dramatically influenced by environment. Height is roughly 80% heritable, yet average height has increased by 4 inches in developed countries over the past century due to better nutrition.
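The environment-dependence of heritability can be made concrete with a toy additive variance model (all numbers invented for illustration):

```python
def heritability(var_genetic: float, var_environment: float) -> float:
    """h^2 = genetic variance / total phenotypic variance (toy additive model)."""
    return var_genetic / (var_genetic + var_environment)

# Identical genetic variance, two different societies:
print(heritability(var_genetic=80, var_environment=20))   # 0.80 when opportunity is equal
print(heritability(var_genetic=80, var_environment=120))  # 0.40 when opportunity varies widely
```

Nothing about the genes changed between the two lines; only the environmental spread did.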

Alternative Frameworks

Howard Gardner's theory of multiple intelligences (1983) identified eight distinct intelligences: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal, and naturalistic. Standard IQ tests cover at most three. Gardner's theory has been influential in education but criticized by psychometricians for lacking empirical validation through factor analysis, the statistical technique used to verify that a proposed cognitive ability actually represents a distinct, measurable construct.

Robert Sternberg's Triarchic Theory (1985) proposed three forms of intelligence: analytical (academic problem-solving), creative (generating novel solutions), and practical (adapting to real-world environments). His research showed that students taught in ways that addressed all three forms outperformed those taught in traditional analytical-only formats, even on conventional tests.

The most recent framework dominating the field is the Cattell-Horn-Carroll (CHC) theory, which identifies roughly 70 narrow abilities organized under 10 broad abilities, of which fluid reasoning (Gf) and crystallized knowledge (Gc) are the most prominent. Modern IQ tests like the WAIS-IV and the Woodcock-Johnson IV are explicitly designed around CHC architecture.

The Invention That Outlived Its Inventor's Intentions

Alfred Binet (1857 to 1911) created the first standardized intelligence test with a remarkably specific and modest purpose: identifying Parisian schoolchildren who needed supplementary instruction after France's 1882 compulsory education law filled classrooms with children of vastly different backgrounds. The Binet-Simon Scale of 1905 was a diagnostic tool, not a measurement of innate ability, and Binet was explicit about this distinction in ways that subsequent adopters of his work would systematically ignore.

Binet rejected three claims that would later become central to the IQ testing movement: that intelligence is a single, unitary trait; that it is fixed and immutable; and that it can be meaningfully captured by a single number. He wrote in 1905 that "the scale, properly speaking, does not permit the measure of intelligence, because intellectual qualities are not superposable, and therefore cannot be measured as linear surfaces are measured." He viewed his test as a rough practical tool, analogous to a thermometer that indicates fever without explaining the disease. The transformation of that thermometer into a permanent label on human potential is one of the more consequential misapplications in the history of psychology.

The Stanford-Binet and the Eugenics Connection

Lewis Terman's 1916 Stanford-Binet adaptation did not merely translate Binet's test into English. It embedded the test within a hereditarian ideology that Binet had explicitly opposed. Terman wrote in The Measurement of Intelligence (1916) that "the children of successful and cultured parents test higher than children from wretched and ignorant homes for the simple reason that their heredity is better." He advocated using IQ tests to identify the "feebleminded" for institutionalization and to restrict reproduction, positions that aligned directly with the eugenics movement then gaining political power in the United States.

Terman was not an outlier. Henry Goddard, who introduced the Binet-Simon Scale to America in 1908, administered IQ tests to immigrants arriving at Ellis Island in 1913 and concluded that 83% of Jewish immigrants, 80% of Hungarian immigrants, 79% of Italian immigrants, and 87% of Russian immigrants were "feebleminded." These numbers are absurd on their face: he was testing frightened, exhausted people in an unfamiliar language in a disorienting institutional setting. But the data was presented as science and used to justify restrictive immigration policy.

Robert Yerkes's Army Alpha and Beta tests (1917 to 1919) scaled IQ testing to 1.7 million military recruits and generated data that Carl Brigham used to argue in A Study of American Intelligence (1923) for the intellectual inferiority of immigrants and Black Americans. Brigham later publicly retracted these conclusions (1930), acknowledging that the tests measured cultural assimilation, not cognitive ability. He went on to create the SAT (Scholastic Aptitude Test) in 1926, which inherited the same conceptual tension between "aptitude" and "achievement" that plagues IQ testing to this day.

Psychometric Architecture: What Modern Tests Measure

The Wechsler Adult Intelligence Scale (WAIS-IV, 2008) represents the current clinical standard. It comprises 15 subtests organized into four index scores (Verbal Comprehension, Perceptual Reasoning, Working Memory, Processing Speed) that combine into a Full-Scale IQ (FSIQ). The test is normed against a stratified random sample of the national population, producing a Gaussian distribution with a mean of 100 and a standard deviation of 15.

The psychometric properties are strong within their domain. Internal consistency coefficients (Cronbach's alpha) for the FSIQ typically exceed 0.97. Test-retest reliability ranges from 0.87 to 0.96 across indices. These are impressive numbers. The problem is not reliability (does the test measure the same thing consistently?) but validity (does it measure what it claims to measure?).

The construct validity question turns on what "intelligence" means. The dominant theoretical framework in contemporary psychometrics is the Cattell-Horn-Carroll (CHC) model, which decomposes cognitive ability into roughly 70 narrow abilities organized under 10 broad stratum abilities, most prominently fluid reasoning (Gf, novel problem-solving) and crystallized intelligence (Gc, accumulated knowledge). The WAIS-IV and Stanford-Binet 5 are both designed to map onto CHC constructs, and factor-analytic studies confirm that they load onto the expected latent variables with reasonable fidelity.

But the CHC model itself is a statistical description, not a causal explanation. The "g factor" (general intelligence), first proposed by Charles Spearman in 1904, emerges consistently from factor analysis of cognitive test batteries: people who score high on one type of cognitive test tend to score high on others. Whether g represents a single underlying biological mechanism (neural efficiency? white matter connectivity? working memory capacity?) or merely reflects correlated environmental advantages (nutrition, education, test familiarity) remains one of the most contested questions in behavioral science.

The g factor paradox. Spearman's g is the most replicated finding in psychometrics: it emerges from essentially every large-scale cognitive test battery ever administered. It also explains roughly 40 to 50% of the variance in cognitive test performance, which means it is simultaneously the single most powerful predictor and an explanation that leaves more than half the variance unexplained. The appropriate response to g is not to deny its existence (the statistical evidence is overwhelming) or to reify it as "true intelligence" (it is a mathematical construct, not a thing in the brain), but to treat it as a useful but incomplete summary of cognitive performance patterns.
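For readers who want to watch g "emerge," here is a hedged simulation sketch: generate test scores that all share one common factor, then check how much variance the first principal component captures. Every parameter is invented; this illustrates the statistical phenomenon, not any real test battery:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 5000, 8

# Toy model: each test score = a shared factor ("g") plus test-specific noise
g = rng.standard_normal((n_people, 1))
loadings = rng.uniform(0.5, 0.8, size=(1, n_tests))  # how strongly each test taps g
scores = g @ loadings + 0.8 * rng.standard_normal((n_people, n_tests))

# Eigen-decomposition of the correlation matrix: a bare-bones principal-component analysis
eigvals = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))[::-1]
print(f"First component: {eigvals[0] / eigvals.sum():.0%} of total variance")
```

With these made-up loadings the first component captures roughly the 40 to 50% share described above; real test batteries show the same all-positive correlation structure.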

Predictive Validity: What the Correlations Actually Show

IQ's predictive validity varies dramatically by domain. Meta-analytic estimates compiled by Frank Schmidt and John Hunter (1998, updated 2004) show correlations of 0.51 between IQ and job performance for medium-complexity jobs and 0.56 for high-complexity jobs. These are moderate effect sizes, explaining roughly 25 to 30% of the variance. But these estimates have been challenged: subsequent analyses by Hough et al. (2001) and Richardson and Norgate (2015) suggest that range restriction corrections and criterion reliability adjustments in the Schmidt-Hunter work may have inflated the true correlations by 30 to 50%.

IQ predicts educational attainment (r = 0.55 to 0.60), income (r = 0.30 to 0.40), and occupational prestige (r = 0.50 to 0.55). It is a near-zero predictor of creative achievement beyond a threshold of approximately IQ 120, as demonstrated by both Terman's longitudinal study (in which no participants won Nobel Prizes, while two children excluded from the study for insufficient IQ did) and subsequent research by Simonton (2003) on creative eminence.

The direction of causality is also unclear. IQ predicts income, but income predicts IQ test scores too: children raised in higher-income households score higher on IQ tests than genetically similar children raised in lower-income households, as adoption studies (Duyme et al., 1999) have demonstrated. In the French adoption research, children born to low-SES parents who were adopted into high-SES families gained 12 to 16 IQ points relative to comparable children who remained in low-SES homes. These are massive environmental effects on a supposedly "innate" trait.

Cultural Bias and Stereotype Threat

Claude Steele and Joshua Aronson's 1995 experiments on stereotype threat demonstrated that Black college students performed significantly worse on a verbal reasoning test when told the test measured "intellectual ability" than when told it was a "laboratory problem-solving task." The test questions were identical; only the framing changed. Subsequent meta-analyses (Nguyen and Ryan, 2008) confirmed the effect across racial, gender, and socioeconomic groups, with threat-induced performance decrements of approximately 0.20 to 0.40 standard deviations, equivalent to 3 to 6 IQ points.
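Converting those effect sizes back to IQ points is just multiplication by the scale's standard deviation; for illustration:

```python
SD = 15  # standard deviation of the IQ scale
for d in (0.20, 0.40):
    print(f"d = {d:.2f} -> about {d * SD:.0f} IQ points")  # 3 and 6 points
```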

Content bias remains despite decades of effort to eliminate it. "Culture-fair" tests like Raven's Progressive Matrices reduce linguistic and knowledge-based bias but do not eliminate the advantage conferred by familiarity with abstract visual patterns, multiple-choice testing formats, and timed testing conditions. Test familiarity effects are well-documented: simply taking a practice IQ test before the scored test raises scores by 5 to 8 points on average (Hausknecht et al., 2007), which means the gap between "tested" and "untested" populations is partly an artifact of testing experience rather than cognitive ability.

The Flynn Effect and What It Breaks

James Flynn's observation (1984, expanded 1987 and 2007) that IQ scores in industrialized nations rose approximately 3 points per decade throughout the 20th century is one of the most disruptive findings in the history of psychometrics. The gains were largest on fluid intelligence measures (Raven's Progressive Matrices showed gains of 5 to 6 points per decade) and smallest on crystallized intelligence tests like Vocabulary. By extrapolation, the average American of 1900 would score approximately 67 on a modern IQ test, deep into the range classified as intellectual disability.

This is obviously not plausible. The Flynn Effect demonstrates that IQ scores are substantially shaped by environmental factors that change across generations: nutrition (iodine deficiency alone reduces IQ by approximately 12 points and was common before salt iodization in the 1920s), years of formal education, family size (children in smaller families score higher, and family sizes have shrunk dramatically), and cognitive complexity of the environment (video games, technology interfaces, and analytical work all exercise the abstract reasoning skills that fluid intelligence tests measure).

The Flynn Effect is important because it breaks the hereditarian interpretation of IQ gaps between groups. If environmental factors can produce a 30-point shift across three generations within a genetically stable population, then any between-group gap of comparable magnitude can also be environmental in origin. This does not prove that group gaps are entirely environmental, but it eliminates the claim that large gaps require genetic explanations.

Where the Field Stands Now

The current consensus among research psychologists, insofar as one exists, holds several positions simultaneously. IQ tests reliably measure a real set of cognitive abilities. Those abilities matter for academic and occupational performance. But IQ tests do not measure "intelligence" in any comprehensive sense: they measure a specific subset of cognitive skills, under specific testing conditions, at a specific point in time. The score is meaningfully influenced by nutrition, education, test familiarity, stereotype threat, and testing conditions. It says nothing about creativity, emotional intelligence, wisdom, moral reasoning, or practical problem-solving ability.

The most useful applications of IQ testing today are clinical rather than comparative: identifying specific cognitive strengths and weaknesses in individuals who may benefit from tailored educational support, diagnosing intellectual disabilities, and tracking cognitive changes associated with aging or neurological conditions. These applications align with Binet's original purpose. The ranking, sorting, and labeling applications align with Terman's, and they remain as scientifically problematic and ethically fraught as they were a century ago.

Sources

1. Alfred Binet and Theodore Simon, The Development of Intelligence in Children (1916 translation by Elizabeth Kite).
2. Stephen Jay Gould, The Mismeasure of Man (revised edition, W.W. Norton, 1996).
3. James R. Flynn, What Is Intelligence? Beyond the Flynn Effect (Cambridge University Press, 2007).
4. Claude M. Steele and Joshua Aronson, "Stereotype Threat and the Intellectual Test Performance of African Americans," Journal of Personality and Social Psychology 69(5), 1995.
5. Howard Gardner, Frames of Mind: The Theory of Multiple Intelligences (Basic Books, 1983).
6. Robert J. Sternberg, Beyond IQ: A Triarchic Theory of Human Intelligence (Cambridge University Press, 1985).
7. Frank L. Schmidt and John E. Hunter, "The Validity and Utility of Selection Methods in Personnel Psychology," Psychological Bulletin 124(2), 1998.
8. Michel Duyme et al., "How Can We Boost IQs of 'Dull Children'? A Late Adoption Study," Proceedings of the National Academy of Sciences 96(15), 1999.
9. Lewis M. Terman, The Measurement of Intelligence (Houghton Mifflin, 1916).
10. John C. Raven, "Mental Tests Used in Genetic Studies: The Performance of Related Individuals on Tests Mainly Educative and Mainly Reproductive," MSc thesis, University of London, 1936.