Episodes

Monday Feb 17, 2025
Summary of https://www.researchgate.net/publication/388234257_What_large_language_models_know_and_what_people_think_they_know
This study investigates how well large language models (LLMs) communicate their uncertainty to users and how human perception aligns with the LLMs' actual confidence. The research identifies a "calibration gap" where users overestimate LLM accuracy, especially with default explanations.
Longer explanations increase user confidence without improving accuracy, indicating shallow processing. By tailoring explanations to reflect the LLM's internal confidence, the study demonstrates a reduction in both the calibration and discrimination gaps, leading to improved user perception of LLM reliability.
The study underscores the importance of transparent uncertainty communication for trustworthy AI-assisted decision-making, advocating for explanations aligned with model confidence.
The study examines how well LLMs communicate uncertainty and how humans perceive the accuracy of LLM responses. It identifies gaps between LLM confidence and human confidence and explores methods to improve user perception of LLM accuracy.
Here are 5 key takeaways:
Calibration and Discrimination Gaps: There's a notable difference between an LLM's internal confidence in its answers and how confident humans are in those same answers. Humans often overestimate the accuracy of LLM responses and are not good at distinguishing between correct and incorrect answers based on default explanations (a toy illustration of both gaps follows this list).
Explanation Length Matters: Longer explanations from LLMs tend to increase user confidence, even if the added length doesn't actually improve the accuracy or informativeness of the answer.
Uncertainty Language Influences Perception: Human confidence is strongly influenced by the type of uncertainty language used in LLM explanations. Low-confidence statements lead to lower human confidence, while high-confidence statements lead to higher human confidence.
Tailoring Explanations Reduces Gaps: By adjusting LLM explanations to better reflect the model's internal confidence, the calibration and discrimination gaps can be narrowed. This improves user perception of LLM accuracy.
Limited User Expertise: Participants in the study generally lacked the expertise to accurately assess LLM responses independently. Even when users altered the LLM's answer, their accuracy was lower than the LLM's.
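To make the calibration and discrimination gaps concrete, here is a minimal sketch with made-up numbers; the figures below are illustrative only and are not data from the study.

```python
# Minimal sketch with made-up numbers: the calibration gap is the distance between
# average confidence and actual accuracy; discrimination is the confidence gap
# between correct and incorrect answers.
import numpy as np

model_confidence = np.array([0.92, 0.55, 0.78, 0.40, 0.85])  # LLM's internal confidence
human_confidence = np.array([0.95, 0.80, 0.90, 0.75, 0.95])  # participants' confidence in the LLM
correct = np.array([1, 0, 1, 0, 1])                          # whether the LLM was actually right

accuracy = correct.mean()
model_calibration_gap = abs(model_confidence.mean() - accuracy)
human_calibration_gap = abs(human_confidence.mean() - accuracy)

def discrimination(confidence, correct):
    """Average confidence on correct answers minus average confidence on incorrect ones."""
    return confidence[correct == 1].mean() - confidence[correct == 0].mean()

print(f"model calibration gap: {model_calibration_gap:.2f}")  # ~0.10
print(f"human calibration gap: {human_calibration_gap:.2f}")  # ~0.27
print(f"model discrimination:  {discrimination(model_confidence, correct):.2f}")
print(f"human discrimination:  {discrimination(human_confidence, correct):.2f}")
```

On these toy numbers, human confidence sits further from the true accuracy (a larger calibration gap) and separates correct from incorrect answers less sharply (weaker discrimination), which is the qualitative pattern the study reports.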

Monday Feb 17, 2025
Summary of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5136877
This research paper explores the impact of Generative AI on the labor market. A new survey analyzes the use of these tools, finding that they are most commonly used by younger, more educated, and higher-income individuals in specific industries.
The study finds that approximately 30% of respondents have used Generative AI at work. It investigates the efficiency gains from using Generative AI and its role in job searches. The paper aims to measure the large-scale labor market effects of Generative AI and the wage structure impacts of such tools. Finally, the researchers intend to continue tracking Generative AI and its effect on the labor market in real time.
Here are the key takeaways regarding the labor market effects of Generative AI, according to the source:
As of December 2024, 30.1% of survey respondents over 18 have used Generative AI at work since these tools became available to the public.
Generative AI tools are most commonly used by younger, more educated, and higher-income individuals, as well as those in customer service, marketing, and IT.
Among workers who use generative AI, usage spans about one-third of the work week and averages 7 tasks per week, with the tools mainly used to help complete tasks more quickly.
Workers using Generative AI spend approximately 30 minutes interacting with the tool to complete a task that they estimate would take 90 minutes without it, suggesting that Generative AI can potentially triple worker productivity on those tasks (a quick arithmetic check follows this list).
The impact of LLMs can be a substitute for some forms of labor while also acting as a productivity-enhancing complement for other forms of labor.
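As a quick check of the arithmetic behind the tripling claim, using the reported figures (30 minutes with the tool versus an estimated 90 minutes without):

```python
# Back-of-the-envelope check of the tripling claim, using the reported figures.
minutes_with_ai = 30      # reported time to finish a task with Generative AI
minutes_without_ai = 90   # respondents' estimate for the same task unaided

speedup = minutes_without_ai / minutes_with_ai
print(f"Implied productivity multiple on assisted tasks: {speedup:.0f}x")  # -> 3x
```

The threefold figure applies to the assisted tasks themselves, not to the whole work week.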

Monday Feb 17, 2025
Summary of https://arxiv.org/pdf/2502.02649
The paper argues against developing fully autonomous AI agents due to the increasing risks they pose to human safety, security, and privacy.
It analyzes different levels of AI agent autonomy, highlighting how risks escalate as human control diminishes. The authors contend that while semi-autonomous systems offer a more balanced risk-benefit profile, fully autonomous agents have the potential to override human control.
They emphasize the need for clear distinctions between agent autonomy levels and the development of robust human control mechanisms. The research also identifies potential benefits related to assistance, efficiency, and relevance, but concludes that the inherent risks, especially concerning accuracy and truthfulness, outweigh these advantages in fully autonomous systems.
The paper advocates for caution and control in AI agent development, suggesting that human oversight should always be maintained, and proposes solutions to better understand the risks associated with autonomous systems.
Here are five key takeaways regarding the development and ethical implications of AI agents, according to the source:
The development of fully autonomous AI agents—systems that can write and execute code beyond predefined constraints—should be avoided due to potential risks.
Risks to individuals increase with the autonomy of AI systems because the more control ceded to an AI agent, the more risks arise. Safety risks are particularly concerning, as they can affect human life and impact other values.
AI agent levels can be placed on a scale of decreasing user input and decreasing developer-written code: the more autonomous the system, the more human control is ceded (a toy sketch of such a scale follows this list).
Increased autonomy in AI agents can amplify existing vulnerabilities related to safety, security, privacy, accuracy, consistency, equity, flexibility, and truthfulness.
There are potential benefits to AI agent development, particularly with semi-autonomous systems that retain some level of human control, which may offer a more favorable risk-benefit profile depending on the degree of autonomy and complexity of assigned tasks. These benefits include assistance, efficiency, equity, relevance, and sustainability.
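A toy sketch of an autonomy scale along these lines appears below; the level names and the oversight rule are illustrative assumptions, not the paper's exact taxonomy.

```python
# Toy autonomy scale: higher levels cede more control to the agent.
# Level names and the oversight rule are illustrative, not the paper's taxonomy.
from enum import IntEnum

class AgentAutonomy(IntEnum):
    MODEL_ONLY = 0        # model output does not affect program flow
    TOOL_CALL = 1         # developer-written code decides when and how tools run
    MULTI_STEP = 2        # model chooses which predefined steps or tools to execute
    FULLY_AUTONOMOUS = 3  # model can write and run new code beyond predefined constraints

def oversight_required(level: AgentAutonomy) -> bool:
    """Toy policy reflecting the paper's argument: retain human control as autonomy grows."""
    return level >= AgentAutonomy.MULTI_STEP

print(oversight_required(AgentAutonomy.TOOL_CALL))         # False
print(oversight_required(AgentAutonomy.FULLY_AUTONOMOUS))  # True
```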

Monday Feb 17, 2025
Summary of https://arxiv.org/pdf/2409.09047
This paper explores the effects of large language models (LLMs) on student learning in coding classes. Three studies were conducted to analyze how LLMs impact learning outcomes, revealing both positive and negative effects.
Using LLMs as personal tutors by asking for explanations was found to improve learning, while relying on them to solve exercises hindered it.
Copy-and-paste functionality was identified as a key factor influencing LLM usage and its subsequent impact. The research also demonstrates that students may overestimate their learning progress when using LLMs, highlighting potential pitfalls.
Finally, results indicated that less skilled students may benefit more from LLMs when learning to code.
Here are five key takeaways regarding the use of Large Language Models (LLMs) in learning to code, according to the source:
LLMs can have both positive and negative effects on learning outcomes. Using LLMs as personal tutors by asking for explanations can improve learning, but relying on them excessively to solve practice exercises can impair learning.
Copy-and-paste functionality plays a significant role in how LLMs are used. It enables solution-seeking behavior, which can decrease learning.
Students with less prior domain knowledge may benefit more from LLM access. However, those new to LLMs may be more prone to over-reliance.
LLMs can increase students’ perceived learning progress, even when controlling for actual progress. This suggests that LLMs may lead to an overestimation of one’s own abilities.
The effect of LLM usage on learning depends on balancing reliance on LLM-generated solutions and using LLMs as personal tutors, and can vary depending on the specific case.

Monday Feb 17, 2025
Summary of https://mitsloanedtech.mit.edu/ai/teach/ai-detectors-dont-work
AI detection software is unreliable and should not be used to police academic integrity. Instead, instructors should establish clear AI use policies, promote transparent discussions about appropriate AI usage, and design engaging assignments that motivate genuine student learning.
Thoughtful assignment design can foster intrinsic motivation and reduce the temptation to misuse AI. It is also important to employ inclusive teaching methods and fair assessments so all students have the opportunity to succeed. Ultimately, the source promotes the idea that human-centered learning experiences will always be more impactful for students.
Here are the key takeaways regarding AI use in education, according to the source:
AI detection software is unreliable and can lead to false accusations of misconduct.
It is important to establish clear policies and expectations regarding whether, when, and how AI should be used in coursework, and to communicate these to students both in writing and in person.
Instructors should promote transparency and open dialogue with students about AI tools to build trust and facilitate meaningful learning.
Thoughtfully designed assignments can foster intrinsic motivation and reduce the temptation to misuse AI.
To ensure inclusive teaching, use a mix of assessment approaches to give every student an equitable opportunity to demonstrate their capabilities.

Monday Feb 10, 2025
Summary of https://assets.anthropic.com/m/2e23255f1e84ca97/original/Economic_Tasks_AI_Paper.pdf
This research paper uses data from four million conversations on the Claude.ai platform to empirically analyze how artificial intelligence (AI) is currently used across various occupational tasks in the US economy.
The study maps these conversations to the US Department of Labor's O*NET database to identify usage patterns, finding that AI is most heavily used in software development and writing tasks. The analysis also examines the depth of AI integration within occupations, the types of skills involved in human-AI interactions, and how AI is used to augment or automate tasks.
The researchers acknowledge limitations in their data and methodology but highlight the importance of their empirical approach for tracking AI's evolving role in the economy. The findings suggest AI's current impact is task-specific rather than resulting in complete job displacement.
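As a rough illustration of the conversation-to-task mapping described above, here is a minimal stand-in that scores a conversation against paraphrased O*NET-style task statements by text similarity; the paper's actual pipeline uses model-based classification, so the task texts and similarity approach below are placeholders, not its method.

```python
# Illustrative stand-in for the conversation-to-task mapping (not the paper's pipeline):
# score a conversation against paraphrased O*NET-style task statements with TF-IDF similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

onet_tasks = [
    "Write, analyze, review, and rewrite computer programs",  # software developer
    "Plan menus and diets based on nutritional needs",        # dietitian
    "Plan tour itineraries and book travel arrangements",     # travel agent
]
conversation = "Can you help me debug this Python function and rewrite it more cleanly?"

vectorizer = TfidfVectorizer().fit(onet_tasks + [conversation])
similarities = cosine_similarity(
    vectorizer.transform([conversation]), vectorizer.transform(onet_tasks)
)[0]
best = similarities.argmax()
print(f"Best-matching task: {onet_tasks[best]!r} (similarity {similarities[best]:.2f})")
```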
Here are some surprising facts revealed by the analysis of AI usage patterns in the sources:
AI is not primarily used for automating entire job roles, but rather for specific tasks within occupations. While there is a lot of discussion about AI replacing jobs, the data suggests that AI is more commonly used to enhance human capabilities in specific tasks. This is reflected in the finding that only about 4% of occupations use AI for at least 75% of their tasks.
The peak AI usage is in mid-to-high wage occupations, not in the highest wage brackets. It might be expected that AI would be adopted most in the highest-paying professions, but the analysis shows that occupations requiring considerable preparation, such as those needing a bachelor's degree, and those with mid-to-high salaries are seeing more AI use. This could be because these roles involve tasks that are well-suited to current AI capabilities.
AI is being used for both augmentation and automation almost equally. While there's a lot of focus on AI replacing human work, the study found that 57% of AI interactions showed augmentative patterns (enhancing human capabilities), while 43% demonstrated automation-focused usage (performing tasks directly). This reveals that AI is serving both as an efficiency tool and a collaborative partner.
Cognitive skills are highly represented in AI conversations, but not necessarily at an expert level. Skills like Critical Thinking, Reading Comprehension, and Writing are prevalent; however, the analysis only captures whether a skill was exhibited in the AI's responses, not whether that skill was central to the user's purpose or performed at an expert level. For example, active listening appears as a common skill because the AI rephrases user inputs and asks clarifying questions, rather than because users are seeking listening-focused interactions.
There is a clear specialization in how different AI models are used. For instance, Claude 3.5 Sonnet is used more for coding and software development, while Claude 3 Opus is preferred for creative and educational work. This suggests that different models are not interchangeable, but rather are being adopted to meet specific needs in the economy.
A significant portion of "non-work" interactions still mapped meaningfully to occupational tasks. For example, personal nutrition planning related to dietitian tasks, automated trading strategy development related to financial analyst tasks, and travel itinerary planning related to travel agent tasks. This suggests that AI is influencing a variety of tasks, even in informal contexts.
AI usage is not evenly distributed across all sectors. The study found the highest AI usage in tasks associated with software development, technical writing, and analytical roles. Occupations involving physical labor and those requiring extensive specialized training showed notably lower usage.

Monday Feb 10, 2025
Summary of https://arxiv.org/pdf/2501.07542
This research paper introduces Multimodal Visualization-of-Thought (MVoT), a novel approach to enhance complex reasoning in large language models (LLMs), particularly in spatial reasoning tasks.
Unlike traditional Chain-of-Thought prompting which relies solely on text, MVoT incorporates visual thinking by generating image visualizations of the reasoning process. The researchers implement MVoT using a multimodal LLM and introduce a token discrepancy loss to improve image quality.
Experiments across various spatial reasoning tasks demonstrate MVoT's superior performance and robustness compared to existing methods, showcasing the benefits of integrating visual and verbal reasoning. The findings highlight the potential of multimodal reasoning for improving LLM capabilities.
Multimodal Visualization-of-Thought (MVoT) is a novel reasoning paradigm that enables models to generate visual representations of their reasoning process, using both words and images. This approach is inspired by human cognition, which uses both verbal and non-verbal channels for information processing. MVoT aims to enhance reasoning quality and model interpretability by providing intuitive visual illustrations alongside textual representation.
MVoT outperforms traditional Chain-of-Thought (CoT) prompting in complex spatial reasoning tasks. While CoT relies solely on verbal thought, MVoT incorporates visual thought to visualize reasoning traces, making it more robust to environmental complexity. MVoT demonstrates better stability and robustness, especially in challenging scenarios where CoT tends to fail, such as in the FROZENLAKE task with complex environments.
Token discrepancy loss enhances the quality of generated visualizations. This loss bridges the gap between separately trained tokenizers in autoregressive Multimodal Large Language Models (MLLMs), improving visual coherence and fidelity. By minimizing the discrepancy between predicted and actual visual embeddings, it reduces redundant patterns and inaccuracies in generated images (a simplified sketch of such a loss appears after these takeaways).
MVoT is more robust to environment complexity compared to CoT. CoT's performance deteriorates as environmental complexity increases, especially in tasks like FROZENLAKE, where CoT struggles with inaccurate coordinate descriptions. MVoT maintains stable performance across varying grid sizes and complexities by visualizing the reasoning process, offering a more direct and interpretable way to track the reasoning process.
MVoT can complement CoT and enhance overall performance. Combining predictions from MVoT and CoT results in significantly higher accuracy, indicating that they offer alternative reasoning strategies. MVoT can also be used as a plug-in for proprietary models like GPT-4o, improving its performance by providing visual thoughts during the reasoning process.
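For intuition about the token discrepancy loss, here is a simplified sketch that weights predicted probability mass by the embedding distance to the ground-truth visual token, so near-miss tokens cost less than unrelated ones; this is an assumption-level approximation, not the paper's exact formulation.

```python
# Simplified token-discrepancy-style loss (illustrative, not the paper's exact loss):
# expected embedding distance to the ground-truth visual token under the predicted distribution.
import torch

def token_discrepancy_loss(pred_probs, codebook, target_ids):
    """
    pred_probs : (batch, vocab) predicted distribution over visual tokens
    codebook   : (vocab, dim)   visual tokenizer codebook embeddings
    target_ids : (batch,)       ground-truth visual token ids
    """
    target_emb = codebook[target_ids]          # (batch, dim)
    dist = torch.cdist(target_emb, codebook)   # (batch, vocab) distance to every codebook entry
    return (pred_probs * dist).sum(dim=-1).mean()

# Toy usage
vocab, dim = 8, 4
codebook = torch.randn(vocab, dim)
pred_probs = torch.softmax(torch.randn(2, vocab), dim=-1)
target_ids = torch.tensor([3, 5])
print(token_discrepancy_loss(pred_probs, codebook, target_ids))
```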

Monday Feb 10, 2025
Summary of https://advait.org/files/lee_2025_ai_critical_thinking_survey.pdf
This research paper examines the effects of generative AI tools on the critical thinking skills of knowledge workers. A survey of 319 knowledge workers, analyzing 936 real-world examples of GenAI use, reveals that while GenAI reduces perceived cognitive effort, it can also decrease critical engagement and potentially lead to over-reliance.
The study identifies factors influencing critical thinking, such as user confidence in both themselves and the AI, and explores how GenAI shifts the nature of critical thinking in knowledge work tasks. The findings highlight design challenges and opportunities for creating GenAI tools that better support critical thinking.
Here are 5 key takeaways from the provided research on the impact of generative AI (GenAI) on critical thinking among knowledge workers:
GenAI can reduce the effort of critical thinking, but also engagement. While GenAI tools can automate tasks and make information more readily available, this may lead to users becoming over-reliant on AI and reducing their own critical thinking and problem-solving skills.
Confidence in AI negatively correlates with critical thinking, while self-confidence has the opposite effect. The study found that when users have higher confidence in AI's ability to perform a task, they tend to engage in less critical thinking. Conversely, those who have more confidence in their own skills are more likely to engage in critical thinking, even if it requires more effort.
Critical thinking with GenAI shifts from task execution to task oversight. Knowledge workers using GenAI shift their focus from directly producing material to overseeing the AI's work. This includes verifying information, integrating AI responses, and ensuring the output meets quality standards.
Motivators for critical thinking include work quality, avoiding negative outcomes, and skill development. Knowledge workers are motivated to think critically when they want to improve the quality of their work, avoid errors or negative consequences, and develop their own skills.
Barriers to critical thinking include lack of awareness, motivation, and ability. Users may not engage in critical thinking due to a lack of awareness of the need for it, limited motivation due to time pressure or job scope, or because they find it difficult to improve AI responses. Also, some users may consider critical thinking unnecessary when using AI for secondary or trivial tasks, or overestimate AI capabilities.

Monday Feb 10, 2025
Summary of https://oms-www.files.svdcdn.com/production/downloads/reports/Who%20should%20develop%20which%20AI%20evaluations.pdf
This research memo examines the optimal actors for developing AI model evaluations, considering conflicts of interest and expertise requirements. It proposes a taxonomy of four development approaches (government-led, government-contractor collaborations, third-party grants, and direct AI company development) and nine criteria for selecting developers.
The authors suggest a two-step sorting process to identify suitable developers and recommend measures for a market-based ecosystem fostering diverse, high-quality evaluations, emphasizing a balance between public accountability and private-sector efficiency.
The memo also explores challenges like information sensitivity, model access, and the blurred boundaries between evaluation development, execution, and interpretation. Finally, it proposes several strategies for creating a sustainable market for AI model evaluations.
The authors of this document are Lara Thurnherr, Robert Trager, Amin Oueslati, Christoph Winter, Cliodhna Ní Ghuidhir, Joe O'Brien, Jun Shern Chan, Lorenzo Pacchiardi, Anka Reuel, Merlin Stein, Oliver Guest, Oliver Sourbut, Renan Araujo, Seth Donoughe, and Yi Zeng.
Here are five of the most impressive takeaways from the document:
A variety of actors could develop AI evaluations, including government bodies, academics, third-party organizations, and AI companies themselves. Each of these actors has different characteristics, strengths, and weaknesses. The document outlines a framework for deciding which of these actors is best suited to develop specific AI evaluations, based on risk and method criteria.
There are four main approaches to developing AI evaluations: AI Safety Institutes (AISIs) developing evaluations independently, AISIs collaborating with contracted experts, funding third parties for independent development, and AI companies developing their own evaluations. Each approach has its own advantages and disadvantages. For instance, while AI companies developing their own evaluations might be cost-effective and leverage their expertise, this approach may create a conflict of interest.
Nine criteria can help determine who should develop specific evaluations. These criteria are divided into risk-related and method-related categories. Risk-related criteria include required risk-related skills and expertise, information sensitivity and security clearances, evaluation urgency, and risk prevention incentives. Method-related criteria include the level of model access required, evaluation development costs, required method-related skills and expertise, and verifiability and documentation (a toy screen-then-rank sketch using these two groups of criteria follows this list).
A market-based ecosystem for AI evaluations is crucial for long-term success. This ecosystem could be supported by measures such as developing and publishing tools, establishing standards and best practices, providing legal certainty and accreditation for third-party evaluators, brokering relationships between third parties and AI companies, and mandating information sharing on evaluation development. Public bodies could also offer funding and computational resources to academic researchers interested in developing evaluations.
The decision of who develops AI evaluations is complex and depends on the specific context. The document emphasizes the importance of considering multiple factors, including the risk being assessed, the methods used, the capabilities of the potential developers, and the potential for conflicts of interest. It suggests that a systematic approach to decision-making can improve the overall quality and effectiveness of AI evaluations.
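A toy rendering of the two-step sorting idea appears below: screen developer types on risk-related criteria, then rank the remainder on method-related ones. The fields, values, and ordering logic are invented for illustration and do not come from the memo.

```python
# Toy two-step screen: filter on risk-related criteria, then rank on method-related criteria.
# Fields, values, and ordering logic are invented for illustration.
from dataclasses import dataclass

@dataclass
class Developer:
    name: str
    security_clearance: bool     # risk-related: can handle sensitive information
    conflict_of_interest: bool   # risk-related: e.g., evaluating its own models
    model_access: int            # method-related: 0 = API only .. 3 = full weights
    method_expertise: int        # method-related: 1 (low) .. 5 (high)

candidates = [
    Developer("AI Safety Institute", True, False, 2, 4),
    Developer("Contracted external experts", False, False, 1, 5),
    Developer("AI company (self-evaluation)", True, True, 3, 5),
]

# Step 1: screen on risk-related criteria (sensitive evaluation, no conflict of interest)
eligible = [d for d in candidates if d.security_clearance and not d.conflict_of_interest]

# Step 2: rank the remainder on method-related criteria
ranked = sorted(eligible, key=lambda d: (d.method_expertise, d.model_access), reverse=True)
print([d.name for d in ranked])  # -> ['AI Safety Institute']
```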

Friday Feb 07, 2025
Summary of https://arxiv.org/pdf/2412.14232v1
This paper contrasts Human-in-the-Loop (HIL) and AI-in-the-Loop (AI2L) systems in artificial intelligence. HIL systems are AI-driven, with humans providing feedback, while AI2L systems place humans in control, using AI as a support tool.
The authors argue that current evaluation methods often favor HIL systems, neglecting the human's crucial role in AI2L systems. They propose a shift towards more human-centric evaluations for AI2L systems, emphasizing factors like interpretability and impact on human decision-making.
The paper uses various examples across diverse domains to illustrate these distinctions, advocating for a more nuanced understanding of human-AI collaboration beyond simple automation. Ultimately, the authors suggest AI2L may be more suitable for complex or ill-defined tasks, where human expertise and judgment remain essential.
Here are the five most relevant takeaways from the source, emphasizing the shift from a traditional HIL perspective to an AI2L approach:
Control is the Key Differentiator: The crucial difference between Human-in-the-Loop (HIL) and AI-in-the-Loop (AI2L) systems lies in who controls the decision-making process. In HIL systems, AI is in charge, using human input to guide the model, while in AI2L systems, the human is in control, with AI acting as an assistant. Many systems currently labeled as HIL are, in reality, AI2L systems (a toy contrast of the two loop structures follows this list).
Human Roles are Reconsidered: HIL systems often treat humans as data-labeling oracles or sources of domain knowledge. This perspective overlooks the potential of humans to be active participants who significantly influence system performance. AI2L systems, in contrast, are human-centered, placing the human at the core of the system.
Evaluation Metrics Must Change: Traditional metrics like accuracy and precision are suitable for HIL systems, but AI2L systems require a human-centered approach to evaluation. This involves considering factors such as calibration, fairness, explainability, and the overall impact on the human user. Ablation studies are also essential to evaluate the impact of different components on the overall AI2L system.
Bias and Trust are Different: HIL systems are prone to biases from historical data and human experts. AI2L systems are also susceptible to data and algorithmic biases but are more vulnerable to biases arising from how humans interpret AI outputs. Trust in HIL systems depends on the credibility of the human teachers, while trust in AI2L systems relies on transparency, explainability, and interpretability.
A Shift in Mindset is Necessary: Moving from HIL to AI2L involves a fundamental shift in how we approach AI system design and deployment. It means recognizing that AI is there to enhance human expertise, rather than replace it. This shift involves viewing AI deployment as an intervention within existing human-driven processes, and focusing on collaborative rather than purely automated solutions.
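To make the control distinction concrete, here is a toy contrast of the two loop structures; the callables and the confidence threshold are illustrative assumptions, not code from the paper.

```python
# Toy contrast: in HIL the AI drives and queries the human; in AI2L the human drives
# and may consult the AI. Callables and threshold are illustrative.

def human_in_the_loop(model, human, items):
    """HIL: the AI is in control and decides when to ask the human."""
    decisions = []
    for item in items:
        prediction, confidence = model(item)
        if confidence < 0.7:            # the AI chooses when to involve the human
            prediction = human(item)
        decisions.append(prediction)
    return decisions

def ai_in_the_loop(model, human, items):
    """AI2L: the human is in control; the AI only offers a suggestion."""
    decisions = []
    for item in items:
        suggestion, _ = model(item)
        decisions.append(human(item, suggestion=suggestion))  # the human makes the final call
    return decisions

# Toy usage with stub callables
model = lambda text: (("urgent" if "asap" in text else "routine"), 0.9)
human = lambda text, suggestion=None: suggestion  # a rubber-stamping human, for illustration
print(ai_in_the_loop(model, human, ["please review asap", "weekly status update"]))
# -> ['urgent', 'routine']
```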