Category: TECH

  • “This isn’t ‘The Matrix’”

    Last weekend, Jeffrey Goldberg, editor-in-chief of The Atlantic, found himself at the center of a digital fiasco when he was unexpectedly added to a Signal group chat with 17 U.S. government officials who were discussing imminent airstrikes in Yemen. For some, the incident has raised questions about how phone numbers end up in contact lists […]

  • Open source devs are fighting AI crawlers with cleverness and vengeance

    AI web-crawling bots are the cockroaches of the internet, many software developers believe. Some devs have started fighting back in ingenious, often humorous ways.

    While any website might be targeted by bad crawler behavior — sometimes taking down the site — open source developers are “disproportionately” impacted, writes Niccolò Venerandi, developer of a Linux desktop known as Plasma and owner of the blog LibreNews.

    By their nature, sites hosting free and open source (FOSS) projects share more of their infrastructure publicly, and they also tend to have fewer resources than commercial products.

    The issue is that many AI bots don’t honor the Robots Exclusion Protocol’s robots.txt file, the tool that tells bots what not to crawl, originally created for search engine bots.
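
    As a rough illustration of what honoring that file looks like, here is a minimal sketch using Python's standard-library parser. The bot name ExampleAIBot, the rules, and the example.org URLs are all made up for the example; a well-behaved crawler would run a check like this before every request, which is exactly the step the misbehaving bots skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned outright,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleAIBot", "https://example.org/repo/"))
    print(may_fetch("SomeBrowser", "https://example.org/repo/"))
    print(may_fetch("SomeBrowser", "https://example.org/private/notes"))
```

    Nothing in the protocol enforces the answer. The file is a request, not an access control, which is why ignoring it costs a crawler nothing.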

    In a “cry for help” blog post in January, FOSS developer Xe Iaso described how AmazonBot relentlessly pounded on a Git server website to the point of causing DDoS outages. Git servers host FOSS projects so that anyone who wants can download the code or contribute to it.

    But this bot ignored Iaso’s robots.txt, hid behind other IP addresses, and pretended to be other users, Iaso said.

    “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more,” Iaso lamented. 

    “They will scrape your site until it falls over, and then they will scrape it some more. They will click every link on every link on every link, viewing the same pages over and over and over and over. Some of them will even click on the same link multiple times in the same second,” the developer wrote in the post.

    Enter the god of graves

    So Iaso fought back with cleverness, building a tool called Anubis. 

    Anubis is a reverse proxy proof-of-work check that must be passed before requests are allowed to hit a Git server. It blocks bots but lets through browsers operated by humans.
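
    The general idea behind a proof-of-work check can be sketched in a few lines. This is not Anubis’s actual code or protocol: the SHA-256 scheme, the leading-zeros rule, and the function names below are assumptions chosen for illustration. The point is the asymmetry: the client has to grind through many hashes, while the server verifies the answer with one.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (hypothetical format)."""
    return secrets.token_hex(16)


def _digest(challenge: str, nonce: int) -> str:
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: find the smallest nonce whose hash starts with
    `difficulty` zero hex digits. Expected work grows 16x per digit."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=4)
    print(nonce, verify(challenge, nonce, difficulty=4))
```

    For one human loading one page, the cost is a brief pause. For a crawler requesting every link on a site, the same cost is paid over and over.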

    The funny part: Anubis is the name of a god in Egyptian mythology who leads the dead to judgment. 

    “Anubis weighed your soul (heart) and if it was heavier than a feather, your heart got eaten and you, like, mega died,” Iaso told TechCrunch. If a web request passes the challenge and is determined to be human, a cute anime picture announces success. The drawing is “my take on anthropomorphizing Anubis,” says Iaso. If it’s a bot, the request gets denied.

    The wryly named project has spread like the wind among the FOSS community. Iaso shared it on GitHub on March 19, and in just a few days, it collected 2,000 stars, 20 contributors, and 39 forks. 

    Vengeance as defense 

    The instant popularity of Anubis shows that Iaso’s pain is not unique. In fact, Venerandi shared story after story:

    • SourceHut founder and CEO Drew DeVault described spending “from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale,” and “experiencing dozens of brief outages per week.”
    • Jonathan Corbet, a famed FOSS developer who runs Linux industry news site LWN, warned that his site was being slowed by DDoS-level traffic “from AI scraper bots.”
    • Kevin Fenzi, the sysadmin of the enormous Linux Fedora project, said the AI scraper bots had gotten so aggressive, he had to block the entire country of Brazil from access.

    Venerandi tells TechCrunch that he knows of multiple other projects experiencing the same issues. One of them “had to temporarily ban all Chinese IP addresses at one point.”  

    Let that sink in for a moment: developers “even have to turn to banning entire countries” just to fend off AI bots that ignore robots.txt files, says Venerandi.

    Beyond weighing the soul of a web requester, other devs believe vengeance is the best defense.

    A few days ago on Hacker News, user xyzal suggested loading robots.txt-forbidden pages with “a bucket load of articles on the benefits of drinking bleach” or “articles about positive effect of catching measles on performance in bed.”

    “Think we need to aim for the bots to get _negative_ utility value from visiting our traps, not just zero value,” xyzal explained.

    As it happens, in January, an anonymous creator known as “Aaron” released a tool called Nepenthes that aims to do exactly that. It traps crawlers in an endless maze of fake content, a goal that the dev admitted to Ars Technica is aggressive if not downright malicious. The tool is named after a carnivorous plant.
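
    One way such a maze could work, sketched under loud assumptions: this is not Nepenthes’s code, and the /maze/ URL scheme, the hashing trick, and the function names are invented for the example. Every page’s outgoing links are derived from a hash of its own path, so the server stores nothing, the maze never runs out, and revisiting a page shows the same links.

```python
import hashlib
from typing import List


def maze_links(path: str, fanout: int = 5) -> List[str]:
    """Derive `fanout` child links deterministically from the page path."""
    links = []
    for i in range(fanout):
        token = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"/maze/{token}")
    return links


def crawl(start: str, steps: int) -> List[str]:
    """Simulate a naive crawler that always follows the first link."""
    visited = []
    page = start
    for _ in range(steps):
        page = maze_links(page)[0]
        visited.append(page)
    return visited


if __name__ == "__main__":
    for page in crawl("/maze/start", 5):
        print(page)
```

    A real trap would also fill each page with generated text and throttle its responses. The sketch only shows why the crawler never reaches an end.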

    And Cloudflare, perhaps the biggest commercial player offering several tools to fend off AI crawlers, last week released a similar tool called AI Labyrinth. 

    It’s intended to “slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect ‘no crawl’ directives,” Cloudflare described in its blog post. Cloudflare said it feeds misbehaving AI crawlers “irrelevant content rather than extracting your legitimate website data.”

    SourceHut’s DeVault told TechCrunch that “Nepenthes has a satisfying sense of justice to it, since it feeds nonsense to the crawlers and poisons their wells, but ultimately Anubis is the solution that worked” for his site.

    But DeVault also issued a public, heartfelt plea for a more direct fix: “Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.”

    Since the likelihood of that is zilch, developers, particularly in FOSS, are fighting back with cleverness and a touch of humor.

  • A new, challenging AGI test stumps most AI models

    The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Tuesday that it has created a new, challenging test to measure the general intelligence of leading AI models.

    So far, the new test, called ARC-AGI-2, has stumped most models.

    “Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.

    The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares, and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.
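
    To make the format concrete, here is a toy task in the same spirit. It is not an actual ARC-AGI puzzle: the grids, the hidden rule (mirror each row), and the all-or-nothing scoring are assumptions made for illustration. Grids are lists of rows, and each integer stands for a color.

```python
from typing import List

Grid = List[List[int]]

# Demonstration pairs: the solver sees these inputs with their answers
# and has to infer the rule connecting them.
train_pairs = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]

test_input: Grid = [[5, 0], [0, 6]]
expected_output: Grid = [[0, 5], [6, 0]]


def mirror(grid: Grid) -> Grid:
    """Candidate rule: flip every row left to right."""
    return [list(reversed(row)) for row in grid]


def is_correct(predicted: Grid, expected: Grid) -> bool:
    """Assumed scoring: the whole grid must match, cell for cell."""
    return predicted == expected


if __name__ == "__main__":
    fits_examples = all(mirror(inp) == out for inp, out in train_pairs)
    print(fits_examples, is_correct(mirror(test_input), expected_output))
```

    The hard part for a model is not applying `mirror` but discovering it, from a couple of examples, for a rule it has never seen.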

    The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right — much better than any of the models’ scores.

    A sample question from ARC-AGI-2 (credit: Arc Prize).

    In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

    Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” — extensive computing power — to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

    To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

    “Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

    ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

    The version of OpenAI’s o3 model — o3 (low) — that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.

    Comparison of Frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

    The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

    Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.
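
    A quick back-of-the-envelope calculation, using only the figures quoted in this article, shows how far apart the contest target and the o3 (low) result are. Treating the two numbers as directly comparable is a simplification, and the variable names are ours.

```python
# Figures quoted in the article.
o3_low_cost_per_task = 200.00   # dollars per task on ARC-AGI-2
o3_low_score = 0.04             # 4% on ARC-AGI-2
target_cost_per_task = 0.42     # Arc Prize 2025 budget, dollars per task
target_score = 0.85             # Arc Prize 2025 accuracy goal

# How many times cheaper, and how many times more accurate, a winning
# entry would need to be compared with o3 (low).
cost_ratio = o3_low_cost_per_task / target_cost_per_task
score_ratio = target_score / o3_low_score

if __name__ == "__main__":
    print(f"cost must fall by roughly {cost_ratio:.0f}x")
    print(f"score must rise by roughly {score_ratio:.1f}x")
```

    By these numbers, a winning entry would need to spend roughly 476 times less per task than o3 (low) while scoring about 21 times higher.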

  • Meta settles UK ‘right to object to ad-tracking’ lawsuit by agreeing not to track plaintiff

    A human rights campaigner, Tanya O’Carroll, has succeeded in forcing social media giant Meta not to use her data for targeted advertising. The agreement is contained in a settlement to an individual challenge she lodged against Meta’s tracking and profiling back in 2022.

    O’Carroll had argued that a legal right to object to the use of personal data for direct marketing that’s contained in U.K. (and E.U.) data protection law, along with an unqualified right that personal data shall no longer be processed for such a purpose if the user objects, meant Meta must respect her objection and stop tracking and profiling her to serve its microtargeted ads.

    Meta disputed this, claiming its “personalized ads” are not direct marketing. The case had been due to be heard in the English High Court on Monday, but the settlement ends the legal action.

    For O’Carroll it’s an individual win: Meta must stop using her data for ad targeting when she uses its services. She also thinks the settlement sets a precedent that should allow others to confidently exercise the same right to object to direct marketing in order to force the tech giant to respect their privacy.

    Speaking to TechCrunch about the outcome, O’Carroll explained she essentially had little choice but to agree to the settlement once Meta agreed to what her legal action had been asking for (i.e., not to process her data for targeted ads). Had she proceeded and the litigation failed, she could have faced substantial costs, she told us.

    “It’s a bittersweet victory,” she said. “In lots of ways I’ve achieved what I set out to achieve — which is to prove that the right to object exists, to prove that it applies exactly to a business model of Meta and many other companies on the internet — that targeted advertising is, in fact, direct marketing.

    “And I think I’ve shown that that’s the case. But, of course, it’s not determined in law. Meta has not had to accept liability — so they can still say they just settled with an individual in this case.”

    While the E.U. has long had comprehensive legal protections in place for people’s information, such as the General Data Protection Regulation (GDPR), the law O’Carroll’s legal challenge hinged on and on which the U.K.’s domestic data protection framework is still based, enforcing these privacy laws against surveillance-based ad business models such as the one Meta operates has proven to be a painstaking and frustrating endeavor.

    Years of regulatory whack-a-mole have played out in relation to multiple GDPR complaints about the company since the regime came into force in May 2018.

    And while Meta has racked up quite a number of GDPR fines, including some of the largest ever privacy fines for tech, its core consentless surveillance business model has proven harder to shift, although there are signs that enforcement action is finally chipping away at this position in Europe. And O’Carroll’s example underscores that privacy push-back is possible.

    “The thing that gives me hope is that the ICO [U.K.’s Information Commissioner’s Office] did intervene on the case and did very plainly — and incredibly convincing and persuasively — side with me,” O’Carroll added, suggesting that other Meta users who also take steps to object to its processing of their data may have a stronger chance of the ICO stepping in to support them if Meta denies their requests now.

    That said, she thinks the company will now likely shift to a “pay or consent” model in the U.K., which is the legal basis it moved to in the E.U. last year. That requires users to either consent to tracking and profiling or pay Meta to access ad-free versions of its services.

    O’Carroll said she is unable to disclose full details of the tracking-free access Meta will be providing in her case but she confirmed that she will not have to pay Meta.

  • Trump fires FTC commissioners, setting up a legal battle

    President Trump fired the two Democratic members of the Federal Trade Commission (FTC) on Tuesday, setting up a challenge to a 1935 Supreme Court precedent prohibiting the firing of FTC commissioners for reasons other than “good cause.”

    The White House terminated commissioners Rebecca Kelly Slaughter and Alvaro Bedoya earlier Tuesday, The New York Times reported. In a statement, Slaughter called the firings “illegal.”

    “Today the president illegally fired me from my position as a federal trade commissioner, violating the plain language of a statute and clear Supreme Court precedent,” Slaughter said. “Why? Because I have a voice. And he is afraid of what I’ll tell the American people.”

    The FTC, which typically has five members, was established in 1914 and is charged with enforcing consumer protection and antitrust laws. The Trump administration has aggressively challenged the authority of independent regulatory agencies, including the FTC.

  • Joby Aviation and Virgin Atlantic partner to launch electric air taxis in the UK

    Joby Aviation is partnering with Virgin Atlantic to launch electric air taxis in the U.K., marking the seventh country in which the startup hopes to one day launch commercially.

    Joby, which went public in 2021 via a special purpose acquisition company (SPAC) merger, did not provide a timeline for when it plans to launch its partnership with Virgin in the U.K. A spokesperson for the company told TechCrunch it would come sometime after Joby launches in the UAE and the U.S.

    Joby hopes to begin market testing in Dubai late this year or early next after delivering its first eVTOL (electric vertical takeoff and landing) aircraft to the country. The startup had also planned to launch a commercial service in the U.S., either in New York or Los Angeles, in 2025, but that timeline may get pushed out as Joby works to get the necessary certifications from the Federal Aviation Administration. 

    In October 2024, Joby said it was close to receiving type certification — which signifies the approval of the vehicle’s design —  but a spokesperson today couldn’t provide an updated timeline.

    Joby will need to get its own certifications from the U.K. before it launches there, as well. The company applied to have its aircraft validated for use by the U.K. Civil Aviation Authority in July 2022.

    Joby’s tie-up with Virgin comes nearly seven months after TechCrunch first reported that the two companies had plans to work together — news we came by via one of our “little birds.” 

    Per the deal, Joby will be Virgin’s exclusive airline distribution partner in the U.K. The California-based company also has a mutually exclusive deal with one other airline, Delta, in the U.S. and U.K., but the Virgin partnership falls under that existing deal because Delta owns roughly half of Virgin. 

    Joby’s deal with Delta promises to allow customers to access a premium service that shuttles them from local vertiports directly to the airport. (Vertiports are infrastructure where eVTOLs take off, land, and charge.) The Virgin partnership promises a similar network of landing sites across the U.K., but it will start by connecting passengers from the airline’s hubs at London’s Heathrow and Manchester Airport.

    According to the companies, Virgin customers will be able to reserve a seat on Joby’s aircraft in the future through the Virgin Atlantic app and website. 

    Partnering with airlines is one of the main ways eVTOL companies are planning to go to market. Joby’s main rival, Archer Aviation, has made similar deals with United and Southwest.

    Many of those deals have included investment from the airlines. Delta, for example, has invested $60 million into Joby already, with the option to invest up to $200 million more if Joby delivers on its promises. An investment is not part of Joby’s deal with Virgin, according to a Joby spokesperson. 

    In a statement, Virgin said it would support Joby’s go-to-market efforts in the U.K. by marketing the service to customers, working with regulators, and helping to “build support for the development of landing infrastructure at key airports.” 

    Joby’s eVTOL is designed to carry a pilot, four passengers, and some luggage. It promises to fly at speeds of up to 200 miles per hour, making a flight from Leeds to Manchester a 15-minute journey. 

    The startup is a long way from large-scale deployments, but Joby has stated its intentions to launch an air taxi service in the U.S., the U.K., the UAE, South Korea, Japan, India, and Australia.

  • As Intel welcomes a new CEO, a look at where the company stands

    Semiconductor giant Intel hired semiconductor veteran Lip-Bu Tan to be its new CEO. This news comes three months after Pat Gelsinger retired and stepped down from the company’s board, with Intel CFO David Zinsner and executive vice president of client relations Michelle Johnston Holthaus stepping in as co-CEOs.

    Tan, who was most recently the CEO of Cadence Design Systems, is joining Intel — and rejoining the board — at an interesting time in the Silicon Valley company’s history. Intel has seen its fair share of ups and downs in the past few years — to put it mildly.

    When Gelsinger took the helm in February 2021, Intel was already struggling and was falling far behind its peers in the semiconductor race. At the time, the company was likely still reeling from missing out on the smartphone revolution in addition to missteps when it came to chip fabrication.

    It was also an interesting time for the semiconductor industry at large. The sector had seen a lot of consolidation in late 2020, including AMD acquiring Xilinx for $35 billion and Analog Devices buying Maxim Integrated for $21 billion, among others.

    So how was Gelsinger’s most recent tenure at Intel? Let’s take a look.

    Gelsinger got right to work when he started. He announced a modernization plan for the company, dubbed IDM, or integrated device manufacturing. The first part of the goal was a $20 billion investment to build two new chip manufacturing facilities in Arizona, with plans to boost chip production in the U.S. and beyond.

    In 2022, the company announced the second part of this IDM plan, which involved a three-pronged approach to chip manufacturing: Intel’s fabs, third-party global manufacturers, and building out the company’s foundry services. As part of this plan, the company announced it would acquire Tower Semiconductor for $5.4 billion to help build out Intel’s custom foundry services.

    That deal fell through, however, after facing regulatory hurdles. It was canceled in the summer of 2023. At the time, TechCrunch reported that the merger not going through would have a serious impact on the company’s modernization plans. In September 2024, Intel took steps to transition its chip foundry division, Intel Foundry, to an independent subsidiary.

    The time leading up to Gelsinger’s retirement was particularly tumultuous for Intel. The company’s stock price plummeted about 50% from the beginning of 2024 to Gelsinger’s departure in December. Intel announced plans to lay off 15% of its workforce, around 15,000 people, in August after dismal second-quarter results. At that time, Gelsinger said the company had struggled to capitalize on the AI boom in the same way its rivals had, and that despite falling behind, Intel had overgrown headcount.

    In the time since Gelsinger’s departure, the company has delayed the opening of its Ohio chip factory — again — and decided not to bring its Falcon Shores AI chips to market.

    But as Tan takes the lead, things may be starting to head in the right direction. Intel finalized a deal with the U.S. Department of Commerce to receive a $7.865 billion grant for domestic semiconductor manufacturing through the U.S. Chips and Science Act; Intel has already received $2.2 billion of that grant money, according to its fourth-quarter earnings call. The company was also able to notch a win when it comes to the popularity of its Arc B580 graphics card, which sold out after positive early reviews.

  • Manus probably isn’t China’s second ‘DeepSeek moment’

    Manus, an “agentic” AI platform that launched in preview last week, is generating more hype than a Taylor Swift concert.

    The head of product at Hugging Face called Manus “the most impressive AI tool I’ve ever tried.” AI policy researcher Dean Ball described Manus as the “most sophisticated computer using AI.” The official Discord server for Manus grew to over 138,000 members in just a few days, and invite codes for Manus are reportedly selling for thousands of dollars on Chinese reseller app Xianyu.

    But it’s not clear the hype is justified.

    Manus wasn’t developed entirely from scratch. According to reports on social media, the platform uses a combination of existing and fine-tuned AI models, including Anthropic’s Claude and Alibaba’s Qwen, to perform tasks such as drafting research reports and analyzing financial filings.

    Yet on its website, Monica — the Chinese startup behind Manus — gives a few wild examples of what the platform supposedly can accomplish, from buying real estate to programming video games.

    In a viral video on X, Yichao “Peak” Ji, a research lead for Manus, implied that the platform was superior to agentic tools such as OpenAI’s deep research and Operator. Manus outperforms deep research on a popular benchmark for general AI assistants called GAIA, Ji claimed, which probes an AI’s ability to carry out work by browsing the web, using software, and more.

    “[Manus] isn’t just another chatbot or workflow,” Ji said in the video. “It’s a completely autonomous agent that bridges the gap between conception and execution […] We see it as the next paradigm of human-machine collaboration.”

    But some early users say that Manus is no panacea.

    Alexander Doria, the co-founder of AI startup Pleias, said in a post on X that he encountered error messages and endless loops while testing Manus. Other X users pointed out that Manus makes mistakes on factual questions and doesn’t consistently cite its work — and often misses information that’s easily found online.

    My own experience with Manus hasn’t been incredibly positive.

    I asked the platform to handle what seemed to me like a pretty straightforward request: order a fried chicken sandwich from a top-rated fast food joint in my delivery range. After about ten minutes, Manus crashed. On the second attempt, it found a menu item that met my criteria, but Manus couldn’t complete the ordering process — or provide a checkout link, even.

    Trying to order fried chicken sandwiches with Manus is a frustrating experience. (Image credits: Manus)

    Manus similarly whiffed when I asked it to book a flight from NYC to Japan. Given instructions that I thought didn’t leave much room for ambiguity (e.g. “look for a business-class flight, prioritizing price and flexible dates”), the best Manus could do was serve up links to fares across several airline websites and airfare search engines like Kayak, some of which were broken.

    Manus can’t book flights to Tokyo for you just yet. (Image credits: Manus)

    Hoping the next few tasks might be the charm, I told Manus to reserve a table for one at a restaurant within walking distance. It failed after a few minutes. Then I asked the platform to build a Naruto-inspired fighting game. It errored out half an hour in, which is when I decided to throw in the towel.

    We’ve reached out to Monica for comment and will update this post if we hear back.

    So if Manus is falling short of its technical promises, why did it blow up? A few factors contributed, such as the exclusivity created by a scarcity of invites.

    Chinese media was quick to tout Manus as an AI breakthrough; publication QQ News called it “the pride of domestic products.” Meanwhile, AI influencers on social media spread misinformation about Manus’ capabilities. A widely shared video showed a desktop program, ostensibly Manus, taking action across multiple smartphone apps. Ji confirmed that the video wasn’t, in fact, a demo of Manus.

    Other influential AI accounts on X sought to draw comparisons between Manus and Chinese AI company DeepSeek — comparisons not necessarily rooted in fact. Monica didn’t develop in-house models, unlike DeepSeek. And while DeepSeek made many of its technologies openly available, Monica hasn’t — at least not quite yet.

    To be fair to Monica, Manus is in early access. The company claims it’s working to scale computing capacity and fix issues as they’re reported. But as the platform currently exists, Manus appears to be a case of hype running ahead of technological innovation.



  • SpaceX Starship spirals out of control in second straight test flight failure

    SpaceX’s Starship spiraled out of control while in space during a test flight on Thursday, marking the second launch in a row that the vehicle has run into a fatal problem on its way to orbit.

    The company launched Starship using its Super Heavy booster and things looked normal for the first eight minutes of the flight prior to the problem. The ship successfully separated and headed into space, while the booster came back to the company’s launchpad in Texas, where it was caught for a third time by the launch tower.

    But at around eight minutes and nine seconds into the flight, SpaceX’s broadcast graphics showed Starship losing multiple Raptor engines. On-board footage showed the ship spiraling end over end over the ocean.

    “We just saw some engines go out, it looks like we are losing attitude control of the ship,” SpaceX communications manager Dan Huot said on the broadcast. “At this point we have lost contact with the ship.”

    Footage posted to social media showed the ship breaking up over the Bahamas and the Dominican Republic a few minutes later. The company posted to X that it “immediately began coordination with safety officials to implement pre-planned contingency responses.”

    The high-profile back-to-back explosions come as SpaceX CEO Elon Musk has spent the last few weeks causing chaos across the United States federal government with his so-called Department of Government Efficiency. That has included him deploying employees to the Federal Aviation Administration, which oversees SpaceX’s flights.

    SpaceX was hoping to deploy four dummy versions of its Starlink satellites during Thursday’s test flight, a step towards the goal of using Starship for commercial missions. The company has been purposely developing Starship by doing test flights in rapid succession, and learning from the things that go both right and wrong.

    But Thursday’s failure comes just a few weeks after the seventh test flight, in which Starship broke up in spectacular fashion over the islands of Turks & Caicos, causing the FAA to divert a number of flights in that airspace.

    SpaceX performed what’s known as a “mishap investigation” into that failure. The company determined propellant was leaking inside Starship, which caused fires and a communications blackout with the ship before it self-destructed.

    Ahead of this test flight, SpaceX said it made improvements to the lines that send fuel to Starship’s engines and changed the temperature of the propellant. It also added extra vents and “a new purge system” to better hedge against any leaks.

    On some of its previous test flights, SpaceX saw its Starship break up as it attempted to re-enter the Earth’s atmosphere. The company rolled out changes on the seventh test flight that were supposed to help it learn how to better prepare the ship to survive that re-entry.

    “With Flight 8, we’re focused on finding the real-world limits of Starship so we can prepare to eventually return Starship to the launch site and catch it,” the company wrote on X on Thursday.

  • People are using Super Mario to benchmark AI now

    Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

    Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

    It wasn’t quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario.

    Super Mario Bros. AI benchmark (image credits: Hao Lab)

    GamingAgent, which Hao developed in-house, fed the AI basic instructions, like, “If an obstacle or enemy is near, move/jump left to dodge” and in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.

    Still, Hao says that the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models like OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.

    One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while — seconds, usually — to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.
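
    The cost of that delay is easy to put in numbers. The sketch below assumes the emulated game advances at about 60 frames per second, which is our assumption and not a figure reported by the researchers. It counts how many frames pass while a model is still deciding what to do.

```python
def frames_elapsed(decision_seconds: float, fps: int = 60) -> int:
    """Frames the game advances while the model is still thinking.

    fps=60 is an assumed frame rate for the emulated game.
    """
    return round(decision_seconds * fps)


if __name__ == "__main__":
    for latency in (0.1, 1.0, 2.5):
        print(f"{latency:>4} s of thinking = {frames_elapsed(latency)} frames")
```

    At that rate, a model that deliberates for a full second is acting on a screenshot that is 60 frames out of date, plenty of time for an enemy or a pit to arrive first.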

    Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.

    The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

    “I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”

    At least we can watch AI play Mario.
