I'm not sure the direction should be to fine-tune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the ones the author suggests (i.e., who was the president between X and Y). Similarly, they are a little too lightweight to be used for translation.
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
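To give a concrete (entirely hypothetical) sketch of what "well documented" could look like: every artifact in such a dataset might carry provenance, language, and licensing metadata that any future training pipeline can filter on. The field names below are my own invention, not any existing standard.

    # A minimal sketch of a documented corpus entry; field names are
    # hypothetical, chosen only to illustrate the idea.
    record = {
        "text": "O texto do artefacto em si...",      # the artifact itself
        "source": "Biblioteca Nacional de Portugal",  # where it was collected
        "language": "pt-PT",     # a BCP 47 tag distinguishes pt-PT from pt-BR
        "license": "CC-BY-4.0",  # lets a future training run filter on rights
        "year": 1958,
    }

That way a lab training a private model and a community training an open one can both pull in the same well-labeled material.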
I agree, the research is complex enough as it is without having to worry about splitting it, Babel-like, into multiple languages.
It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.
European Portuguese is the 13th most spoken language in Europe. Not that small; there are many other European languages in use that are much smaller.
https://en.wikipedia.org/wiki/List_of_languages_by_number_of...
> European Portuguese is the 13th most populous language in Europe
that's not impressive
Hello from 23rd
I don't think so. Portugal the country might be small, with a small population, but there are ~250 million Lusophones (native Portuguese speakers), making it the fifth most spoken native language in the world. I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and mutually intelligible, so it's not like text from one cannot be used to train a model for the other, or vice versa.
All in all, I don't think that's a major issue here.
The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).
I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.
Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.
> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
That's easy to say when you're not on the other end of US defaultism.
Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it's trained in a way that ignores Brazilian sources.
The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.
Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more available training data.
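To make that concrete, with the Hugging Face datasets library that kind of ratio is a one-liner. The dataset names below are placeholders I made up:

    from datasets import load_dataset, interleave_datasets

    # Hypothetical corpora, purely for illustration.
    pt_br = load_dataset("example/pt-br-corpus", split="train", streaming=True)
    pt_pt = load_dataset("example/pt-pt-corpus", split="train", streaming=True)

    # Sample roughly 80% Brazilian / 20% European Portuguese for the base
    # training run; post-training would then continue on pt_pt alone.
    mixed = interleave_datasets([pt_br, pt_pt],
                                probabilities=[0.8, 0.2], seed=42)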
What's your evidence for that?
And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?
Evidence? Not so much; I didn't realize I was defending a PhD thesis here.
I speak Spanish, have talked with people who only speak Portuguese (either of the variants), and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So, basically, based on vibes and experience.
> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English
I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons.
> I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences.
I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)
> Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons.
Not only are your reasons not obvious, your conclusion is actually wrong.
If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.
LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).
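As a rough sketch of that second step with transformers + peft (the model and dataset names are placeholders, and whether a small adapter is enough to shift the dialect is exactly the empirical question):

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    # Hypothetical names: any English-heavy base model plus any
    # European Portuguese corpus would do for this sketch.
    name = "example/english-base-7b"
    base = AutoModelForCausalLM.from_pretrained(name)
    tok = AutoTokenizer.from_pretrained(name)
    tok.pad_token = tok.eos_token  # many causal LM tokenizers lack a pad token

    data = load_dataset("example/pt-pt-corpus", split="train")
    data = data.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                    batched=True, remove_columns=data.column_names)

    # A small LoRA adapter moves only a fraction of the weights, so the
    # base model's general (English) knowledge stays largely intact.
    model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32,
                                            task_type="CAUSAL_LM"))

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="pt-pt-adapter",
                               per_device_train_batch_size=4),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()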
Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” so much as “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differs (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best.
I agree, they're not the same. But they're far closer than languages that don't come from the same family.
Wouldn't it be easier to fine-tune a model to convert the Brazilian Portuguese corpus into European Portuguese and then use that corpus?
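Something like this, say, with an instruction-tuned model through transformers (the model name is a placeholder; how faithful such rewrites would be is the open question):

    from transformers import pipeline

    # Placeholder model; any instruction-tuned LLM could be tried here.
    rewrite = pipeline("text-generation", model="example/instruct-model")

    def to_european_portuguese(text: str) -> str:
        prompt = ("Rewrite the following Brazilian Portuguese text in "
                  "European Portuguese, preserving its meaning:\n\n"
                  f"{text}\n\nRewritten:")
        out = rewrite(prompt, max_new_tokens=512, return_full_text=False)
        return out[0]["generated_text"].strip()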
What a waste of time and money.
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the same one the creators speak.
Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.
To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.
Yeah, they all seem confined to being an American-consultant / Chinese-authoritarian split personality with broad second-language capabilities. I suppose they become too incoherent otherwise.
E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish and Finnish.
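For instance, through the ollama Python client, assuming the model has already been pulled locally:

    import ollama  # pip install ollama; needs a local Ollama server running

    response = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user",
                   "content": "Olá! Como está o tempo em Lisboa?"}],
    )
    print(response["message"]["content"])

The replies are grammatical, if shallow; good enough for small talk, not for anything deep.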
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
> makes you miss out on most of the world's knowledge
And who knows what will happen to grammar?
This is how Europe thinks they can catch up on tech, by having the government fund vanity projects which will be made obsolete by more general techniques in 6 months.
Domain-specific models will never be a thing. You don't get generalised intelligence that way.
https://simianwords.bearblog.dev/why-domain-specific-llms-wo...