SentenceTransformer based on Alibaba-NLP/gte-modernbert-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-modernbert-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-modernbert-base
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
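
For reference, the same two-module stack (a ModernBERT encoder followed by CLS-token pooling) can be assembled by hand with the library's building blocks. This is a minimal sketch based on the settings listed above, not code taken from this repository:

from sentence_transformers import SentenceTransformer, models

# Transformer module wrapping the ModernBERT backbone, truncating inputs at 128 tokens
transformer = models.Transformer("Alibaba-NLP/gte-modernbert-base", max_seq_length=128)

# CLS-token pooling, matching pooling_mode_cls_token=True in the architecture above
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")

model = SentenceTransformer(modules=[transformer, pooling])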

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("redis/model-a-baseline")
# Run inference
sentences = [
    'How do you earn money on Quora?',
    'What is the best way to make money on Quora?',
    'What are some things new employees should know going into their first day at Maximus?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.9926, -0.0087],
#         [ 0.9926,  1.0000, -0.0135],
#         [-0.0087, -0.0135,  1.0000]])
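
Beyond pairwise similarity, the same API covers simple semantic search. The snippet below is an illustrative sketch (the corpus strings are reused from the dataset samples further down and are not part of the generated card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("redis/model-a-baseline")

corpus = [
    "How do you earn money on Quora?",
    "What are real tips to improve work life balance?",
    "How can guys last longer during sex?",
]
query = "What is the best way to make money on Quora?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# Cosine similarities between the query and every corpus entry, shape [1, 3]
scores = model.similarity(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best], scores[0, best].item())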

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.8297
cosine_accuracy@3 0.903
cosine_accuracy@5 0.9308
cosine_precision@1 0.8297
cosine_precision@3 0.301
cosine_precision@5 0.1862
cosine_recall@1 0.8297
cosine_recall@3 0.903
cosine_recall@5 0.9308
cosine_ndcg@10 0.8942
cosine_mrr@1 0.8297
cosine_mrr@5 0.8688
cosine_mrr@10 0.8729
cosine_map@100 0.8751
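
Metrics of this form are typically produced with the library's InformationRetrievalEvaluator. The evaluation data behind this table is not included in the card, so the query/corpus dictionaries below are placeholders that only illustrate the call pattern:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("redis/model-a-baseline")

# Placeholder data: query_id -> text, doc_id -> text, query_id -> set of relevant doc_ids
queries = {"q1": "How do you earn money on Quora?"}
corpus = {
    "d1": "What is the best way to make money on Quora?",
    "d2": "What are the best ways to create a work life balance?",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries=queries, corpus=corpus, relevant_docs=relevant_docs, name="val")
results = evaluator(model)
print(results)  # contains keys such as "val_cosine_ndcg@10"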

Training Details

Training Dataset

Unnamed Dataset

  • Size: 359,997 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 4 tokens, mean 15.4 tokens, max 47 tokens
    • positive (string): min 4 tokens, mean 15.47 tokens, max 47 tokens
    • negative (string): min 5 tokens, mean 16.9 tokens, max 125 tokens
  • Samples:
    • anchor: Shall I upgrade my iPhone 5s to iOS 10 final version?
      positive: Should I upgrade an iPhone 5s to iOS 10?
      negative: Whether extension of CA-articleship is to be served at same firm/company?
    • anchor: Is Donald Trump really going to be the president of United States?
      positive: Do you think Donald Trump could conceivably be the next President of the United States?
      negative: Since solid carbon dioxide is dry ice and incredibly cold, why doesn't it have an effect on global warming?
    • anchor: What are real tips to improve work life balance?
      positive: What are the best ways to create a work life balance?
      negative: How do you open a briefcase combination lock without the combination?
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 7.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
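
    A minimal, hedged sketch of constructing this loss with the scale and similarity function listed above (the gather_across_devices flag is not passed here):

    from sentence_transformers import SentenceTransformer, util
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

    # In-batch negatives loss with cosine similarity scaled by 7.0, as in the parameters above
    loss = MultipleNegativesRankingLoss(model, scale=7.0, similarity_fct=util.cos_sim)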
    

Evaluation Dataset

Unnamed Dataset

  • Size: 40,000 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 6 tokens, mean 15.68 tokens, max 72 tokens
    • positive (string): min 6 tokens, mean 15.75 tokens, max 72 tokens
    • negative (string): min 6 tokens, mean 16.95 tokens, max 78 tokens
  • Samples:
    • anchor: Why were feathered dinosaur fossils only found in the last 20 years?
      positive: Why were feathered dinosaur fossils only found in the last 20 years?
      negative: Why are only few people aware that many dinosaurs had feathers?
    • anchor: If FOX News is the conservative news station, which cable news network is for liberals/progressives?
      positive: If FOX News is the conservative news station, which cable news network is for liberals/progressives?
      negative: How much did Fox News and conservative leaning media networks stoke the anger that contributed to Donald Trump's popularity?
    • anchor: How can guys last longer during sex?
      positive: How do I last longer in sex?
      negative: What is a permanent solution for rough and puffy hair?
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 7.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • weight_decay: 0.0001
  • max_steps: 5000
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_drop_last: True
  • dataloader_num_workers: 1
  • dataloader_prefetch_factor: 1
  • load_best_model_at_end: True
  • optim: adamw_torch
  • ddp_find_unused_parameters: False
  • push_to_hub: True
  • hub_model_id: redis/model-a-baseline
  • eval_on_start: True
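
These non-default values map directly onto the Sentence Transformers trainer API. A hedged sketch of the corresponding training arguments follows; the output_dir is an assumption, while the remaining values are taken from the list above:

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="model-a-baseline",  # assumed; not stated in the card
    eval_strategy="steps",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    weight_decay=0.0001,
    max_steps=5000,
    warmup_ratio=0.1,
    fp16=True,
    dataloader_drop_last=True,
    dataloader_num_workers=1,
    dataloader_prefetch_factor=1,
    load_best_model_at_end=True,
    optim="adamw_torch",
    ddp_find_unused_parameters=False,
    push_to_hub=True,
    hub_model_id="redis/model-a-baseline",
    eval_on_start=True,
)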

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0001
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 5000
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 1
  • dataloader_prefetch_factor: 1
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: False
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: True
  • resume_from_checkpoint: None
  • hub_model_id: redis/model-a-baseline
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss val_cosine_ndcg@10
0 0 - 2.1886 0.8939
0.0889 250 0.9461 0.4116 0.8933
0.1778 500 0.3963 0.3836 0.8933
0.2667 750 0.3776 0.3710 0.8936
0.3556 1000 0.3677 0.3640 0.8934
0.4445 1250 0.3581 0.3581 0.8932
0.5334 1500 0.3577 0.3543 0.8936
0.6223 1750 0.3521 0.3512 0.8939
0.7112 2000 0.3488 0.3485 0.8944
0.8001 2250 0.3464 0.3463 0.8942
0.8890 2500 0.3461 0.3439 0.8948
0.9780 2750 0.3445 0.3428 0.8950
1.0669 3000 0.3279 0.3421 0.8936
1.1558 3250 0.3233 0.3416 0.8937
1.2447 3500 0.3221 0.3406 0.8937
1.3336 3750 0.3219 0.3397 0.8939
1.4225 4000 0.3195 0.3391 0.8938
1.5114 4250 0.32 0.3386 0.8942
1.6003 4500 0.3209 0.3384 0.8943
1.6892 4750 0.3192 0.3381 0.8941
1.7781 5000 0.3203 0.3379 0.8942

Framework Versions

  • Python: 3.10.18
  • Sentence Transformers: 5.2.0
  • Transformers: 4.57.3
  • PyTorch: 2.9.1+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.4.2
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}