This model was trained with [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) (arXiv:2205.13147).
This is a sentence-transformers model finetuned from intfloat/multilingual-e5-large-instruct. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Full model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
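For illustration, the three modules above (transformer encoding, mean pooling over non-padding tokens, L2 normalization) can be reproduced with the plain `transformers` API. This is a minimal sketch, assuming the repository hosts the usual `transformers`-compatible weights; the `embed` helper is hypothetical.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FareedKhan/just_for_testing_model")
encoder = AutoModel.from_pretrained("FareedKhan/just_for_testing_model")

def embed(texts):
    # (0) Transformer: tokenize and encode, truncating to max_seq_length=512
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 1024)
    # (1) Pooling: mean over non-padding tokens (pooling_mode_mean_tokens=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # (2) Normalize: L2-normalize so dot product equals cosine similarity
    return F.normalize(pooled, p=2, dim=1)
```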
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FareedKhan/just_for_testing_model")
# Run inference
sentences = [
    '\n\nThe gene in question appears to have a multifaceted role and involvement in various biological processes, diseases, and anatomical structures, with implications for both physiology and pathology. Here is a summary of its characteristics:\n\n### Function and Interactions\n- **Name**: mTORC1, a component of the mammalian target of rapamycin complex 1.\n- **Role**: Involved in regulation of membrane potential',
    'Identify genes or proteins that interact with KCNMB1 and share an associated phenotype or effect.',
    'Which solid-state medications specifically engage with the METAP2 gene/protein through direct interaction?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
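The same embeddings support semantic search over a corpus. Below is a minimal sketch using the `util.semantic_search` helper from Sentence Transformers; the corpus and query strings are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FareedKhan/just_for_testing_model")

# Hypothetical corpus and query, for illustration only
corpus = [
    "KCNMB1 encodes a regulatory beta subunit of large-conductance calcium-activated potassium channels.",
    "METAP2 encodes methionyl aminopeptidase 2, which removes N-terminal methionine from nascent proteins.",
]
query = "Which gene encodes a potassium channel subunit?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```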
Information Retrieval (dataset `dim_768`, evaluated with `InformationRetrievalEvaluator`):

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.401 |
| cosine_accuracy@3 | 0.4604 |
| cosine_accuracy@5 | 0.4901 |
| cosine_accuracy@10 | 0.5446 |
| cosine_precision@1 | 0.401 |
| cosine_precision@3 | 0.1535 |
| cosine_precision@5 | 0.098 |
| cosine_precision@10 | 0.0545 |
| cosine_recall@1 | 0.401 |
| cosine_recall@3 | 0.4604 |
| cosine_recall@5 | 0.4901 |
| cosine_recall@10 | 0.5446 |
| cosine_ndcg@10 | 0.465 |
| cosine_mrr@10 | 0.4406 |
| cosine_map@100 | 0.4488 |
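These figures come from the `InformationRetrievalEvaluator`. As a rough sketch of how such an evaluation is wired up (the query/corpus data below is a toy stand-in; the actual evaluation split is not included in this card):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("FareedKhan/just_for_testing_model")

# Toy stand-ins for the real evaluation split
queries = {"q1": "Which gene encodes a potassium channel subunit?"}
corpus = {
    "d1": "KCNMB1 encodes a regulatory beta subunit of calcium-activated potassium channels.",
    "d2": "METAP2 encodes methionyl aminopeptidase 2.",
}
relevant_docs = {"q1": {"d1"}}  # maps each query id to the ids of its relevant documents

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_768")
metrics = evaluator(model)  # computes accuracy@k, precision@k, recall@k, NDCG, MRR, MAP
print(metrics)
```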
Training dataset: `positive` and `anchor` text pairs.

| | positive | anchor |
|---|---|---|
| type | string | string |

Sample queries from the dataset:

- Could you suggest some effective medications for acute diarrhea?
- Which gene or protein is consistently not expressed in the mucosal tissues of the mouth and the small intestine?
- Which genes or proteins exhibit interactions with HNRNPU, share an association with its related disease(s), and participate in the peroxisomal beta-oxidation process of fatty acid metabolism?
MatryoshkaLoss with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768
    ],
    "matryoshka_weights": [
        1
    ],
    "n_dims_per_step": -1
}
```
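In code, this configuration corresponds roughly to the following loss construction (a sketch, not the exact training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# In-batch-negatives ranking loss, wrapped so it is also applied to the
# truncated Matryoshka prefix of each embedding (here a single dim, 768).
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768],
    matryoshka_weights=[1],
    n_dims_per_step=-1,  # use all listed dimensions at every step
)
```

Because training included a 768-dimensional objective, embeddings can also be truncated at load time, e.g. `SentenceTransformer("FareedKhan/just_for_testing_model", truncate_dim=768)`.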
Non-default hyperparameters:

- eval_strategy: epoch
- per_device_train_batch_size: 2
- learning_rate: 1e-05
- num_train_epochs: 2
- warmup_ratio: 0.1
- bf16: True
- tf32: False
- load_best_model_at_end: True

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 2
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 1e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: False
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional

Training logs (`dim_768_cosine_map@100` was evaluated before training and at the end of each epoch; a minimal training sketch follows the table):

| Epoch | Step | Training Loss | dim_768_cosine_map@100 |
|---|---|---|---|
| 0 | 0 | - | 0.2774 |
| 0.0220 | 10 | 0.7928 | - |
| 0.0441 | 20 | 0.7435 | - |
| 0.0661 | 30 | 0.6181 | - |
| 0.0881 | 40 | 0.5851 | - |
| 0.1101 | 50 | 0.4896 | - |
| 0.1322 | 60 | 0.5216 | - |
| 0.1542 | 70 | 0.3562 | - |
| 0.1762 | 80 | 0.4002 | - |
| 0.1982 | 90 | 0.286 | - |
| 0.2203 | 100 | 0.3835 | - |
| 0.2423 | 110 | 0.3237 | - |
| 0.2643 | 120 | 0.5041 | - |
| 0.2863 | 130 | 0.4061 | - |
| 0.3084 | 140 | 0.3758 | - |
| 0.3304 | 150 | 0.4442 | - |
| 0.3524 | 160 | 0.3714 | - |
| 0.3744 | 170 | 0.4349 | - |
| 0.3965 | 180 | 0.3492 | - |
| 0.4185 | 190 | 0.1045 | - |
| 0.4405 | 200 | 0.2965 | - |
| 0.4626 | 210 | 0.1913 | - |
| 0.4846 | 220 | 0.4259 | - |
| 0.5066 | 230 | 0.4671 | - |
| 0.5286 | 240 | 0.4812 | - |
| 0.5507 | 250 | 0.2442 | - |
| 0.5727 | 260 | 0.157 | - |
| 0.5947 | 270 | 0.4386 | - |
| 0.6167 | 280 | 0.0979 | - |
| 0.6388 | 290 | 0.7879 | - |
| 0.6608 | 300 | 0.073 | - |
| 0.6828 | 310 | 0.252 | - |
| 0.7048 | 320 | 0.3913 | - |
| 0.7269 | 330 | 0.1331 | - |
| 0.7489 | 340 | 0.1311 | - |
| 0.7709 | 350 | 0.3487 | - |
| 0.7930 | 360 | 0.2204 | - |
| 0.8150 | 370 | 0.1718 | - |
| 0.8370 | 380 | 0.4277 | - |
| 0.8590 | 390 | 0.4798 | - |
| 0.8811 | 400 | 0.1381 | - |
| 0.9031 | 410 | 0.4986 | - |
| 0.9251 | 420 | 0.2379 | - |
| 0.9471 | 430 | 0.2717 | - |
| 0.9692 | 440 | 0.5997 | - |
| 0.9912 | 450 | 0.2738 | - |
| 1.0 | 454 | - | 0.4476 |
| 1.0132 | 460 | 0.0649 | - |
| 1.0352 | 470 | 0.1113 | - |
| 1.0573 | 480 | 0.0916 | - |
| 1.0793 | 490 | 0.0866 | - |
| 1.1013 | 500 | 0.1341 | - |
| 1.1233 | 510 | 0.1591 | - |
| 1.1454 | 520 | 0.0737 | - |
| 1.1674 | 530 | 0.2395 | - |
| 1.1894 | 540 | 0.051 | - |
| 1.2115 | 550 | 0.1838 | - |
| 1.2335 | 560 | 0.0741 | - |
| 1.2555 | 570 | 0.2529 | - |
| 1.2775 | 580 | 0.1624 | - |
| 1.2996 | 590 | 0.1957 | - |
| 1.3216 | 600 | 0.1015 | - |
| 1.3436 | 610 | 0.056 | - |
| 1.3656 | 620 | 0.0592 | - |
| 1.3877 | 630 | 0.2027 | - |
| 1.4097 | 640 | 0.0874 | - |
| 1.4317 | 650 | 0.144 | - |
| 1.4537 | 660 | 0.2371 | - |
| 1.4758 | 670 | 0.083 | - |
| 1.4978 | 680 | 0.1608 | - |
| 1.5198 | 690 | 0.1924 | - |
| 1.5419 | 700 | 0.1765 | - |
| 1.5639 | 710 | 0.0068 | - |
| 1.5859 | 720 | 0.1316 | - |
| 1.6079 | 730 | 0.1538 | - |
| 1.6300 | 740 | 0.1136 | - |
| 1.6520 | 750 | 0.1216 | - |
| 1.6740 | 760 | 0.2417 | - |
| 1.6960 | 770 | 0.1868 | - |
| 1.7181 | 780 | 0.2164 | - |
| 1.7401 | 790 | 0.1186 | - |
| 1.7621 | 800 | 0.0155 | - |
| 1.7841 | 810 | 0.033 | - |
| 1.8062 | 820 | 0.024 | - |
| 1.8282 | 830 | 0.2094 | - |
| 1.8502 | 840 | 0.0761 | - |
| 1.8722 | 850 | 0.0876 | - |
| 1.8943 | 860 | 0.308 | - |
| 1.9163 | 870 | 0.0557 | - |
| 1.9383 | 880 | 0.2808 | - |
| 1.9604 | 890 | 0.0886 | - |
| 1.9824 | 900 | 0.2489 | - |
| 2.0 | 908 | - | 0.4488 |
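Putting the pieces together, a minimal training sketch with the non-default hyperparameters above might look as follows. The dataset rows are hypothetical placeholders, `output_dir` is an assumed path, and `save_strategy="epoch"` plus the `eval_dataset` are added because `load_best_model_at_end=True` and `eval_strategy="epoch"` require them.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Hypothetical stand-ins for the real (positive, anchor) pairs
train_dataset = Dataset.from_dict({
    "positive": ["A long generated passage describing a gene/protein..."],
    "anchor": ["Which genes or proteins exhibit interactions with HNRNPU?"],
})
eval_dataset = Dataset.from_dict({
    "positive": ["Another long generated passage..."],
    "anchor": ["Could you suggest some effective medications for acute diarrhea?"],
})

loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="output",     # assumed path
    num_train_epochs=2,
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    bf16=True,
    eval_strategy="epoch",
    save_strategy="epoch",   # assumption: must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```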
Sentence Transformers:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

MatryoshkaLoss:

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

MultipleNegativesRankingLoss:

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Base model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
Another inference example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("FareedKhan/intfloat_multilingual-e5-large-instruct_FareedKhan_prime_synthetic_data_2k_2_4")
sentences = [
    "\n\nThe gene in question appears to be involved in multiple cellular processes, many of which are central to neuronal function and health, especially in the context of neurodegenerative diseases. Here's a brief overview of its functions and context:\n\n### Key Functions:\n1. **Transcription Regulation**: Involved in RNA polymerase II transcription and regulation of gene expression.\n2. **Protein Processing**: Positive regulation of proteasomal ubiquitin-dependent protein catabolic process, indicating it might play a role in the degradation and recycling of proteins.\n3. **Cellular Stress Response**: Regulation of positive transcription by p53 (a known DNA damage response gene), positive regulation of I-kappaB kinase/NF-kappaB signaling (involved in inflammatory response), and negative regulation of cell death under oxidative stress.\n4. **Cellular Repair and Maintenance**: Autophagy of mitochondria (self-eating of organelles to clear damaged components), regulated the negative regulation of intrinsic apoptotic signaling pathways, facilitating cell survival rather than death.\n5. **Neurotransmitter and Ion Handling**: Involvement in dopamine secretion, response to manganese ion, and within synaptic transmission processes.\n6. **Metabolic Activities**: Influences glucose metabolism by regulation of glucokinase activity.\n\n### Context Specific:\n- **Manganese Exposure**: This gene's role in transcriptional regulation is particularly implicated in the context of manganese exposure. Manganese can be neurotoxic, particularly affecting the nervous system. Its regulation might help in the cellular response to manganese toxicity, including signaling pathways that",
    "Identify genes or proteins that interact with CLDN11 and are also implicated in the same medical condition.",
    "Search for ailments that have no drugs indicated for treatment and have a connection to Dermatographic urticaria.",
    "Is there an interaction between the parkin RBR E3 ubiquitin protein ligase and the DNA-damage-inducible transcript 4 (DDIT4), and if so, what biological effects or phenotypes have been associated with this interaction?",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```